Tuesday, 30 July 2019

What is Collation in Databases?

In database systems, Collation specifies how data is sorted and compared in a database. Collation provides the sorting rules, case, and accent sensitivity properties for the data in the database.
For example, when you run a query using the ORDER BY clause, collation determines whether or not uppercase letters and lowercase letters are treated the same.
Collation is also used to determine how accents are treated, as well as character width and Japanese kana characters. Collation can also be used to distinguish between various ideographic variation selectors in certain collations (such as the Japanese_Bushu_Kakusu_140 and Japanese_XJIS_140 collations that were introduced in SQL Server 2017).
Different database management systems will provide different collation options. Depending on the DBMS, collation can be specified at the server level, the database level, the table level, and the column level. Collations can also be specified at the expression level (so you can specify which collation to use when you run a query), and at the identifier level.

Example

Below is a comparison between running a case-sensitive query vs a case-insensitive query in SQL Server.

Case-insensitive

USE Music;
SELECT ArtistId, ArtistName
FROM Artists
ORDER BY ArtistName COLLATE Latin1_General_CI_AI;
This query uses the ORDER BY clause to order the results, and it uses the COLLATE clause to specify the collation to be used when ordering those results. In this case, we use the Latin1_General_CI_AI collation. Note that this collation has the letters CI, which stands for case-insensitive.
The above query returns the following result:
ArtistId  ArtistName            
--------  ----------------------
1         Aardvark              
2         AArdvark              
3         aArdvark              
4         aardvark  

Case-sensitive

However, if we change the collation to case-sensitive by using the Latin1_General_CS_AIcollation, we get the following result:
ArtistId  ArtistName            
--------  ----------------------
4         aardvark              
3         aArdvark              
1         Aardvark              
2         AArdvark  

Collation Options

As mentioned, case-sensitivity is just one of various options when specifying collation. The options associated with a collation are case sensitivity, accent sensitivity, Kana-sensitivity, width sensitivity, variation-selector-sensitivity.
Collations are expressed in different ways, depending on the DBMS that you use. In SQL Server, collation options are specified by appending the options to the collation name. So for example, Latin1_General_CI_AS is case-insensitive and accent-sensitive.
Below is a quick overview of the options:
OptionExplanation
Case-sensitive (_CS)Lowercase letters sort ahead of their uppercase versions.
Case-insensitive (_CI)Uppercase and lowercase versions of letters are considered identical for sorting purposes.
Accent-sensitive (_AS)Distinguishes between accented and unaccented characters. For example, a is not equal to .
Accent-insensitive (_AI)Accented and unaccented versions of letters are considered identical for sorting purposes. For example, a is equal to .
Kana-sensitive (_KS)Distinguishes between the Japanese kana characters Hiragana and Katakana. If _KS is omitted, then it is kana-insensitive, which means that Hiragana and Katakana characters are considered equal for sorting purposes.
Width-sensitive (_WS)Distinguishes between full-width and half-width characters. If _WS is omitted, then it is width-insensitive, which means that full-width and half-width representations of the same character are considered identical for sorting purposes.
Variation-selector-sensitive (_VSS)Distinguishes between various ideographic variation selectors in Japanese collations Japanese_Bushu_Kakusu_140 and Japanese_XJIS_140. These collations were introduced in SQL Server 2017. If _VSS is not selected, the collation is variation selector insensitive, and the variation selector is not considered in the comparison.
You should be able to see a list of available collations for your database by running a query. The syntax will depend on your database management system.  The list of available collations is usually quite large, due to the fact that each collation can have many different permutations of the above options.
For example, below is a list of the various Latin1 General collations available in SQL Server 2017:
Latin1_General_BINLatin1-General, binary sort
Latin1_General_BIN2Latin1-General, binary code point comparison sort
Latin1_General_CI_AILatin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Latin1_General_CI_AI_WSLatin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Latin1_General_CI_AI_KSLatin1-General, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Latin1_General_CI_AI_KS_WSLatin1-General, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Latin1_General_CI_ASLatin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Latin1_General_CI_AS_WSLatin1-General, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Latin1_General_CI_AS_KSLatin1-General, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Latin1_General_CI_AS_KS_WSLatin1-General, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Latin1_General_CS_AILatin1-General, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Latin1_General_CS_AI_WSLatin1-General, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Latin1_General_CS_AI_KSLatin1-General, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Latin1_General_CS_AI_KS_WSLatin1-General, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Latin1_General_CS_ASLatin1-General, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Latin1_General_CS_AS_WSLatin1-General, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Latin1_General_CS_AS_KSLatin1-General, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Latin1_General_CS_AS_KS_WSLatin1-General, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Latin1_General_100_BINLatin1-General-100, binary sort
Latin1_General_100_BIN2Latin1-General-100, binary code point comparison sort
Latin1_General_100_CI_AILatin1-General-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Latin1_General_100_CI_AI_WSLatin1-General-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Latin1_General_100_CI_AI_KSLatin1-General-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Latin1_General_100_CI_AI_KS_WSLatin1-General-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Latin1_General_100_CI_ASLatin1-General-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Latin1_General_100_CI_AS_WSLatin1-General-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Latin1_General_100_CI_AS_KSLatin1-General-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Latin1_General_100_CI_AS_KS_WSLatin1-General-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Latin1_General_100_CS_AILatin1-General-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Latin1_General_100_CS_AI_WSLatin1-General-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Latin1_General_100_CS_AI_KSLatin1-General-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Latin1_General_100_CS_AI_KS_WSLatin1-General-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Latin1_General_100_CS_ASLatin1-General-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Latin1_General_100_CS_AS_WSLatin1-General-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Latin1_General_100_CS_AS_KSLatin1-General-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Latin1_General_100_CS_AS_KS_WSLatin1-General-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Latin1_General_100_CI_AI_SCLatin1-General-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive, supplementary characters
Latin1_General_100_CI_AI_WS_SCLatin1-General-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive, supplementary characters
Latin1_General_100_CI_AI_KS_SCLatin1-General-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive, supplementary characters
Latin1_General_100_CI_AI_KS_WS_SCLatin1-General-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive, supplementary characters
Latin1_General_100_CI_AS_SCLatin1-General-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive, supplementary characters
Latin1_General_100_CI_AS_WS_SCLatin1-General-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive, supplementary characters
Latin1_General_100_CI_AS_KS_SCLatin1-General-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive, supplementary characters
Latin1_General_100_CI_AS_KS_WS_SCLatin1-General-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive, supplementary characters
Latin1_General_100_CS_AI_SCLatin1-General-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive, supplementary characters
Latin1_General_100_CS_AI_WS_SCLatin1-General-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive, supplementary characters
Latin1_General_100_CS_AI_KS_SCLatin1-General-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive, supplementary characters
Latin1_General_100_CS_AI_KS_WS_SCLatin1-General-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive, supplementary characters
Latin1_General_100_CS_AS_SCLatin1-General-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive, supplementary characters
Latin1_General_100_CS_AS_WS_SCLatin1-General-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive, supplementary characters
Latin1_General_100_CS_AS_KS_SCLatin1-General-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive, supplementary characters
Latin1_General_100_CS_AS_KS_WS_SCLatin1-General-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive, supplementary characters
As you can see, none of the collations in the above list include the _VSS option, because that option is only applicable to certain Japanese collations. SQL Server 2017 includes 32 collations with this option (such as Japanese_Bushu_Kakusu_140_CI_AI_KS_WS_VSS for example).

0 comments:

Post a Comment