Wednesday, 14 November 2018

Mysql: What are the diffrences between utf8_general_ci and utf8_unicode_ci?


I've got two options for unicode that look promising for a mysql database.
utf8_general_ci unicode (multilingual), case-insensitive
utf8_unicode_ci unicode (multilingual), case-insensitive
Can you please explain what is the difference between utf8_general_ci and utf8_unicode_ci? 
What are the effects of choosing one over the other when designing a database?

 Answers


utf8_general_ci is a very simple — and on Unicode, very broken — collation, 
one that gives incorrect results on general Unicode text. What it does is:
  • converts to Unicode normalization form D for canonical decomposition
  • removes any combining characters
  • converts to upper case
This does not work correctly on Unicode, because it does not understand Unicode casing. 
Unicode casing alone is much more complicated than an ASCII-minded approach can 
handle. For example:
  • The lowercase of “ẞ” is “ß”, but the uppercase of “ß” is “SS”.
  • There are two lowercase Greek sigmas, but only one uppercase one; 
  • consider “Σίσυφος”.
  • Letters like “ø” do not decompose to an “o” plus a diacritic, meaning
  • that it won’t correctly sort.
There are many other subtleties.
  1. utf8_unicode_ci uses the standard Unicode Collation Algorithm
  2. supports so called expansions and ligatures, for example: German letter ß
    (U+00DF LETTER SHARP S) is sorted near "ss" Letter Œ (U+0152 LATIN CAPITAL
    LIGATURE OE) is sorted near "OE".
utf8_general_ci does not support expansions/ligatures, it sorts all these letters as 
single characters, and sometimes in a wrong order.
  1. utf8_unicode_ci is generally more accurate for all scripts. For example, 
  2. on Cyrillic block: utf8_unicode_ci is fine for all these languages: Russian,
    Bulgarian, Belarusian, Macedonian, Serbian, and Ukrainian. While utf8_general_ci
    is fine only for Russian and Bulgarian subset of Cyrillic. Extra letters used in
    Belarusian, Macedonian, Serbian, and Ukrainian are sorted not well.
The cost of utf8_unicode_ci is that it is a little bit slower than utf8_general_ci.
 But that’s the price you pay for correctness. Either you can have a fast answer that’s wrong,
 or a very slightly slower answer that’s right. Your choice. It is very difficult to ever justify 
giving wrong answers, so it’s best to assume that utf8_general_ci doesn’t exist and to 
always use utf8_unicode_ci. Well, unless you want wrong answers.

0 comments:

Post a Comment