What is the best collation to use for MySQL with PHP? ~ Tech Blog

Friday, 2 November 2018

What is the best collation to use for MySQL with PHP?

I'm wondering if there is a "best" choice for collation in MySQL for a general website where you aren't 100% sure of what will be entered? I understand that all the encodings should be the same, such as MySQL, Apache, the HTML and anything inside PHP.

In the past I have set PHP to output in "UTF-8", but which collation does this match in MySQL? I'm thinking it's one of the UTF-8 ones, but I have used utf8_unicode_ci, utf8_general_ci, and utf8_bin before.

Answers

The main difference is sorting accuracy (when comparing characters in the language) and performance. The only special one is utf8_bin which is for comparing characters in binary format.

utf8_general_ci is somewhat faster than utf8_unicode_ci, but less accurate (for sorting). The specific language utf8 encoding (such as utf8_swedish_ci) contain additional language rules that make them the most accurate to sort for those languages. Most of the time I use utf8_unicode_ci (I prefer accuracy to small performance improvements), unless I have a good reason to prefer a specific language.

You can read more on specific unicode character sets on the MySQL manual - http://dev.mysql.com/doc/refman/5.0/en/charset-unicode-sets.html

Actually, you probably want to use utf8_unicode_ci or utf8_general_ci.

utf8_general_ci sorts by stripping away all accents and sorting as if it were ASCII
utf8_unicode_ci uses the Unicode sort order, so it sorts correctly in more languages

However, if you are only using this to store English text, these shouldn't differ.

Collations affect how data is sorted and how strings are compared to each other. That means you should use the collation that most of your users expect.

Example from the documentation:

utf8_general_ci also is satisfactory for both German and French, except that ‘ß’ is equal to ‘s’, and not to ‘ss’. If this is acceptable for your application, then you should use utf8_general_ci because it is faster. Otherwise, use utf8_unicode_ci because it is more accurate.

So - it depends on your expected user base and on how much you need correct sorting. For an English user base, utf8_general_ci should suffice, for other languages, like Swedish, special collations have been created.

For UTF-8 textual information, you should use utf8_general_ci because...

utf8_bin: compare strings by the binary value of each character in the string
utf8_general_ci: compare strings using general language rules and using case-insensitive comparisons

a.k.a. it will should making searching and indexing the data faster/more efficient/more useful.

For the case highlighted by Guus, I would strongly suggest using either utf8_unicode_cs (case sensitive, strict matching, ordering correctly for the most part) instead of utf8_bin (strict matching, incorrect ordering).

If the field is intended to be searched, as opposed to matched for a user, then use utf8_general_ci or utf8_unicode_ci. Both are case-insensitive, one will losely match (‘ß’ is equal to ‘s’, and not to ‘ss’). There are also language specific versions, like utf8_german_ci where the lose matching is more suitable for the language specified.

[Edit - nearly 6 years later]

I no longer recommend the "utf8" character set on MySQL, and instead recommend the "utf8mb4" character set. They match almost entirely, but allow for a little (lot) more unicode characters.

Realistically, MySQL should have updated the "utf8" character set and respective collations to match the "utf8" specification, but instead, a separate character set and respective collations as to not impact storage designation for those already using their incomplete "utf8" character set.

In your database upload file, add the followin line before any line:

SET NAMES utf8;

And your problem should be solved.

Tech Blog

Friday, 2 November 2018