Tuesday, 17 July 2018

Saving Emoticons/Unicode from Twitter to a MySQL Database


Over the last couple of days I’ve been developing a little app that imports Twitter feeds and saves tweets to a MySQL database; however, I quickly hit an issue where certain tweets would not save. MySQL was returning the error message ‘SQLSTATE[HY000]: General error: 1366 Incorrect string value’. Looking at the specific tweets that caused the database to throw the error, I noticed that they contained emoticons.

The Problem

I was using the utf8 character set in MySQL, which I thought (wrongly) could represent all Unicode characters.
The code points for emoticons are fully supported by the UTF-8 encoding; however, MySQL’s utf8 does not support them! To save emoticons to a MySQL database we need to use utf8mb4.
The difference between MySQL’s utf8 and utf8mb4 is that the former stores at most 3 bytes per character, whereas the latter stores up to 4. Therefore with utf8 we can only store Unicode characters from the Basic Multilingual Plane. Put more simply, utf8 is suitable for characters from the majority of modern languages and some symbols. Emoticon characters live in the Supplementary Multilingual Plane, whose characters take 4 bytes in UTF-8, so for them we need to use utf8mb4.
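To see the difference concretely, here is a small Python sketch (not part of the original app; the function names utf8_length and needs_utf8mb4 are made up for illustration). It reports how many bytes a character takes in UTF-8 and flags text containing characters outside the Basic Multilingual Plane, which is the kind of text MySQL’s utf8 rejects:-

```python
def utf8_length(char: str) -> int:
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(char.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """Return True if text contains a code point above U+FFFF.

    Code points above U+FFFF lie outside the Basic Multilingual Plane and
    take 4 bytes in UTF-8, which MySQL's 3-byte utf8 cannot store.
    """
    return any(ord(char) > 0xFFFF for char in text)


if __name__ == "__main__":
    # A Latin letter, an accented letter, the euro sign and an emoticon.
    for char in ["a", "\u00e9", "\u20ac", "\U0001F600"]:
        print(f"U+{ord(char):04X} takes {utf8_length(char)} byte(s) in UTF-8")

    print(needs_utf8mb4("Hello world"))
    print(needs_utf8mb4("Hello \U0001F600"))
```

A check like needs_utf8mb4 could be used to find which incoming tweets would trigger error 1366 on a utf8 column, although the real fix is still to switch the column to utf8mb4.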

The Fix

In order to save emoticons and other Unicode characters not supported by MySQL’s utf8 we need to change the character set and collation properties of our database, tables and columns. As long as our database is currently using utf8 there should be no risk of data loss as utf8mb4 is fully backwards compatible with it. However, it is still a good idea to make a backup of the database before running the following commands.
Let’s first set the connection encoding to utf8mb4 and then change the database’s default character set and collation to match (this only affects tables created afterwards; existing tables are converted in the next step):-
SET NAMES utf8mb4;
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
(Replacing database_name with the name of the database.)
Next we need to convert the relevant tables to utf8mb4/utf8mb4_unicode_ci (you will need to run this for each table):-
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
(Replacing table_name with the name of the database table.)
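If there are a lot of tables, typing the command out for each one gets tedious. As a rough sketch (the helper name convert_statements and the example table names are hypothetical, not from my app), a few lines of Python can generate the statements ready to paste into the MySQL client:-

```python
def convert_statements(table_names,
                       charset="utf8mb4",
                       collation="utf8mb4_unicode_ci"):
    """Build one ALTER TABLE ... CONVERT TO statement per table name."""
    statements = []
    for name in table_names:
        # Quote the identifier with backticks, doubling any embedded backtick.
        quoted = "`" + name.replace("`", "``") + "`"
        statements.append(
            f"ALTER TABLE {quoted} CONVERT TO "
            f"CHARACTER SET {charset} COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical table names; replace with the tables in your database.
    for statement in convert_statements(["tweets", "users"]):
        print(statement)
```

This only builds the SQL text; it does not connect to the database, so you can review the statements before running them.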
Finally we need to update the character set and collation for the column(s):-
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(140) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;
Read that last command carefully, as it will need modifying to suit your column: keep the column’s existing type, length and NULL/NOT NULL setting. The important bit to note is the CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci. (As with the other commands you need to replace table_name and column_name appropriately.)
Bear in mind that changing from utf8 to utf8mb4 affects the maximum lengths of columns and index keys. The declared lengths of columns are unchanged when modifying the tables with the commands discussed here, but limits that are measured in bytes (such as the maximum index key length, or the capacity of the TEXT types) are now reached after fewer characters. This is a consequence of each character needing up to 4 bytes of storage instead of 3! You may need to update these lengths if this is relevant to you.
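The effect on index keys is simple arithmetic, sketched below in Python. The byte limits are assumptions about your setup: 767 bytes is the InnoDB index key prefix limit for the older COMPACT and REDUNDANT row formats, and 3072 bytes applies to the DYNAMIC and COMPRESSED row formats, so check which one your tables use. The function name max_indexable_chars is made up for illustration:-

```python
def max_indexable_chars(key_limit_bytes: int, bytes_per_char: int) -> int:
    """Return how many whole characters fit in an index key of the given size.

    MySQL reserves the maximum bytes per character for the character set,
    so the limit is the byte budget divided by that maximum, rounded down.
    """
    return key_limit_bytes // bytes_per_char


if __name__ == "__main__":
    # Assumed limits: 767 bytes (older row formats), 3072 bytes (newer ones).
    for limit in (767, 3072):
        utf8_chars = max_indexable_chars(limit, 3)     # utf8: up to 3 bytes
        utf8mb4_chars = max_indexable_chars(limit, 4)  # utf8mb4: up to 4 bytes
        print(f"{limit}-byte key: {utf8_chars} utf8 chars, "
              f"{utf8mb4_chars} utf8mb4 chars")
```

So under the 767-byte limit an indexed VARCHAR(255) column no longer fits after conversion; it would need shortening to VARCHAR(191) or a prefix index.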
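It may be a good idea to make sure every table that has been converted is repaired and optimised after making the above changes. (Note that REPAIR TABLE only applies to storage engines that support it, such as MyISAM; InnoDB tables will simply report that repair is not supported.) You can run the following for each modified table:-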
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;
Alternatively, use mysqlcheck to repair and optimise everything in one go:-
$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

Finally

Everything should be ready now, but don’t forget to update how your app connects to the database so that it uses the new encoding. For example, in CakePHP 3 the datasource’s encoding needs updating to utf8mb4:-
'Datasources' => [
    'default' => [
        ...
        'database' => 'database_name',
        'encoding' => 'utf8mb4',
        ...
    ]
]
We should now be good to go and start saving tweets containing emoticons (or any other text that uses them).
