Saving Emoticons/Unicode from Twitter to a MySQL Database
The last couple of days I’ve been developing a little app that imports Twitter feeds and saves tweets to a MySQL database; however, I quickly hit an issue where certain tweets would not save. MySQL was returning the error message ‘SQLSTATE[HY000]: General error: 1366 Incorrect string value’. Looking at the specific tweets causing the database to throw the error I noticed that they contained emoticons.
The Problem
I was using
utf8
encoding in MySQL which I thought (wrongly) could represent all unicode characters.
The unicodes for emoticons are fully supported by the UTF-8 encoding; however, MySQL’s
utf8
does not! To save emoticons to a MySQL database we need to use utf8mb4
.
The difference between MySQL’s
utf8
and utf8mb4
is that the former can only store 3 byte characters whereas the latter can store 4 byte ones. Therefore with utf8
we can only store unicode characters from the Basic Multilingual Plane. Put more simply, utf8
is suitable for characters from the majority of modern languages and some symbols. Emoticon characters exist in the Supplementary Multilingual Plane for which we need to use utf8mb4
.The Fix
In order to save emoticons and other unicode not supported by MySQL’s
utf8
we need to change the character set and collation properties of our database, tables and columns. As long as our database is currently using utf8
there should be no risk of data loss as utf8mb4
is fully backwards compatible with it. However, it is still a good idea to make a backup of the database before running the following commands.
Let’s first set the connection encoding to
utf8mb4
and then change the database’s character set and collation to it:-SET NAMES utf8mb4;
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
(Replacing
database_name
with the name of the database.)
Next we need to convert the relevant tables to utf8mb4/utf8mb4_unicode_ci (you will need to run this for each table):-
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
(Replacing
table_name
with the name of the database table.)
Finally we need to update the character set and collation for the column(s):-
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(140) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL;
Read that last command carefully as it will need modifying to your specific needs. The important bit to note is the
CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci
. (As with the other commands you need to replace table_name
and column_name
appropriately.)
Bear in mind that changing from
utf8
to utfmb4
affects the maximum lengths of columns and index keys. The maximum lengths of columns will be unchanged when modifying the tables with the commands discussed here; but in terms of actual characters the amount that can be stored is reduced. This is a consequence of changing from being able to store only 3 bytes up to 4! You may need to update these lengths if this is relevant to you.
It may be a good idea to make sure every table that has been converted is repaired and optimised after making the above changes. You can run the following for each modified table:-
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;
Alternatively, use mysqlcheck to repair and optimises everything in one go:-
$ mysqlcheck -u root -p --auto-repair --optimize --all-databases
Finally
Everything should be ready now, but don’t forget to update how your app connects to the database so that it uses the new encoding. For example, in CakePHP 3 the datasource’s encoding needs updating to
utf8mb4
:-'Datasources' => [
'default' => [
...
'database' => 'database_name',
'encoding' => 'utf8mb4',
...
]
]
We should now be good to go and start saving tweets containing emoticons (or anything else using them).
0 comments:
Post a Comment