Wednesday, 14 November 2018

Convert latin1 to UTF-8 in MySQL

Since MySQL 4.1, UTF-8 is the default charset. If you have an old database, containing data encoded in latin1, and you want upgrade to a newer MySQL server, then you can do the following if you want all your data in UTF-8 instead:
Be careful when switching to UTF-8. Once you have converted your data, any program/webapp that uses the database will have to check that the data they are sending to the database is valid UTF-8. If it isn't then MySQL will silently truncate the data after the invalid part, which can cause all sorts of problems. If your program/webapp doesn't specifically say that it supports unicode then you may want to stick with latin1 instead.

MySQL dump

First of all, we need to dump the old data into a file.
mysqldump -h example.org --user=foo -p --default-character-set=latin1 -c \ --insert-ignore --skip-set-charset -r dump.sql dbname
Please note: You have to replace the user, the host and the dbname, otherwise it won't connect to your database.

Convert dump

Before the next step, it'd be useful to run the following command to check the current charset in your dump file
file dump.sql
If your file is already in UTF-8, you may need to skip the following iconv command.
Next thing to do is, converting the characters in the MySQL dump from latin1 to UTF-8
iconv -f ISO8859-1 -t UTF-8 dump.sql > dump_utf8.sql
If you see after your conversion, that umlauts in your database are converted correctly, but that the sign ß and € are broken, you might get it working by using -f CP1252 instead of -f ISO8859-1 in this command.
perl -pi -w -e 's/CHARSET=latin1/CHARSET=utf8/g;' dump_utf8.sql
There may be other places in your database, where latin-1 character set is used. For example, there may be lines in your dump similar to:
ar_title varchar(255) character set latin1 collate latin1_bin NOT NULL default
The best is to either grep your dump against the word "latin1" or look for those lines in some text editor (don't forget to use case-insensitive search).
If you have another source charset, you need to replace the -f option with your local character set.

Drop and create

Now it's time to drop the old database and create a new one with UTF-8 support.
mysql --user=foo -p --execute="DROP DATABASE dbname; CREATE DATABASE dbname CHARACTER SET utf8 COLLATE utf8_general_ci;"
(MySql seems to recommend utf8_unicode_ci over utf8_general_ci for 5.1 +, see http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html)

Import dump to database

Last but not least, we need to import the converted data back to the new database.
mysql --user=foo --max_allowed_packet=16M -p --default-character-set=utf8 dbname < dump_utf8.sql
The max_allowed_packet option is sometimes important. If your import ends up with a "ERROR 1153 at line 42: Got a packet bigger than 'max_allowed_packet'", you need to increase the packet size. To do this, edit /etc/mysql/my.cnf and set max_allowed_packet=16M under the [mysqld] directive. This is because utf-8 coded characters will take more space than latin-1.
Dont forget to restart your mysql server.

Alternative Method

Instead of using the above method of dumping and reloading your data, you can use the ALTER TABLE/DATABASE SQL statements to convert your data.
For each table you want to convert to utf8, use the following statement.
ALTER TABLE <table_name> CONVERT TO CHARACTER SET utf8;
This will automatically convert all text columns to utf8.
To set the default character set for a database, use the following statement.
ALTER DATABASE <database_name> CHARACTER SET utf8;
This does not affect any existing tables, but any future tables created in this database will use utf8 by default.
"The CONVERT TO operation converts column values between the character sets. This is not what you want if you have a column in one character set (like latin1) but the stored values actually use some other, incompatible character set (like utf8). In this case, you have to do the following for each such column:
ALTER TABLE t1 CHANGE c1 c1 BLOB;
ALTER TABLE t1 CHANGE c1 c1 TEXT CHARACTER SET utf8;
The reason this works is that there is no conversion when you convert to or from BLOB columns."

Alternative Method 2

When the data has been double-encoded, none of the methods above will work. You will still have funny looking characters in the database. Someone provided the following solution on his blog and it works very well
mysqldump -h DB_HOST -u DB_USER -p DB_PASSWORD --opt --quote-names \ --skip-set-charset --default-character-set=latin1 DB_NAME > DB_NAME-dump.sql
mysql -h DB_HOST -u DB_USER -p DB_PASSWORD \ --default-character-set=utf8 DB_NAME < DB_NAME-dump.sql

0 comments:

Post a Comment