Tuesday, 7 August 2018

MySQL Truncating Value At First Supplementary / Astral / utf8mb4 Character On INSERT And UPDATE

As I've been embracing [one of] my roles as the Database Administrator's understudy at InVision App, I've been trying hard to track down every vexing problem relating to our MySQL RDS instances. For months, we've been having a problem where data, coming out of a Node.js microservice, was being truncated on the first character that required the utf8mb4 character set. While I don't yet understand why our ColdFusion microservices and our Node.js microservices are handling this differently (I suspect the Java driver is just better than the Node.js driver), I think I finally figured out why Node.js is able to insert corrupted data: we're using a lenient "sql_mode" setting.
For context, this is for data going into a database table that only supports the utf8 character set, not the utf8mb4 character set (which requires up to four bytes for each code-point, using a pair of high / low surrogate characters). When the ColdFusion MySQL 5 driver goes to insert data that contains supplementary / astral plane characters into this table, it errors out with a message like:
Error Executing Database Query. Incorrect string value: '\xF0\x9F\x98\x8D, ...' for column 'value' at row 1.
Now, on the other hand, when the Node.js MySQL driver goes to insert the same data, there is no error. The INSERT succeeds; but, the data becomes corrupted, truncating the input at the same character that would have caused an error in the ColdFusion MySQL driver.
NOTE: The fact that we have two "microservices" accessing the same database is not lost on me; it makes me sad.
I don't know why the ColdFusion driver works better than the Node.js driver (except for the fact that ColdFusion is amazing). It could be something in the connection string. But, even when I have what I believe to be the same connection string in both drivers, the problem still persists.
That said, after much head-banging, experimentation, and Googling, I finally came across the blog post, "Data Corruption To Go: The Perils Of sql_mode = NULL," by Keyur Govande. This post cued me in on the concept of "sql_mode," which I had never heard of before. SQL modes affect the SQL syntax that MySQL supports and the data validation checks that it performs during mutation events.
By default, MySQL uses a sql_mode named "NO_ENGINE_SUBSTITUTION". This is a non-strict mode in which problematic data produces "warnings" not "erros". This is what one of our MySQL RDS instances is using. And, it's the reason data is being corrupted. To see this in action, we can set the current session to use NO_ENGINE_SUBSTITUTION and then try to insert problematic data:
-- This mode produces warnings, not errors for certain problematic
-- operations.
SET SESSION sql_mode = 'NO_ENGINE_SUBSTITUTION';
-- Try to insert data that requires the utf8mb4 character set (into
-- a table that only supports utf8).
INSERT INTO
char_test
SET
id = 100,
value = 'ha ha <?> you crazy!'
;
show warnings;
-- Check to see what value was inserted.
SELECT
value
FROM
char_test
WHERE
id = 100
;

When we run the above SQL, we get the following output:

   
  sql_mode NO_ENGINE_SUBSTITUTION 
   
As you can see, the inserted value is being truncated on the first occurrence of a character that requires utf8mb4. And, if run:
SHOW WARNINGS;

... we get the following:
Warning 1300 - Invalid utf8 character string: 'F09F98'
Warning 1366 -Incorrect string value: '\xF0\x9F\x98\x81 y...' for column 'value' at row 1
In the non-strict "NO_ENGINE_SUBSTITUTION" mode, we get warnings about the problem; but, that doesn't prevent the INSERT for taking place.
Now, if we switch over to "STRICT_ALL_TABLES" mode, which rejects invalid data, we start to see the desired error behavior:
-- This mode produces warnings, not errors for certain problematic
-- operations.
SET SESSION sql_mode = 'STRICT_ALL_TABLES';
-- Try to insert data that requires the utf8mb4 character set (into
-- a table that only supports utf8).
INSERT INTO
char_test
SET
id = 200,
value = 'ha ha <?> you crazy!'
;
-- Check to see what value was inserted.
SELECT
value
FROM
char_test
WHERE
id = 200
;

When we run the above SQL, we get the following output:

   
  sql_mode STRICT_ALL_TABLES 
   
As you can see, this time, the INSERT fails completely and throws the error:
Error : Incorrect string value: '\xF0\x9F\x98\x81 y...' for column 'value' at row 1
This is definitely what we want. Even if the underlying problem is really that our data table has problematic character set support, we'd rather have the statements fail entirely rather than "looking" like they worked and quietly resulting in corrupted data.

0 comments:

Post a Comment