Part of what made debugging the incident frustrating was the fact that the poorly-performing queries showed up in MySQL's slow_log table at the very end of the incident window despite the fact that they clearly started the incident. I have long suspected that the "start_time" column of the slow_log table was misleading; but, this past debugging session finally gave me the motivation to dig into the mechanics of the slow query log and build a better mental model for what the "start_time" column actually represents.
Testing the "start_time" column is fairly straightforward. All we have to do is run two SLEEP() queries back-to-back and see how the "start_time" is reported in the slow query log. Assuming that we use the following queries:
- SLEEP( 2 )
- SLEEP( 20 )
... if the "start_time" is the true initiation time of the queries, then the "start_time" for these two queries will be 2-seconds apart since the second query started 2-seconds after the first one. However, if the "start_time" column actually represents the "end" time or the "log" time, then these two queries will be 20-seconds apart since the second query finished 20-seconds after the first query finished.
As you can see, for this database connection, we're enabling the slow query log and we're setting its threshold to 1-second (the default is 10-seconds). This way, both of our SLEEP() queries will be captured. And, when we run this test in the database, we get the following output:
As you can see, the "start_time" of the two queries, as reported by MySQL's slow query log, is 20-seconds apart. This proves that the "start_time" is, in fact, the "end" time or the "log" time of the queries. This also explains why long-running queries that caused an incident wouldn't actually show up in the slow_log until much later in the incident's timeline.
Luckily, the slow_log table gives us both the "start_time" and the "query_time" columns. And, since the "query_time" represents the duration of the query execution, we can calculate the true "start" time of the query by subtracting the query_time from the reported start_time:
normalized_start_time = ( "start_time" - "query_time" )
At work, I use a time-boxed query to search through MySQL's slow query log so that I can look at queries during an incident window. Given this new insight into the meaning of the "start_time" column, I'm going to update my go-to debugging query to include a normalized start time. This way, I'll see slow queries in the order in which they were initiated, not in the order in which they returned:
As you can see, I'm ordering the query based on the delta of the two columns ("start_time" and "query_time"):
Then, in the SELECT clause, I'm formatting this delta as a nice, easy-to-read time-stamp:
NOTE: The "%T" mask outputs the 24-hour time as "HH:mm:ss".
I think this will make MySQL's slow_log table significantly more intuitive. And, it will make debugging slow-query-based incidents much easier to spot since the slow queries will now show up at the time in which they cause an impact as opposed to the time in which they finally return with a result.
0 comments:
Post a Comment