Tuesday, 17 July 2018

Mixed ASC/DESC sorting in MySQL

I have a table of blog posts, each with a foreign key back to it's author. There are less than 15,000entries in this table.
This query scans over 19,000 rows (per EXPLAIN), requires a filesort (that might be regular MySQLbehavior), and takes over 400ms to return 5 rows. possibly because of the complicated WHERE used to check if the item is actually published.
01.SELECT  blog_post.id,
02.blog_post.title,
03.blog_post.author_id,
04.blog_post.has_been_fact_checked,
05.blog_post.published_date,
06.blog_post.ordering,
07.auth_user.username,
08.auth_user.email
09.FROM    blog_post
10.INNER JOIN
11.auth_user
12.ON      auth_user.id = blog_post.author_id
13.WHERE   blog_post.is_approved = 1
14.AND blog_post.has_been_fact_checked = 1
15.AND blog_post.published_date <=  '2010-10-25 22:40:05'
16.ORDER BY
17.blog_post.published_date DESC,
18.blog_post.ordering ASC,
19.blog_post.id DESC
20.LIMIT 5

How can I wrangle this query under control?
This query is quite simple: a filtering condition with two equalities and a range and an order by. The range in the filter fits the ORDER BY perfectly, so the whole query could be executed using an index scan without any filesorts.
The only problem is that we have the mixed directions in ORDER BY, and MySQL does not support ASC / DESCclauses for the indexes.
With a simple table redesign, the problem could easily be solved: we would just need to reverse the order in the ordering column, say, by creating its copy and storing negative values in it. This way, we could just create a composite index (with all columns ascending) and rewrite the query to use the reverse column instead. That's what many engines do, MediaWiki (which Wikipedia runs on) being one of the most well-known examples.
This solution is nice, I hear people saying, but requires a database redesign which as we all know is never as simple as some development pundits on their blogs seem to think.
OK, this is a good point. Let's see what we can do with the current design, and, as always, create a sample table to do it:

Table creation details
There are 1,000 authors and 500,000 blog posts in the respective tables. The blog posts intentionally have duplicates on published_date.
The original query:
01.SELECT  blog_post.id,
02.blog_post.title,
03.blog_post.author_id,
04.blog_post.has_been_fact_checked,
05.blog_post.published_date,
06.blog_post.ordering,
07.auth_user.username,
08.auth_user.email
09.FROM    blog_post
10.INNER JOIN
11.auth_user
12.ON      auth_user.id = blog_post.author_id
13.WHERE   blog_post.is_approved = 1
14.AND blog_post.has_been_fact_checked = 1
15.AND blog_post.published_date <= '2010-10-25 22:40:05'
16.ORDER BY
17.blog_post.published_date DESC,
18.blog_post.ordering ASC,
19.blog_post.id DESC
20.LIMIT 5
idtitleauthor_idhas_been_fact_checkedpublished_dateorderingusernameemail
156410Post 15641098412010-10-25 22:00:001author984author984@example.com
451417Post 45141714012010-10-25 22:00:008author140author140@example.com
262749Post 26274971912010-10-25 22:00:0019author719author719@example.com
415157Post 41515743012010-10-25 22:00:0022author430author430@example.com
307578Post 30757861112010-10-25 22:00:0072author611author611@example.com
5 rows fetched in 0.0004s (0.4558s)
idselect_typetabletypepossible_keyskeykey_lenrefrowsfilteredExtra
1SIMPLEblog_postrefix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id2const,const189693100.00Using where; Using filesort
1SIMPLEauth_usereq_refPRIMARYPRIMARY420101102_desc.blog_post.author_id1100.00
select `20101102_desc`.`blog_post`.`id` AS `id`,`20101102_desc`.`blog_post`.`title` AS `title`,`20101102_desc`.`blog_post`.`author_id` AS `author_id`,`20101102_desc`.`blog_post`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`blog_post`.`published_date` AS `published_date`,`20101102_desc`.`blog_post`.`ordering` AS `ordering`,`20101102_desc`.`auth_user`.`username` AS `username`,`20101102_desc`.`auth_user`.`email` AS `email` from `20101102_desc`.`blog_post` join `20101102_desc`.`auth_user` where ((`20101102_desc`.`auth_user`.`id` = `20101102_desc`.`blog_post`.`author_id`) and (`20101102_desc`.`blog_post`.`has_been_fact_checked` = 1) and (`20101102_desc`.`blog_post`.`is_approved` = 1) and (`20101102_desc`.`blog_post`.`published_date` <= '2010-10-25 22:40:05')) order by `20101102_desc`.`blog_post`.`published_date` desc,`20101102_desc`.`blog_post`.`ordering`,`20101102_desc`.`blog_post`.`id` desc limit 5
As the asker said, this query takes 455 ms and the filesort is used.
Now, as we cannot use a single index on mixed ASC / DESC ordering conditions, we should squeeze the most from our existing index.
The longest common prefix of the index and the ORDER BY expression (which implicitly includes all columns from the WHERE clause) is (is_approved, has_been_fact_checked, published_date).
What does it mean? It means that the chunks of data sharing the distinct values of these three columns will always go in the same order both in the index and in the query, though the order may and will differ within each chunk.
See the queries below, first one with the initial ORDER BY sorting:
01.SELECT  published_date, id, ordering
02.FROM    blog_post
03.WHERE   is_approved = 1
04.AND has_been_fact_checked = 1
05.ORDER BY
06.published_date DESC,
07.ordering ASC,
08.id DESC
09.LIMIT 16
published_dateidordering
2010-11-02 00:00:0011142997
 
2010-11-01 23:00:00486386
2010-11-01 23:00:002452457
2010-11-01 23:00:0015309919
2010-11-01 23:00:0033462433
2010-11-01 23:00:0020084445
2010-11-01 23:00:0016311452
2010-11-01 23:00:0031284754
2010-11-01 23:00:0020553172
2010-11-01 23:00:0026579594
 
2010-11-01 22:00:0040947524
2010-11-01 22:00:0021348345
2010-11-01 22:00:007257157
 
2010-11-01 21:00:0010295632
2010-11-01 21:00:0042257948
2010-11-01 21:00:0028029798
, the second one with the index sorting:
01.SELECT  published_date, id, ordering
02.FROM    blog_post
03.WHERE   is_approved = 1
04.AND has_been_fact_checked = 1
05.ORDER BY
06.published_date DESC,
07.ordering DESC,
08.id DESC
09.LIMIT 16
published_dateidordering
2010-11-02 00:00:0011142997
 
2010-11-01 23:00:0026579594
2010-11-01 23:00:0020553172
2010-11-01 23:00:0031284754
2010-11-01 23:00:0016311452
2010-11-01 23:00:0020084445
2010-11-01 23:00:0033462433
2010-11-01 23:00:0015309919
2010-11-01 23:00:002452457
2010-11-01 23:00:00486386
 
2010-11-01 22:00:007257157
2010-11-01 22:00:0021348345
2010-11-01 22:00:0040947524
 
2010-11-01 21:00:0028029798
2010-11-01 21:00:0042257948
2010-11-01 21:00:0010295632
As you can see, the chunks of data occupy the same places within both recordsets. Post dated 2010-11-02 00:00:00 is at row 1. Those dated 2010-11-01 23:00:00 are at rows 2 — 10 etc. The orders differ only within the chunks, but not between chunks.
This means that if we select the same chunks as the original query does and reorder them, we get the same order.
And the chunks we can select quite efficiently using the index, since their order, as we shown above, is the same for both queries.
The original query had LIMIT 5, this means we have to select at most 5 chunks. To do this, we just need to count off 5 distinct dates satisfying the condition.
This can be done using a technique I described in one of my earlier posts:

Selecting 5 dates

Here's a query to select 5 distinct dates less than the given one:
01.SELECT  bp.published_date
02.FROM    blog_post bp
03.WHERE   bp.is_approved = 1
04.AND bp.has_been_fact_checked = 1
05.AND bp.published_date < '2010-10-25 22:40:05'
06.AND bp.id =
07.(
08.SELECT  id
09.FROM    blog_post bpi
10.WHERE   bpi.is_approved = 1
11.AND bpi.has_been_fact_checked = 1
12.AND bpi.published_date = bp.published_date
13.ORDER BY
14.bpi.is_approved DESC, bpi.has_been_fact_checked DESC, bpi.published_date DESC
15.LIMIT 1
16.)
17.ORDER BY
18.bp.is_approved DESC, bp.has_been_fact_checked DESC, bp.published_date DESC
19.LIMIT 5
published_date
2010-10-25 22:00:00
2010-10-25 21:00:00
2010-10-25 20:00:00
2010-10-25 19:00:00
2010-10-25 18:00:00
5 rows fetched in 0.0001s (0.0018s)
idselect_typetabletypepossible_keyskeykey_lenrefrowsfilteredExtra
1PRIMARYbprangeix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id11189693100.00Using where; Using index
2DEPENDENT SUBQUERYbpirefix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id11const,const,20101102_desc.bp.published_date1100.00Using where; Using index
Field or reference '20101102_desc.bp.published_date' of SELECT #2 was resolved in SELECT #1
select `20101102_desc`.`bp`.`published_date` AS `published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` < '2010-10-25 22:40:05') and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 5
This uses the index access and is very efficient.

Selecting chunks of data

Now we need to select all records within the chunks defined by the dates. This can be done with a simple BETWEEN condition. The upper bound of the BETWEEN will be the date that we used as the parameter to the query, and the lower one will be the last date of those selected on the previous stage.
Since it should be a scalar, we will just replace LIMIT 5 (which returns 5 records) with LIMIT 4, 1 (which returns the last of these 5 records) and put it into a scalar subquery:
01.SELECT  *
02.FROM    blog_post bpo
03.WHERE   is_approved = 1
04.AND has_been_fact_checked = 1
05.AND published_date BETWEEN
06.(
07.SELECT  bp.published_date
08.FROM    blog_post bp
09.WHERE   bp.is_approved = 1
10.AND bp.has_been_fact_checked = 1
11.AND bp.published_date < '2010-10-25 22:40:05'
12.AND bp.id =
13.(
14.SELECT  id
15.FROM    blog_post bpi
16.WHERE   bpi.is_approved = 1
17.AND bpi.has_been_fact_checked = 1
18.AND bpi.published_date = bp.published_date
19.ORDER BY
20.bpi.is_approved DESC, bpi.has_been_fact_checked DESC, bpi.published_date DESC
21.LIMIT 1
22.)
23.ORDER BY
24.bp.is_approved DESC, bp.has_been_fact_checked DESC, bp.published_date DESC
25.LIMIT 4, 1
26.)
27.AND '2010-10-25 22:40:05'
idtitleis_approvedhas_been_fact_checkedpublished_dateorderingauthor_id
418694Post 418694112010-10-25 18:00:0016579
354679Post 354679112010-10-25 18:00:0044733
480570Post 480570112010-10-25 18:00:007062
402339Post 402339112010-10-25 18:00:0085498
282148Post 282148112010-10-25 18:00:0087152
358065Post 358065112010-10-25 18:00:0088301
 
328191Post 328191112010-10-25 19:00:0033627
95345Post 95345112010-10-25 19:00:0039497
73886Post 73886112010-10-25 19:00:00771
 
29391Post 29391112010-10-25 20:00:0013259
632Post 632112010-10-25 20:00:0015952
 
93117Post 93117112010-10-25 21:00:0065773
278821Post 278821112010-10-25 21:00:007249
 
156410Post 156410112010-10-25 22:00:001984
451417Post 451417112010-10-25 22:00:008140
262749Post 262749112010-10-25 22:00:0019719
415157Post 415157112010-10-25 22:00:0022430
225161Post 225161112010-10-25 22:00:0072398
307578Post 307578112010-10-25 22:00:0072611
426932Post 426932112010-10-25 22:00:009264
20 rows fetched in 0.0012s (0.0041s)
idselect_typetabletypepossible_keyskeykey_lenrefrowsfilteredExtra
1PRIMARYbporangeix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id112075.00Using where
2SUBQUERYbprangeix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id11189693100.00Using where; Using index
3DEPENDENT SUBQUERYbpirefix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id1120101102_desc.bp.published_date1100.00Using where; Using index
Field or reference '20101102_desc.bp.published_date' of SELECT #3 was resolved in SELECT #2
select `20101102_desc`.`bpo`.`id` AS `id`,`20101102_desc`.`bpo`.`title` AS `title`,`20101102_desc`.`bpo`.`is_approved` AS `is_approved`,`20101102_desc`.`bpo`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`bpo`.`published_date` AS `published_date`,`20101102_desc`.`bpo`.`ordering` AS `ordering`,`20101102_desc`.`bpo`.`author_id` AS `author_id` from `20101102_desc`.`blog_post` `bpo` where ((`20101102_desc`.`bpo`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpo`.`is_approved` = 1) and (`20101102_desc`.`bpo`.`published_date` between (select `20101102_desc`.`bp`.`published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` < '2010-10-25 22:40:05') and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 4,1) and '2010-10-25 22:40:05'))
This query returns (almost instantly) 20 records in 5 chunks of different size (split for readability). Note that the chunks are arranged in order reverse to the order the dates were returned on the previous step: we omitted the ORDER BY clause and, by default, the index is traversed in forward (ASC) direction.

Ordering and joining

Now, all that's left to do is to add the correct ORDER BY condition to rearrange the data within the chunks, join the authors table and apply the LIMIT:
01.SELECT  bpo.id,
02.bpo.title,
03.bpo.author_id,
04.bpo.has_been_fact_checked,
05.bpo.published_date,
06.bpo.ordering,
07.au.username,
08.au.email
09.FROM    blog_post bpo
10.JOIN    auth_user au
11.ON      au.id = bpo.author_id
12.WHERE   bpo.is_approved = 1
13.AND bpo.has_been_fact_checked = 1
14.AND bpo.published_date BETWEEN
15.(
16.SELECT  bp.published_date
17.FROM    blog_post bp
18.WHERE   bp.is_approved = 1
19.AND bp.has_been_fact_checked = 1
20.AND bp.published_date < '2010-10-25 22:40:05'
21.AND bp.id =
22.(
23.SELECT  id
24.FROM    blog_post bpi
25.WHERE   bpi.is_approved = 1
26.AND bpi.has_been_fact_checked = 1
27.AND bpi.published_date = bp.published_date
28.ORDER BY
29.bpi.is_approved DESC, bpi.has_been_fact_checked DESC, bpi.published_date DESC
30.LIMIT 1
31.)
32.ORDER BY
33.bp.is_approved DESC, bp.has_been_fact_checked DESC, bp.published_date DESC
34.LIMIT 4, 1
35.)
36.AND '2010-10-25 22:40:05'
37.ORDER BY
38.bpo.published_date DESC,
39.bpo.ordering ASC,
40.bpo.id DESC
41.LIMIT 5
idtitleauthor_idhas_been_fact_checkedpublished_dateorderingusernameemail
156410Post 15641098412010-10-25 22:00:001author984author984@example.com
451417Post 45141714012010-10-25 22:00:008author140author140@example.com
262749Post 26274971912010-10-25 22:00:0019author719author719@example.com
415157Post 41515743012010-10-25 22:00:0022author430author430@example.com
307578Post 30757861112010-10-25 22:00:0072author611author611@example.com
5 rows fetched in 0.0004s (0.0044s)
idselect_typetabletypepossible_keyskeykey_lenrefrowsfilteredExtra
1PRIMARYbporangeix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id1120100.00Using where; Using filesort
1PRIMARYaueq_refPRIMARYPRIMARY420101102_desc.bpo.author_id1100.00
2SUBQUERYbprangeix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id11189693100.00Using where; Using index
3DEPENDENT SUBQUERYbpirefix_blogpost_approved_checked_published_ordering_idix_blogpost_approved_checked_published_ordering_id1120101102_desc.bp.published_date1100.00Using where; Using index
Field or reference '20101102_desc.bp.published_date' of SELECT #3 was resolved in SELECT #2
select `20101102_desc`.`bpo`.`id` AS `id`,`20101102_desc`.`bpo`.`title` AS `title`,`20101102_desc`.`bpo`.`author_id` AS `author_id`,`20101102_desc`.`bpo`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`bpo`.`published_date` AS `published_date`,`20101102_desc`.`bpo`.`ordering` AS `ordering`,`20101102_desc`.`au`.`username` AS `username`,`20101102_desc`.`au`.`email` AS `email` from `20101102_desc`.`blog_post` `bpo` join `20101102_desc`.`auth_user` `au` where ((`20101102_desc`.`au`.`id` = `20101102_desc`.`bpo`.`author_id`) and (`20101102_desc`.`bpo`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpo`.`is_approved` = 1) and (`20101102_desc`.`bpo`.`published_date` between (select `20101102_desc`.`bp`.`published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` < '2010-10-25 22:40:05') and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 4,1) and '2010-10-25 22:40:05')) order by `20101102_desc`.`bpo`.`published_date` desc,`20101102_desc`.`bpo`.`ordering`,`20101102_desc`.`bpo`.`id` desc limit 5
The records have been rearranged (using a filesort) and a LIMIT was applied.
Since the records are only taken from 5 chunks with 20 records in total, the filesort is not a problem at all and the query completes in only 4 ms.

Conclusion

MySQL does not support ASC / DESC clauses in the indices.
However, with an ORDER BY / LIMIT condition that mixes the orders of columns, it is possible to use an index.
To do that, we should limit the recordset to the minimal set that would still retain the order on the leading columns that share one direction. This can be done by selecting distinct set of those leading columns up to the original limit, and filtering on them using the index.
Then we can revert to a filesort to order this (much smaller) set on the remaining columns, which would be much more efficient.

0 comments:

Post a Comment