I have a table of blog posts, each with a foreign key back to it's author. There are less than 15,000entries in this table.This query scans over 19,000 rows (perEXPLAIN
), requires afilesort
(that might be regular MySQLbehavior), and takes over 400ms to return 5 rows. possibly because of the complicatedWHERE
used to check if the item is actually published.01.
SELECT
blog_post.id,
02.
blog_post.title,
03.
blog_post.author_id,
04.
blog_post.has_been_fact_checked,
05.
blog_post.published_date,
06.
blog_post.ordering,
07.
auth_user.username,
08.
auth_user.email
09.
FROM
blog_post
10.
INNER
JOIN
11.
auth_user
12.
ON
auth_user.id = blog_post.author_id
13.
WHERE
blog_post.is_approved = 1
14.
AND
blog_post.has_been_fact_checked = 1
15.
AND
blog_post.published_date <=
'2010-10-25 22:40:05'
16.
ORDER
BY
17.
blog_post.published_date
DESC
,
18.
blog_post.ordering
ASC
,
19.
blog_post.id
DESC
20.
LIMIT 5
How can I wrangle this query under control?
This query is quite simple: a filtering condition with two equalities and a range and an order by. The range in the filter fits the
ORDER BY
perfectly, so the whole query could be executed using an index scan without any filesorts.
The only problem is that we have the mixed directions in
ORDER BY
, and MySQL does not support ASC / DESC
clauses for the indexes.
With a simple table redesign, the problem could easily be solved: we would just need to reverse the order in the
ordering
column, say, by creating its copy and storing negative values in it. This way, we could just create a composite index (with all columns ascending) and rewrite the query to use the reverse column instead. That's what many engines do, MediaWiki (which Wikipedia runs on) being one of the most well-known examples.This solution is nice, I hear people saying,
but requires a database redesign which as we all know is never as simple as some development pundits on their blogs seem to think.
OK, this is a good point. Let's see what we can do with the current design, and, as always, create a sample table to do it:
Table creation details
Table creation details
There are 1,000 authors and 500,000 blog posts in the respective tables. The blog posts intentionally have duplicates on
published_date
.
The original query:
01.
SELECT
blog_post.id,
02.
blog_post.title,
03.
blog_post.author_id,
04.
blog_post.has_been_fact_checked,
05.
blog_post.published_date,
06.
blog_post.ordering,
07.
auth_user.username,
08.
auth_user.email
09.
FROM
blog_post
10.
INNER
JOIN
11.
auth_user
12.
ON
auth_user.id = blog_post.author_id
13.
WHERE
blog_post.is_approved = 1
14.
AND
blog_post.has_been_fact_checked = 1
15.
AND
blog_post.published_date <=
'2010-10-25 22:40:05'
16.
ORDER
BY
17.
blog_post.published_date
DESC
,
18.
blog_post.ordering
ASC
,
19.
blog_post.id
DESC
20.
LIMIT 5
select `20101102_desc`.`blog_post`.`id` AS `id`,`20101102_desc`.`blog_post`.`title` AS `title`,`20101102_desc`.`blog_post`.`author_id` AS `author_id`,`20101102_desc`.`blog_post`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`blog_post`.`published_date` AS `published_date`,`20101102_desc`.`blog_post`.`ordering` AS `ordering`,`20101102_desc`.`auth_user`.`username` AS `username`,`20101102_desc`.`auth_user`.`email` AS `email` from `20101102_desc`.`blog_post` join `20101102_desc`.`auth_user` where ((`20101102_desc`.`auth_user`.`id` = `20101102_desc`.`blog_post`.`author_id`) and (`20101102_desc`.`blog_post`.`has_been_fact_checked` = 1) and (`20101102_desc`.`blog_post`.`is_approved` = 1) and (`20101102_desc`.`blog_post`.`published_date` <= '2010-10-25 22:40:05')) order by `20101102_desc`.`blog_post`.`published_date` desc,`20101102_desc`.`blog_post`.`ordering`,`20101102_desc`.`blog_post`.`id` desc limit 5
As the asker said, this query takes 455 ms and the
filesort
is used.
Now, as we cannot use a single index on mixed
ASC / DESC
ordering conditions, we should squeeze the most from our existing index.
The longest common prefix of the index and the
ORDER BY
expression (which implicitly includes all columns from the WHERE
clause) is (is_approved, has_been_fact_checked, published_date)
.
What does it mean? It means that the chunks of data sharing the distinct values of these three columns will always go in the same order both in the index and in the query, though the order may and will differ within each chunk.
See the queries below, first one with the initial
ORDER BY
sorting:01.
SELECT
published_date, id, ordering
02.
FROM
blog_post
03.
WHERE
is_approved = 1
04.
AND
has_been_fact_checked = 1
05.
ORDER
BY
06.
published_date
DESC
,
07.
ordering
ASC
,
08.
id
DESC
09.
LIMIT 16
, the second one with the index sorting:
01.
SELECT
published_date, id, ordering
02.
FROM
blog_post
03.
WHERE
is_approved = 1
04.
AND
has_been_fact_checked = 1
05.
ORDER
BY
06.
published_date
DESC
,
07.
ordering
DESC
,
08.
id
DESC
09.
LIMIT 16
As you can see, the chunks of data occupy the same places within both recordsets. Post dated 2010-11-02 00:00:00 is at row 1. Those dated 2010-11-01 23:00:00 are at rows 2 — 10 etc. The orders differ only within the chunks, but not between chunks.
This means that if we select the same chunks as the original query does and reorder them, we get the same order.
And the chunks we can select quite efficiently using the index, since their order, as we shown above, is the same for both queries.
The original query had
LIMIT 5
, this means we have to select at most 5 chunks. To do this, we just need to count off 5 distinct dates satisfying the condition.
This can be done using a technique I described in one of my earlier posts:
Selecting 5 dates
Here's a query to select 5 distinct dates less than the given one:
01.
SELECT
bp.published_date
02.
FROM
blog_post bp
03.
WHERE
bp.is_approved = 1
04.
AND
bp.has_been_fact_checked = 1
05.
AND
bp.published_date <
'2010-10-25 22:40:05'
06.
AND
bp.id =
07.
(
08.
SELECT
id
09.
FROM
blog_post bpi
10.
WHERE
bpi.is_approved = 1
11.
AND
bpi.has_been_fact_checked = 1
12.
AND
bpi.published_date = bp.published_date
13.
ORDER
BY
14.
bpi.is_approved
DESC
, bpi.has_been_fact_checked
DESC
, bpi.published_date
DESC
15.
LIMIT 1
16.
)
17.
ORDER
BY
18.
bp.is_approved
DESC
, bp.has_been_fact_checked
DESC
, bp.published_date
DESC
19.
LIMIT 5
Field or reference '20101102_desc.bp.published_date' of SELECT #2 was resolved in SELECT #1 select `20101102_desc`.`bp`.`published_date` AS `published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` < '2010-10-25 22:40:05') and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 5
This uses the index access and is very efficient.
Selecting chunks of data
Now we need to select all records within the chunks defined by the dates. This can be done with a simple
BETWEEN
condition. The upper bound of the BETWEEN
will be the date that we used as the parameter to the query, and the lower one will be the last date of those selected on the previous stage.
Since it should be a scalar, we will just replace
LIMIT 5
(which returns 5 records) with LIMIT 4, 1
(which returns the last of these 5 records) and put it into a scalar subquery:01.
SELECT
*
02.
FROM
blog_post bpo
03.
WHERE
is_approved = 1
04.
AND
has_been_fact_checked = 1
05.
AND
published_date
BETWEEN
06.
(
07.
SELECT
bp.published_date
08.
FROM
blog_post bp
09.
WHERE
bp.is_approved = 1
10.
AND
bp.has_been_fact_checked = 1
11.
AND
bp.published_date <
'2010-10-25 22:40:05'
12.
AND
bp.id =
13.
(
14.
SELECT
id
15.
FROM
blog_post bpi
16.
WHERE
bpi.is_approved = 1
17.
AND
bpi.has_been_fact_checked = 1
18.
AND
bpi.published_date = bp.published_date
19.
ORDER
BY
20.
bpi.is_approved
DESC
, bpi.has_been_fact_checked
DESC
, bpi.published_date
DESC
21.
LIMIT 1
22.
)
23.
ORDER
BY
24.
bp.is_approved
DESC
, bp.has_been_fact_checked
DESC
, bp.published_date
DESC
25.
LIMIT 4, 1
26.
)
27.
AND
'2010-10-25 22:40:05'
Field or reference '20101102_desc.bp.published_date' of SELECT #3 was resolved in SELECT #2 select `20101102_desc`.`bpo`.`id` AS `id`,`20101102_desc`.`bpo`.`title` AS `title`,`20101102_desc`.`bpo`.`is_approved` AS `is_approved`,`20101102_desc`.`bpo`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`bpo`.`published_date` AS `published_date`,`20101102_desc`.`bpo`.`ordering` AS `ordering`,`20101102_desc`.`bpo`.`author_id` AS `author_id` from `20101102_desc`.`blog_post` `bpo` where ((`20101102_desc`.`bpo`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpo`.`is_approved` = 1) and (`20101102_desc`.`bpo`.`published_date` between (select `20101102_desc`.`bp`.`published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` < '2010-10-25 22:40:05') and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 4,1) and '2010-10-25 22:40:05'))
This query returns (almost instantly) 20 records in 5 chunks of different size (split for readability). Note that the chunks are arranged in order reverse to the order the dates were returned on the previous step: we omitted the
ORDER BY
clause and, by default, the index is traversed in forward (ASC
) direction.Ordering and joining
Now, all that's left to do is to add the correct
ORDER BY
condition to rearrange the data within the chunks, join the authors table and apply the LIMIT
:01.
SELECT
bpo.id,
02.
bpo.title,
03.
bpo.author_id,
04.
bpo.has_been_fact_checked,
05.
bpo.published_date,
06.
bpo.ordering,
07.
au.username,
08.
au.email
09.
FROM
blog_post bpo
10.
JOIN
auth_user au
11.
ON
au.id = bpo.author_id
12.
WHERE
bpo.is_approved = 1
13.
AND
bpo.has_been_fact_checked = 1
14.
AND
bpo.published_date
BETWEEN
15.
(
16.
SELECT
bp.published_date
17.
FROM
blog_post bp
18.
WHERE
bp.is_approved = 1
19.
AND
bp.has_been_fact_checked = 1
20.
AND
bp.published_date <
'2010-10-25 22:40:05'
21.
AND
bp.id =
22.
(
23.
SELECT
id
24.
FROM
blog_post bpi
25.
WHERE
bpi.is_approved = 1
26.
AND
bpi.has_been_fact_checked = 1
27.
AND
bpi.published_date = bp.published_date
28.
ORDER
BY
29.
bpi.is_approved
DESC
, bpi.has_been_fact_checked
DESC
, bpi.published_date
DESC
30.
LIMIT 1
31.
)
32.
ORDER
BY
33.
bp.is_approved
DESC
, bp.has_been_fact_checked
DESC
, bp.published_date
DESC
34.
LIMIT 4, 1
35.
)
36.
AND
'2010-10-25 22:40:05'
37.
ORDER
BY
38.
bpo.published_date
DESC
,
39.
bpo.ordering
ASC
,
40.
bpo.id
DESC
41.
LIMIT 5
Field or reference '20101102_desc.bp.published_date' of SELECT #3 was resolved in SELECT #2 select `20101102_desc`.`bpo`.`id` AS `id`,`20101102_desc`.`bpo`.`title` AS `title`,`20101102_desc`.`bpo`.`author_id` AS `author_id`,`20101102_desc`.`bpo`.`has_been_fact_checked` AS `has_been_fact_checked`,`20101102_desc`.`bpo`.`published_date` AS `published_date`,`20101102_desc`.`bpo`.`ordering` AS `ordering`,`20101102_desc`.`au`.`username` AS `username`,`20101102_desc`.`au`.`email` AS `email` from `20101102_desc`.`blog_post` `bpo` join `20101102_desc`.`auth_user` `au` where ((`20101102_desc`.`au`.`id` = `20101102_desc`.`bpo`.`author_id`) and (`20101102_desc`.`bpo`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpo`.`is_approved` = 1) and (`20101102_desc`.`bpo`.`published_date` between (select `20101102_desc`.`bp`.`published_date` from `20101102_desc`.`blog_post` `bp` where ((`20101102_desc`.`bp`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bp`.`is_approved` = 1) and (`20101102_desc`.`bp`.`published_date` < '2010-10-25 22:40:05') and (`20101102_desc`.`bp`.`id` = (select `20101102_desc`.`bpi`.`id` from `20101102_desc`.`blog_post` `bpi` where ((`20101102_desc`.`bpi`.`has_been_fact_checked` = 1) and (`20101102_desc`.`bpi`.`is_approved` = 1) and (`20101102_desc`.`bpi`.`published_date` = `20101102_desc`.`bp`.`published_date`)) order by `20101102_desc`.`bpi`.`is_approved` desc,`20101102_desc`.`bpi`.`has_been_fact_checked` desc,`20101102_desc`.`bpi`.`published_date` desc limit 1))) order by `20101102_desc`.`bp`.`is_approved` desc,`20101102_desc`.`bp`.`has_been_fact_checked` desc,`20101102_desc`.`bp`.`published_date` desc limit 4,1) and '2010-10-25 22:40:05')) order by `20101102_desc`.`bpo`.`published_date` desc,`20101102_desc`.`bpo`.`ordering`,`20101102_desc`.`bpo`.`id` desc limit 5
The records have been rearranged (using a
filesort
) and a LIMIT
was applied.
Since the records are only taken from 5 chunks with 20 records in total, the
filesort
is not a problem at all and the query completes in only 4 ms.Conclusion
MySQL does not support
ASC / DESC
clauses in the indices.
However, with an
ORDER BY / LIMIT
condition that mixes the orders of columns, it is possible to use an index.
To do that, we should limit the recordset to the minimal set that would still retain the order on the leading columns that share one direction. This can be done by selecting distinct set of those leading columns up to the original limit, and filtering on them using the index.
Then we can revert to a
filesort
to order this (much smaller) set on the remaining columns, which would be much more efficient.
0 comments:
Post a Comment