Imagine we are keeping an online shop and want to find the customers that don't have complete orders.
We'll make the structure of the orders a little bit complex:
- Each customer may have a number of baskets
- Each basket will have a list of positions in it
- Each position has a number of discounts
- An order is considered complete, when all entites are present: there is at least one basket, all baskets should have at least one position, and each position should have at least one discount
We will keep the data in four tables as following:
Let's try to find all customers that have at least one incomplete order.
Let's try to find all customers that have at least one incomplete order.
An incomplete order means that either there are no basket, or there are no positions in the basket, or there are no discounts for the position.
In our sample data, customer 1 has no baskets, customer 2 has no positions, and customer 3 has no discounts.
Having no baskets or no positions in a basket naturally also means not having any discounts. That's why we can utilize an outer join to find the missing rows, and filter out the discounts with a NULL
id
:01.
SELECT
DISTINCT
cmr_id
02.
FROM
t_customer
03.
LEFT
OUTER
JOIN
04.
t_basket
05.
ON
bsk_customer = cmr_id
06.
LEFT
OUTER
JOIN
07.
t_position
08.
ON
pos_basket = bsk_id
09.
LEFT
OUTER
JOIN
10.
t_discount
11.
ON
dsc_position = pos_id
12.
WHERE
dsc_id
IS
NULL
This runs for more than 3 seconds.
The query joins all four tables, selects all rows from them and finally filters out the rows that have a NULL
dsc_id
.
We have more than 100,000 rows in the largest tables, that's why these joins are quite costly.
We can reduce scanning: just find a first matching row for each customer that has a NULL
dsc_id
and then go to another customer instead of scanning all other baskets, positions and discounts for this one.
This may be achieved by using
NOT EXISTS
:01.
SELECT
cmr_id
02.
FROM
t_customer
03.
WHERE
NOT
EXISTS
04.
(
05.
SELECT
1
06.
FROM
t_basket, t_position, t_discount
07.
WHERE
bsk_customer = cmr_id
08.
AND
pos_basket = bsk_id
09.
AND
dsc_position = pos_id
10.
)
This is much faster.
Now let's change the rules a little. We will consider an order complete only if it has at least two discounts for each position. In our sample data, customer 4 has exactly 1 discount for each position.
This becomes complicated now. We cannot use
NOT EXISTS
anymore, as it will fail if exactly one discount exists, which is not what we need.
We can of course filter out the positions having less than two discounts using the following query:
01.
SELECT
cmr_id
02.
FROM
t_customer
03.
WHERE
NOT
EXISTS
04.
(
05.
SELECT
1
06.
FROM
t_basket
07.
LEFT
OUTER
JOIN
08.
t_position
09.
ON
pos_basket = bsk_id
10.
LEFT
OUTER
JOIN
11.
t_discount
12.
ON
dsc_position = pos_id
13.
WHERE
bsk_customer = cmr_id
14.
GROUP
BY
15.
bsk_id, pos_id
16.
HAVING
COUNT
(*) > 1
17.
)
, which will give us the result we're after. But this query is quite slow.
How can we optimize it?
MySQL uses
NESTED LOOPS
to perform joins, with t_customer
as a leading table. To calculate a COUNT(*)
, it first need to join all four tables, and then filter out the rows that have COUNT(*) >= 2
.
But
t_discount
actually has all the information we need for filtering. We may first perform the filtering, and then join the other tables with the filtered results. Thus we can get rid of the unneeded joins.
So we'll make our query perform the following steps:
- We find all customers that don't have discounts at all. These customers don't have any records in the t_discount, but, fortunately, we already know how to filter them out fast
- We find all customers that have positions with exactly one discount. To do this, we select all such positions from t_discount by using COUNT(*), and perform the joins with the other tables. Joins will be performed after fitering.
- Finally, we UNION the results of the queries above
Let's see how it works:
01.
SELECT
cmr_id
02.
FROM
t_customer
03.
WHERE
NOT
EXISTS
04.
(
05.
SELECT
1
06.
FROM
t_basket, t_position, t_discount
07.
WHERE
bsk_customer = cmr_id
08.
AND
pos_basket = bsk_id
09.
AND
dsc_position = pos_id
10.
)
11.
UNION
12.
SELECT
DISTINCT
bsk_customer
13.
FROM
(
14.
SELECT
dsc_position
15.
FROM
t_discount
16.
GROUP
BY
17.
dsc_position
18.
HAVING
COUNT
(*) < 2
19.
) dd, t_position, t_basket
20.
WHERE
pos_id = dsc_position
21.
AND
bsk_id = pos_basket
Much faster.
0 comments:
Post a Comment