Monday, 12 November 2018

Mysql: Finding incomplete orders

Imagine we are keeping an online shop and want to find the customers that don't have complete orders.
We'll make the structure of the orders a little bit complex:
  • Each customer may have a number of baskets
  • Each basket will have a list of positions in it
  • Each position has a number of discounts
  • An order is considered complete, when all entites are present: there is at least one basket, all baskets should have at least one position, and each position should have at least one discount
We will keep the data in four tables as following:

Let's try to find all customers that have at least one incomplete order.
An incomplete order means that either there are no basket, or there are no positions in the basket, or there are no discounts for the position.
In our sample data, customer 1 has no baskets, customer 2 has no positions, and customer 3 has no discounts.
Having no baskets or no positions in a basket naturally also means not having any discounts. That's why we can utilize an outer join to find the missing rows, and filter out the discounts with a NULL id:
01.SELECT  DISTINCT cmr_id
02.FROM    t_customer
03.LEFT OUTER JOIN
04.t_basket
05.ON      bsk_customer = cmr_id
06.LEFT OUTER JOIN
07.t_position
08.ON      pos_basket = bsk_id
09.LEFT OUTER JOIN
10.t_discount
11.ON      dsc_position = pos_id
12.WHERE   dsc_id IS NULL
cmr_id
1
2
3
3 rows fetched in 0,0012s (3,2659s)
This runs for more than 3 seconds.
The query joins all four tables, selects all rows from them and finally filters out the rows that have a NULL dsc_id.
We have more than 100,000 rows in the largest tables, that's why these joins are quite costly.
We can reduce scanning: just find a first matching row for each customer that has a NULL dsc_id and then go to another customer instead of scanning all other basketspositions and discounts for this one.
This may be achieved by using NOT EXISTS:
01.SELECT  cmr_id
02.FROM    t_customer
03.WHERE   NOT EXISTS
04.(
05.SELECT  1
06.FROM    t_basket, t_position, t_discount
07.WHERE   bsk_customer = cmr_id
08.AND pos_basket = bsk_id
09.AND dsc_position = pos_id
10.)
cmr_id
1
2
3
3 rows fetched in 0,0012s (0,1618s)
This is much faster.
Now let's change the rules a little. We will consider an order complete only if it has at least two discounts for each position. In our sample data, customer 4 has exactly 1 discount for each position.
This becomes complicated now. We cannot use NOT EXISTS anymore, as it will fail if exactly one discount exists, which is not what we need.
We can of course filter out the positions having less than two discounts using the following query:
01.SELECT  cmr_id
02.FROM    t_customer
03.WHERE   NOT EXISTS
04.(
05.SELECT  1
06.FROM    t_basket
07.LEFT OUTER JOIN
08.t_position
09.ON      pos_basket = bsk_id
10.LEFT OUTER JOIN
11.t_discount
12.ON      dsc_position = pos_id
13.WHERE   bsk_customer = cmr_id
14.GROUP BY
15.bsk_id, pos_id
16.HAVING  COUNT(*) > 1
17.)
cmr_id
1
2
3
4
4 rows fetched in 0,0012s (5,6372s)
, which will give us the result we're after. But this query is quite slow.
How can we optimize it?
MySQL uses NESTED LOOPS to perform joins, with t_customer as a leading table. To calculate a COUNT(*), it first need to join all four tables, and then filter out the rows that have COUNT(*) >= 2.
But t_discount actually has all the information we need for filtering. We may first perform the filtering, and then join the other tables with the filtered results. Thus we can get rid of the unneeded joins.
So we'll make our query perform the following steps:
  1. We find all customers that don't have discounts at all. These customers don't have any records in the t_discount, but, fortunately, we already know how to filter them out fast
  2. We find all customers that have positions with exactly one discount. To do this, we select all such positions from t_discount by using COUNT(*), and perform the joins with the other tables. Joins will be performed after fitering.
  3. Finally, we UNION the results of the queries above
Let's see how it works:
01.SELECT  cmr_id
02.FROM    t_customer
03.WHERE   NOT EXISTS
04.(
05.SELECT  1
06.FROM    t_basket, t_position, t_discount
07.WHERE   bsk_customer = cmr_id
08.AND pos_basket = bsk_id
09.AND dsc_position = pos_id
10.)
11.UNION
12.SELECT  DISTINCT bsk_customer
13.FROM    (
14.SELECT  dsc_position
15.FROM    t_discount
16.GROUP BY
17.dsc_position
18.HAVING COUNT(*) < 2
19.) dd, t_position, t_basket
20.WHERE   pos_id = dsc_position
21.AND bsk_id = pos_basket
cmr_id
1
2
3
4
4 rows fetched in 0,0012s (0,7321s)
Much faster.

0 comments:

Post a Comment