Tuesday 4 September 2018

How to optimize the MySQL query with many unions

I have this query which I need help with. So there is a table called insertjobticketwith column called DEL which is a long character field which can have multiple dates in it. I need to create an output table which contains one row for each time there is a date in the DEL field for a certain range of dates.

The reason I can't just do a more simple select ... where ... DEL like "%my_date%" is that the DEL column can contain multiple dates, and if so, I need to return multiple rows to the output set, one row for each date that appears in the DEL column.
The solution I came up with that works, but is very slow looks like this:
create temporary table jobtrack.ship_helpert3 as
select * from
(
    (
    select
        date_format(now() - interval 3 day, '%m/%d/%Y') as `Ship_Date`,
        more_columns
    from
        jobticket.insertjobticket
    where
        DEL like concat('%',date_format(now() - interval 3 day, '%m/%d/%Y'),'%')
    ) union (
    select
        date_format(now() + interval 2 day, '%m/%d/%Y') as `Ship_Date`,
        more_columns
    from
        jobticket.insertjobticket
    where
        DEL like concat('%',date_format(now() + interval 2 day, '%m/%d/%Y'),'%')
    ) union (
    select
        date_format(now() + interval 1 day, '%m/%d/%Y') as `Ship_Date`,
        more_columns
    from
        jobticket.insertjobticket
    where
        DEL like concat('%',date_format(now() + interval 1 day, '%m/%d/%Y'),'%')
    ) union ...
) t;

Each select query checks if there are any rows with a certain date string (date_format(now() + interval @x day, '%m/%d/%Y')) in the DEL field. The query is built programmatically and can get very long, as I would like to be able to make the query check for many many dates.
The insertjobticket table contains 40K rows and is growing, so the query above takes way too long to complete. I understand why it takes so long, because every unioneffectively has to make its own sub-query that scans the whole table again and again for each date. I just don't know how to make this work more efficiently.
Does anyone know how to speed up this query?
Thanks for the help and let me know if we need more clarification.

As already stretched in the comments, the only correct solution would be to normalize your data, that means to create a new table with one delivery date and the primary key of insertjobticket per row, and let the application use this table directly instead of the column del, or at least indirectly by a trigger that updates this table everytime the column DEL is updated.
Since you cannot do that, the following workaround should improve your query:
select
  del_dates.Ship_Date,
  othercolumns
from insertjobticket
join (
    select concat(date_format(now() + interval 2 day, '%m/%d/%Y'))
           collate utf8_general_ci as Ship_Date
    union select concat(date_format(now() + interval 1 day, '%m/%d/%Y'))
    union select concat(date_format(now() + interval -15 day, '%m/%d/%Y'))
    ...
) del_dates
on insertjobticket.del like concat('%', del_dates.Ship_Date, '%');

(Change the collation to the one you use in your table or leave it away to see which one, if any, you need).
This will basically do the required normalization step (for the requested dates) every time you execute the query, and will not be able to use indexes. Just make sure your explain output shows using join buffer for the derived table, not for insertjobticket, otherwise replace join with a straight_join.
For 40k rows, this might not be a big a problem, and there is no other way around it anyway, except real normalization. Keep in mind that your query will slow down linearly with the amount of rows (400k rows will take about 10 times the time as 40k), an effect indexes would prevent. So if it is too slow now (or sometimes in the future), you eventually have to normalize (or, as a workaround to the problems created by this workaround, add a column to mark old entries and exclude them in your join condition).
Btw, since you generate your code programmatically, it shouldn't be a problem to create the list of dates, otherwise you can use another subquery to generate a list of general dates and just select the ones in a specific range.

0 comments:

Post a Comment