Tuesday 4 September 2018

Optimization of the MySQL query with SUM, the date range and Group By

I have following tables:

CREATE TABLE IF NOT EXISTS stats (
    date date NOT NULL DEFAULT '0000-00-00',
    cid int(8) NOT NULL DEFAULT '0',
    v bigint(15) NOT NULL DEFAULT '0',
    c bigint(15) NOT NULL DEFAULT '0',
    a bigint(15) NOT NULL DEFAULT '0',
PRIMARY KEY (date,cid),
KEY date (date),
KEY cid (cid),
KEY date_cid_vca (date,cid,v,c,a)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

this table has 30,842,712 rows
and
CREATE TABLE IF NOT EXISTS camp (
id int(8) NOT NULL AUTO_INCREMENT,
name varchar(80) NOT NULL DEFAULT '',
PRIMARY KEY (id,name)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

this table has 1985 rows
I have the following query:
SELECT
    c.id,
    c.name,
    SUM(s.v) AS sumv,
    SUM(s.c) AS sumc,
    GREATEST(((SUM(s.c)/SUM(s.v))*100.00), 0.00) AS cratio,
    SUM(s.a) AS suma,
    GREATEST(((SUM(s.a)/SUM(s.c))*100.00), 0.00) AS aratio
FROM
    stats s, camp c
WHERE
    s.date >= '2012-02-01' AND
    s.date <= '2012-02-29' AND
    c.id=s.cid
GROUP BY s.cid;

EXPLAIN shows:
+----+-------------+-------+-------+-------------------------------+--------------+---------+---------------------+---------+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys                 | key          | key_len | ref                 | rows    | Extra                                                     |
+----+-------------+-------+-------+-------------------------------+--------------+---------+---------------------+---------+-----------------------------------------------------------+
|  1 | SIMPLE      | s     | range | PRIMARY,date,cid,date_cid_vca | date_cid_vca | 3       | NULL                | 1010265 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | c     | ref   | PRIMARY                       | PRIMARY      | 4       | db.s.cid            |       1 | Using index                                               |
+----+-------------+-------+-------+-------------------------------+--------------+---------+---------------------+---------+-----------------------------------------------------------+

Problem is that the query takes around 50 seconds to complete even though it uses indexes. Is there any other way to optimize the query?
Thanks!

You've done well in optimizing this query with your indexes. I suppose you really do have over 1 million rows in stats in that date range. Unfortunately, joining (and then grouping) 1 million rows, even with a covering index, is a lot to ask from a database. For better performance, you will need to beef up the hardware, start denormalizing (put camp inside of stats to avoid the join), or keep running totals for each camp instead of calculating it on the fly.
Edit
Since removing the 1 million+ joins seemed to have made a large impact, you can try something like this:
SELECT c.*, a.* FROM
(SELECT
    SUM(s.v) AS sumv,
    SUM(s.c) AS sumc,
    GREATEST(((SUM(s.c)/SUM(s.v))*100.00), 0.00) AS cratio,
    SUM(s.a) AS suma,
    GREATEST(((SUM(s.a)/SUM(s.c))*100.00), 0.00) AS aratio,
    s.cid
FROM
    stats s
WHERE
    s.date >= '2012-02-01'
   AND s.date <= '2012-02-29'
GROUP BY s.cid) a
JOIN
  camp c
  ON c.id = a.cid

This query does the join on the smaller result set.

0 comments:

Post a Comment