Tuesday 6 November 2018

What's faster, SELECT DISTINCT or GROUP BY in MySQL?

If I have a table
CREATE TABLE users (
  id int(10) unsigned NOT NULL auto_increment,
  name varchar(255) NOT NULL,
  profession varchar(255) NOT NULL,
  employer varchar(255) NOT NULL,
  PRIMARY KEY  (id)
)
and I want to get all unique values of profession field, what would be faster (or recommended):
SELECT DISTINCT u.profession FROM users u
or
SELECT u.profession FROM users u GROUP BY u.profession
?

 Answers


They are essentially equivalent to each other (in fact this is how some databases implement DISTINCT under the hood).
If one of them is faster, it's going to be DISTINCT. This is because, although the two are the same, a query optimizer would have to catch the fact that your GROUP BY is not taking advantage of any group members, just their keys. DISTINCT makes this explicit, so you can get away with a slightly dumber optimizer.
When in doubt, test!



Go for the simplest and shortest if you can -- DISTINCT seems to be more what you are looking for only because it will give you EXACTLY the answer you need and only that!




It seems that the queries are not exactly the same. At least for MySQL.
Compare:
  1. describe select distinct productname from northwind.products
  2. describe select productname from northwind.products group by productname
The second query gives additionally "Using filesort" in Extra.



(more of a functional note)
There are cases when you have to use GROUP BY, for example if you wanted to get the number of employees per employer:
SELECT u.employer, COUNT(u.id) AS "total employees" FROM users u GROUP BY u.employer
In such a scenario DISTINCT u.employer doesn't work right. Perhaps there is a way, but I just do not know it. (If someone knows how to make such a query with DISTINCT please add a note!)



After heavy testing we came to the conclusion that GROUP BY is faster
SELECT sql_no_cache opnamegroep_intern FROM telwerken WHERE opnemergroep IN (7,8,9,10,11,12,13) group by opnamegroep_intern
635 totaal 0.0944 seconds Weergave van records 0 - 29 ( 635 totaal, query duurde 0.0484 sec)
SELECT sql_no_cache distinct (opnamegroep_intern) FROM telwerken WHERE opnemergroep IN (7,8,9,10,11,12,13)
635 totaal 0.2117 seconds ( almost 100% slower ) Weergave van records 0 - 29 ( 635 totaal, query duurde 0.3468 sec)



Here is a simple approach which will print the 2 different elapsed time for each query.
DECLARE @t1 DATETIME;
DECLARE @t2 DATETIME;

SET @t1 = GETDATE();
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);

SET @t1 = GETDATE();
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET @t2 = GETDATE();
PRINT 'Elapsed time (ms): ' + CAST(DATEDIFF(millisecond, @t1, @t2) AS varchar);
SET STATISTICS TIME ON;
SELECT DISTINCT u.profession FROM users u; --Query with DISTINCT
SELECT u.profession FROM users u GROUP BY u.profession; --Query with GROUP BY
SET STATISTICS TIME OFF;
It simply displays the number of milliseconds required to parse, compile, and execute each statement as below:
 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 2 ms.



If the problem allows it, try with EXISTS, since it's optimized to end as soon as a result is found (And don't buffer any response), so, if you are just trying to normalize data for a WHERE clause like this
SELECT FROM SOMETHING S WHERE S.ID IN ( SELECT DISTINCT DCR.SOMETHING_ID FROM DIFF_CARDINALITY_RELATIONSHIP DCR ) -- to keep same cardinality
A faster response would be:
SELECT FROM SOMETHING S WHERE EXISTS ( SELECT 1 FROM DIFF_CARDINALITY_RELATIONSHIP DCR WHERE DCR.SOMETHING_ID = S.ID )
This isn't always possible but when available you will see a faster response.

0 comments:

Post a Comment