Exploring ranking techniques in MySQL
While trying to wrangle a MySQL query to rank a set of results, I got sucked into a few Stack Overflow threads outlining various approaches.
In order to understand the pros and cons of each technique, I created some test data and reimplemented the solutions, all of which I’ve shared below.
If there are any other methods worth including here, please drop a note in the comments.
We’ll be working with this table:
1
2
3
4
5
6
7
8
9
10
11
| DROP TABLE IF EXISTS users; CREATE TABLE users (user_id INT PRIMARY KEY AUTO_INCREMENT, name VARCHAR(20), start_date DATE, team_id INT); INSERT INTO users (name, start_date, team_id) VALUES ('Matt', '2017-01-01', 1); INSERT INTO users (name, start_date, team_id) VALUES ('John', '2017-01-02', 2); INSERT INTO users (name, start_date, team_id) VALUES ('Sara', '2017-01-02', 2); INSERT INTO users (name, start_date, team_id) VALUES ('Tim', '2017-01-02', 3); INSERT INTO users (name, start_date, team_id) VALUES ('Bob', '2017-01-03', 3); INSERT INTO users (name, start_date, team_id) VALUES ('Bill', '2017-01-04', 3); INSERT INTO users (name, start_date, team_id) VALUES ('Kathy', '2017-01-04', 3); INSERT INTO users (name, start_date, team_id) VALUES ('Anne', '2017-01-05', 3); |
1
2
3
4
5
6
7
8
9
10
11
12
| +---------+-------+------------+---------+ | user_id | name | start_date | team_id | +---------+-------+------------+---------+ | 1 | Matt | 2017-01-01 | 1 | | 2 | John | 2017-01-02 | 2 | | 3 | Sara | 2017-01-02 | 2 | | 4 | Tim | 2017-01-02 | 3 | | 5 | Bob | 2017-01-03 | 3 | | 6 | Bill | 2017-01-04 | 3 | | 7 | Kathy | 2017-01-04 | 3 | | 8 | Anne | 2017-01-05 | 3 | +---------+-------+------------+---------+ |
Ranked by start date
1
2
3
| SELECT * FROM users ORDER by start_date ASC |
Ranked by start date with ties broken by user id
1
2
3
| SELECT * FROM users ORDER by start_date ASC, user_id ASC |
First employee by start date with ties broken by user id
1
2
3
4
| SELECT * FROM users ORDER by start_date ASC, user_id ASC LIMIT 1 |
First employee by start date with ties
1
2
3
| SELECT * FROM users WHERE start_date = (SELECT MIN(start_date) FROM users); |
Second employee by start date with ties broken by user id
1
2
3
4
5
| SELECT * FROM users ORDER by start_date ASC, user_id ASC LIMIT 1 OFFSET 1 |
Second employee by start date with ties
1
2
3
4
5
6
7
8
9
| SELECT * FROM users WHERE start_date = ( SELECT DISTINCT start_date FROM users ORDER BY start_date ASC LIMIT 1 OFFSET 1 ) |
Ranked by start date using variable
1
2
3
4
5
6
| SET @rank := 0; SELECT *, @rank := @rank + 1 AS rank FROM users ORDER BY start_date ASC |
1
2
3
4
5
6
7
8
9
10
11
12
| +---------+-------+------------+---------+------+ | user_id | name | start_date | team_id | rank | +---------+-------+------------+---------+------+ | 1 | Matt | 2017-01-01 | 1 | 1 | | 2 | John | 2017-01-02 | 2 | 2 | | 3 | Sara | 2017-01-02 | 2 | 3 | | 4 | Tim | 2017-01-02 | 3 | 4 | | 5 | Bob | 2017-01-03 | 3 | 5 | | 6 | Bill | 2017-01-04 | 3 | 6 | | 7 | Kathy | 2017-01-04 | 3 | 7 | | 8 | Anne | 2017-01-05 | 3 | 8 | +---------+-------+------------+---------+------+ |
Ranked by start date using a variable
Based on this Stack Overflow comment:
1
2
3
4
5
| SELECT *, @rank := @rank + 1 AS rank FROM users, (SELECT @rank := 0) r ORDER BY start_date ASC |
1
2
3
4
5
6
7
8
9
10
11
12
| +---------+-------+------------+---------+------------+------+ | user_id | name | start_date | team_id | @rank := 0 | rank | +---------+-------+------------+---------+------------+------+ | 1 | Matt | 2017-01-01 | 1 | 0 | 1 | | 2 | John | 2017-01-02 | 2 | 0 | 2 | | 3 | Sara | 2017-01-02 | 2 | 0 | 3 | | 4 | Tim | 2017-01-02 | 3 | 0 | 4 | | 5 | Bob | 2017-01-03 | 3 | 0 | 5 | | 6 | Bill | 2017-01-04 | 3 | 0 | 6 | | 7 | Kathy | 2017-01-04 | 3 | 0 | 7 | | 8 | Anne | 2017-01-05 | 3 | 0 | 8 | +---------+-------+------------+---------+------------+------+ |
First employee by start date using by setting a variable
1
2
3
4
5
6
7
8
9
10
| SET @rank := 0; SELECT * FROM ( SELECT *, @rank := @rank + 1 AS rank FROM users ORDER BY start_date ASC ) ranked WHERE rank = 1 |
Ranked by start date with ties
Based on this Stack Overflow comment:
1
2
3
4
5
6
7
8
9
10
11
| SET @prev_start_date = NULL; SET @rank := 0; SELECT *, CASE WHEN @prev_start_date = start_date THEN @rank -- Note that the assignment here will always be true WHEN @prev_start_date := start_date THEN @rank := @rank + 1 END AS rank FROM users ORDER BY start_date ASC |
1
2
3
4
5
6
7
8
9
10
11
12
| +---------+-------+------------+---------+------+ | user_id | name | start_date | team_id | rank | +---------+-------+------------+---------+------+ | 1 | Matt | 2017-01-01 | 1 | 1 | | 2 | John | 2017-01-02 | 2 | 2 | | 3 | Sara | 2017-01-02 | 2 | 2 | | 4 | Tim | 2017-01-02 | 3 | 2 | | 5 | Bob | 2017-01-03 | 3 | 3 | | 6 | Bill | 2017-01-04 | 3 | 4 | | 7 | Kathy | 2017-01-04 | 3 | 4 | | 8 | Anne | 2017-01-05 | 3 | 5 | +---------+-------+------------+---------+------+ |
Ranked by user id within each team
1
2
3
4
5
6
7
| SELECT a.*, COUNT(*) AS rank FROM users a INNER JOIN users b ON a.team_id = b.team_id AND a.user_id >= b.user_id GROUP BY a.team_id, a.user_id |
or, based on this Stack Overflow comment:
1
2
3
4
5
6
7
8
| SELECT a.*, ( SELECT COUNT(*) FROM users b WHERE a.team_id = b.team_id AND a.user_id >= b.user_id ) AS ranked FROM users a |
1
2
3
4
5
6
7
8
9
10
11
12
| +---------+-------+------------+---------+--------+ | user_id | name | start_date | team_id | ranked | +---------+-------+------------+---------+--------+ | 1 | Matt | 2017-01-01 | 1 | 1 | | 2 | John | 2017-01-02 | 2 | 1 | | 3 | Sara | 2017-01-02 | 2 | 2 | | 4 | Tim | 2017-01-02 | 3 | 1 | | 5 | Bob | 2017-01-03 | 3 | 2 | | 6 | Bill | 2017-01-04 | 3 | 3 | | 7 | Kathy | 2017-01-04 | 3 | 4 | | 8 | Anne | 2017-01-05 | 3 | 5 | +---------+-------+------------+---------+--------+ |
Note that both of these techniques require that there be a column without duplicates that we can rank on within the partition. For example, we can’t use
start_date
due to the duplicates within team 2 (2017-01-02
) and team 3 (2017-01-14
):
1
2
3
4
5
6
7
| SELECT a.*, COUNT(*) AS rank FROM users a INNER JOIN users b ON a.team_id = b.team_id AND a.start_date >= b.start_date GROUP BY a.team_id, a.user_id |
1
2
3
4
5
6
7
8
9
10
11
12
| +---------+-------+------------+---------+------+ | user_id | name | start_date | team_id | rank | +---------+-------+------------+---------+------+ | 1 | Matt | 2017-01-01 | 1 | 1 | | 2 | John | 2017-01-02 | 2 | 2 | | 3 | Sara | 2017-01-02 | 2 | 2 | | 4 | Tim | 2017-01-02 | 3 | 1 | | 5 | Bob | 2017-01-03 | 3 | 2 | | 6 | Bill | 2017-01-04 | 3 | 4 | | 7 | Kathy | 2017-01-04 | 3 | 4 | | 8 | Anne | 2017-01-05 | 3 | 5 | +---------+-------+------------+---------+------+ |
Return the last person to join within each team based on user id
Based on this Stack Overflow comment:
1
2
3
4
5
| SELECT a.* FROM users a LEFT JOIN users b ON a.team_id = b.team_id AND a.user_id < b.user_id WHERE b.team_id IS NULL |
or
1
2
3
4
5
6
7
| SELECT a.* FROM users a WHERE user_id IN ( SELECT MAX(user_id) FROM users GROUP BY team_id ) |
1
2
3
4
5
6
7
| +---------+------+------------+---------+ | user_id | name | start_date | team_id | +---------+------+------------+---------+ | 1 | Matt | 2017-01-01 | 1 | | 3 | Sara | 2017-01-02 | 2 | | 8 | Anne | 2017-01-05 | 3 | +---------+------+------------+---------+ |
Return the last people to join within each team based on start date
1
2
3
4
5
| SELECT a.* FROM users a LEFT JOIN users b ON a.team_id = b.team_id AND a.start_date < b.start_date WHERE b.team_id IS NULL |
or, based on this groupwise max post:
1
2
3
4
5
6
7
8
| SELECT a.* FROM users a INNER JOIN ( SELECT team_id, MAX(start_date) AS max_start_date FROM users b GROUP BY team_id ) max_start_dates ON a.team_id = max_start_dates.team_id AND a.start_date = max_start_dates.max_start_date |
1
2
3
4
5
6
7
8
| +---------+------+------------+---------+ | user_id | name | start_date | team_id | +---------+------+------------+---------+ | 1 | Matt | 2017-01-01 | 1 | | 2 | John | 2017-01-02 | 2 | | 3 | Sara | 2017-01-02 | 2 | | 8 | Anne | 2017-01-05 | 3 | +---------+------+------------+---------+ |
Ranked with gaps
Based on this Stack Overflow comment:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
| SELECT user_id, name, start_date, team_id, rank FROM ( SELECT *, IF(start_date = @_last_start_date, @cur_rank := @cur_rank, @cur_rank := @_sequence) AS rank, @_sequence := @_sequence + 1, @_last_start_date := start_date FROM users, (SELECT @cur_rank := 1, @_sequence := 1, @_last_start_date := NULL) r ORDER BY start_date ) ranked |
Notice that after the three tied for second earliest start date, the next one jumps to 5 (not 3):
1
2
3
4
5
6
7
8
9
10
11
12
| +---------+-------+------------+---------+------+ | user_id | name | start_date | team_id | rank | +---------+-------+------------+---------+------+ | 1 | Matt | 2017-01-01 | 1 | 1 | | 2 | John | 2017-01-02 | 2 | 2 | | 3 | Sara | 2017-01-02 | 2 | 2 | | 4 | Tim | 2017-01-02 | 3 | 2 | | 5 | Bob | 2017-01-03 | 3 | 5 | | 6 | Bill | 2017-01-04 | 3 | 6 | | 7 | Kathy | 2017-01-04 | 3 | 6 | | 8 | Anne | 2017-01-05 | 3 | 8 | +---------+-------+------------+---------+------+ |
Technically this is known as the rank whereas the other examples, which didn’t include gaps, is the dense rank.
0 comments:
Post a Comment