Monday 16 July 2018

MySQL: ORDER BY RAND() – a case study of alternatives

MySQL: ORDER BY RAND() – a case study of alternatives


Many of you reading this post might already stopped by this problem – you just want a few random rows from your database and most of the times you might have used „ORDER BY RAND()“ and not thought about, what’s happening there. Just add all the columns you want to select, add JOINS, WHERE, etc. and add ORDER BY RAND() and LIMIT.
For small applications this might even scale well, but what happens in large applications with hundreds, thousands or millions of rows in a table?
I want to look a little closer into this. While developing my deprecated rhCMS (a content management system), I asked myself the same question.

Problem Analysis

First of all you got to ask „What happens with ORDER BY RAND() in a Query?“ and understand it.
The MySQL Database will query for all the rows you requested and filter out those that don’t match from WHERE, JOIN conditions etc. Then all rows will be in memory or in a temporary table. Then the RAND() will be replaced with a random number and attached to each row. After this, the set can be sorted ascending or descending. And then the LIMIT will be used. Using temporary tables is really bad because this causes a lot of problems and is not performant at all.

Solutions

Let’s take a look how we can solve this problem. First of all I want to show you a PHP solution:

A PHP solution

Query for the total number of rows:
1<?php echo $mysql_hl->highlight('SELECT COUNT(*) FROM table'); ?>
and then query for a random number between 1 and the number of rows:
1<?php echo $php_hl->highlight('<?php
2  $random_id = mt_rand(1, [ COUNT(*) - Ergebnis ]);
3?>'); ?>
Afterwards you can query for the ID:
1<?php echo $mysql_hl->highlight('SELECT colA, colB FROM table WHERE idCol = ...'); ?>
But as you probably already asked yourself:
  1. What happens when I delete rows?
  2. What if I want more than one row?
The first question seems rather easy, right? Just query as long as you don’t get a row. But this isn’t the best solution!
And how about multiple rows? The easiest way could be put another loop around it and query for random IDs the number of times you want. This seems to be acceptable but it still has some weaknesses. Let’s have a look at a MySQL based solution:

MySQL solution

The trick is to create a MySQL PROCEDURE that fills a temporary table with rows containing random IDs that are returned to the user. But then we would have the same problem: What if the ID didn’t exist or it exists twice in our temporary table?
The final solution looks like this:
  1. Create a temporary table to save random IDs and make the ID a UNIQUE KEY.
  2. Query the maximum ID
  3. Put the IDs into the table
  4. Return the dataset
You might ask: What happens if I insert an ID twice? The solution is simple: The error handler will recursively call our procedure. This way the UNIQUE KEY will make sure no ID is inserted twice and another ID is selected in this case.
And this is what the code looks like:
1<?php
2 
3echo $mysql_hl->highlight("DROP PROCEDURE IF EXISTS proc_get_random_post_id;
4 
5DELIMITER //
6 
7CREATE PROCEDURE proc_get_random_post_id (IN cnt INT)
8BEGIN
9  DECLARE max_id INTEGER;
10 
11  DECLARE EXIT HANDLER FOR SQLEXCEPTION
12  BEGIN
13    CALL proc_get_random_post_id(cnt);
14  END;
15 
16  IF cnt >= 1 THEN
17    DROP TEMPORARY TABLE IF EXISTS random;
18    CREATE TEMPORARY TABLE random ( post_id INTEGER(11) unsigned, UNIQUE KEY(post_id) );
19 
20    insert_loop : LOOP
21      IF cnt < 1 THEN
22        LEAVE insert_loop;
23      END IF;
24 
25      SELECT MAX(post_id) FROM posts INTO max_id;
26 
27      INSERT INTO random
28        SELECT
29         p.post_id
30        FROM
31          posts AS p
32        JOIN
33          (SELECT (RAND() *  max_id) AS random_id) AS r
34        WHERE
35          p.post_id >= r.random_id
36        ORDER BY
37          p.post_id
38        LIMIT
39          1;
40 
41      SET cnt = cnt - 1;
42    END LOOP insert_loop;
43 
44    SELECT post_id FROM random;
45  END IF;
46END//
47 
48DELIMITER ;");
49?>
Looks pretty much to get a few rows, huh? But is this really better than a PHP based solution? Let’s have a look at some benchmarks:
A problem could be the maximum recursion depth that is 0 by default. So make sure you increase it:
1<?php echo $mysql_hl->highlight('SET max_sp_resursion_depth = 2;'); ?>
Note: „2“ means the number of collisions here. So make sure to increase this value for larger tables.
With a  few rows (~ 25000) the procedure wasn’t that much larger but with ~230.000 rows the procedure was almost 10 times faster! And then I queried a forum with ~4500000 posts for a random post id. The results looks like this:
The first query with the following code took about 72 seconds:
1<?php echo $mysql_hl->highlight('SELECT post_id FROM posts ORDER BY RAND() LIMIT 20'); ?>
1mysql&gt; SELECT post_id FROM posts ORDER BY RAND() LIMIT 20;
2+---------+
3| post_id |
4+---------+
5| 2013034 |
6| 3579000 |
7| 3316421 |
8| 4093120 |
9| 2359496 |
10| 1885511 |
11| 3835027 |
12| 1046873 |
13| 1185652 |
14| 4331911 |
15| 4890819 |
16|  240006 |
17| 2437795 |
18| 1712680 |
19|  868551 |
20| 2096902 |
21| 3083704 |
22|  856132 |
23| 2190370 |
24| 4862958 |
25+---------+
2620 rows in set (52.66 sec)
The second one 35 seconds and the average was at about 25 seconds.

Note: When querying an ID with another column (e.g. a DATETIME field) the query only took about 5 seconds. And now let’s have a look into our procedure:

1
<?php echo $mysql_hl->highlight('CALL proc_get_random_post_id(20);'); ?>
1
mysql&gt; CALL proc_get_random_post_id(20);
2
  +---------+
3
  | post_id |
4
  +---------+
5
  |  354319 |
6
  |  436352 |
7
  |  762503 |
8
  |  924832 |
9
  | 1395933 |
10
  | 1515758 |
11
  | 1633515 |
12
  | 1731376 |
13
  | 1789180 |
14
  | 2512748 |
15
  | 2549002 |
16
  | 2854479 |
17
  | 2863860 |
18
  | 3103881 |
19
  | 3233293 |
20
  | 4337101 |
21
  | 4560893 |
22
  | 4596010 |
23
  | 4609112 |
24
  | 4960802 |
25
  +---------+
26
  20 rows in set (0.04 sec)
This small piece of code now calls our procedure in the background and only takes about 0,034 seconds!

This small example shows you how to quickly query for rows. And it is very easy to extend this to query for even more columns.

Another PHP solution
Someone asked me for a pure PHP solution and that’s what I came up with:

1
<?php echo $php_hl->highlight('<?php
2
  /**
3
   * Default Random ID Class
4
   *
5
   * @author Robert Hartung
6
   * @copyright www.roberthartung.de, 2014
7
   */
8

9
  class randomIdList
10
  {
11
    /**
12
     * MySQLI instance
13
     *
14
     * @var mysqli
15
     */
16

17
    protected $mysqli;
18

19
    /**
20
     * Maximum duplicate key errors
21
     *
22
     * @var integer
23
     */
24

25
    private $max_errors = 10;
26

27
    /**
28
     * Random ID Count
29
     *
30
     * @var integer
31
     */
32

33
    private $count;
34

35
    /**
36
     * Table to select from
37
     *
38
     * @var string
39
     */
40

41
    private $table;
42

43
    /**
44
     * Column that contains the ID
45
     *
46
     * @var string
47
     */
48

49
    private $column_id;
50

51
    /**
52
     * Constructor
53
     */
54

55
    public function __construct(mysqli $mysqli, $table, $column_id)
56
    {
57
      $this->mysqli = $mysqli;
58
      $this->table = $table;
59
      $this->column_id = $column_id;
60
    }
61

62
    /**
63
     * Returns a result set of random IDs
64
     *
65
     * @param boolean $return Query for the results?
66
     *
67
     * @return mysqli_result
68
     */
69

70
    public function getIDs($count, $return = true)
71
    {
72
      $this->count = $count;
73

74
      /* Do some preparing stuff */
75

76
      $this->mysqli->multi_query("DROP TEMPORARY TABLE IF EXISTS random;
77
      CREATE TEMPORARY TABLE random ( random_id INTEGER(11) unsigned, UNIQUE KEY(random_id) );
78
      SELECT @max_id := MAX(post_id) FROM pmcms_forum_posts");
79

80
      /* Fetch the results otherwise: Command out of sync error */
81

82
      while($this->mysqli->more_results())
83
      {
84
        $this->mysqli->next_result();
85
        $this->mysqli->use_result();
86
      }
87

88
      /* Prepare the statement so the query needs to be parsed only once! */
89

90
      $stmnt = $this->mysqli->prepare("INSERT INTO random
91
          SELECT
92
           ".$this->column_id."
93
          FROM
94
            ".$this->table." AS p
95
          JOIN
96
            (SELECT (RAND() * @max_id) AS random_id) AS r
97
          WHERE
98
            ".$this->column_id." >= r.random_id
99
          ORDER BY
100
            ".$this->column_id."
101
          LIMIT
102
            1");
103

104
      /* Get Random IDs */
105

106
      while($this->count > 0 && $this->max_errors > 0)
107
      {
108
         $stmnt->execute();
109

110
         /* if there was an error - go on */
111
         if($this->mysqli->errno === 1062)
112
         {
113
           $this->max_errors -= 1;
114
         }
115
         else
116
         {
117
           $this->count -= 1;
118
         }
119
      }
120

121
      /* If requested, return the result otherwise null*/
122

123
      if($return)
124
        return $this->mysqli->query("SELECT random_id FROM random");
125

126
      return null;
127
    }
128
  }
129
?>');
130
?>
The class expects a MySQLi instance as a parameter as well as the table name and your ID’s column name. Using the method getIDs($count, [ $return = true ]) you can ask for rows. The function uses the same concept as our MySQL procedure

Create a temporary table
Get maximum
Find random IDs
Return resulös
This object oriented solution has another benefit: You don’t have to create a procedure for every table in your database. You could easily extend this class to query for more than this column:

1
<?php echo $php_hl->highlight('<?php
2
  /**
3
   * Example Class for Posts
4
   *
5
   * @author Robert Hartung
6
   * @copyright www.roberthartung.de, 2014
7
   */
8

9
  final class randomPosts extends randomIdList
10
  {
11
    public function __construct(mysqli $mysqli)
12
    {
13
      parent::__construct($mysqli, \'forum_posts\', \'post_id\');
14
    }
15

16
    /**
17
     * Returns a different result set
18
     *
19
     * @return mysqli_result
20
     */
21

22
    public function getIDs($count)
23
    {
24
      parent::getIDs($count, false);
25

26
      return $this->mysqli->query("SELECT post_id FROM random JOIN forum_posts ON post_id = random_id");
27
    }
28

29
    public function getRandomPosts($count)
30
    {
31
      parent::getIDs($count, false);
32

33
      return $this->mysqli->query("SELECT post_id, post_datetime, post_topic_id FROM random JOIN forum_posts ON post_id = random_id");
34
    }
35
  }
36
?>');
37
?>
We could use this class like this:

1
<?php echo $php_hl->highlight('<?php
2
  /** Basic */
3

4
  $random = new randomIdList($mysqli, \'pmcms_forum_posts\', \'post_id\');
5
  $qry_random = $random->getIDs(5);
6

7
  while($number = $qry_random->fetch_assoc())
8
  {
9
    /* Do something */
10
  }
11

12
  /** Basic 2 */
13

14
  $qry_random = $random->getIDs(10);
15

16
  while($number = $qry_random->fetch_assoc())
17
  {
18
    /* Do something */
19
  }
20

21
  /** Inherited class */
22

23
  $random = new randomPosts($mysqli);
24
  $qry_random = $random->getIDs(5);
25

26
  while($number = $qry_random->fetch_assoc())
27
  {
28
    /* Do something */
29
  }
30

31
  $qry_random = $random->getRandomPosts(5);
32

33
  while($post = $qry_random->fetch_assoc())
34
  {
35
    /* Do something */
36
  }
37
?>');
38
?>
This should give you a good idea on how to avoid using MySQL’s ORDER BY RAND(). I appreciate any kind of feedback.

0 comments:

Post a Comment