How does MySQL's ORDER BY RAND() work? ~ Tech Blog

Thursday, 8 November 2018

I've been doing some research and testing on how to do fast random selection in MySQL. In the process I've faced some unexpected results and now I am not fully sure I know how ORDER BY RAND() really works.

I always thought that when you do ORDER BY RAND() on a table, MySQL adds a new column filled with random values, sorts the data by that column, and then you take e.g. the top row, which got there randomly. I've done lots of googling and testing and finally found that the query Jay offers in his blog is indeed the fastest solution:

SELECT * FROM Table T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID LIMIT 1;

While a plain ORDER BY RAND() takes 30-40 seconds on my test table, his query does the work in 0.1 seconds. He explains how it works in his blog, so I'll skip that and move on to the odd thing.

My table is an ordinary table with a PRIMARY KEY id and other non-indexed columns like username, age, etc. Here's the thing I am struggling to explain:

SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/
SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/
SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/

I was sort of expecting to see approximately the same time for all three queries, since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you have any ideas about this. I have a project where I need to do a fast ORDER BY RAND(), and personally I would prefer to use

SELECT id FROM table ORDER BY RAND() LIMIT 1;
SELECT * FROM table WHERE id=ID_FROM_PREVIOUS_QUERY LIMIT 1;

which, yes, is slower than Jay's method, but it is smaller and easier to understand. My queries are rather big, with several JOINs and a WHERE clause, and while Jay's method still works, the query grows really big and complex because I need to repeat all the JOINs and the WHERE clause in the JOINed subquery (called x in his query).
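For reference, the two statements above can also be folded into one by putting the random pick into a derived table. This is only a minimal sketch: the table name `users` and the aliases `t` and `r` are hypothetical stand-ins, not names from my schema.

```sql
-- Hypothetical table `users` with an integer PRIMARY KEY `id`.
-- The derived table `r` picks one random id; the outer query then
-- fetches the full row for that id through the primary key.
SELECT t.*
FROM users AS t
JOIN (
    SELECT id
    FROM users
    ORDER BY RAND()
    LIMIT 1
) AS r ON t.id = r.id;
```

Any JOINs and WHERE conditions that restrict the candidate rows would still have to go inside the derived table, so this saves a round trip rather than query text.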

Thanks for your time!

Answers

While there's no such thing as a "fast order by rand()", there is a workaround for your specific task.

To get a single random row, you can do it like this: https://thiscode4u.blogspot.com/2018/07/mysql-order-by-rand-case-study-of.html (I couldn't see a hotlink URL. If anyone sees one, feel free to edit the link.)

The text is in German, but the SQL code is a bit further down the page in big white boxes, so it's not hard to find.

Basically, he writes a procedure that does the job of getting a valid row: it generates a random number between 0 and max_id, tries to fetch the row with that id, and if it doesn't exist, keeps going until it hits one that does. He allows for fetching x random rows by storing them in a temp table, so you can probably rewrite the procedure to be a bit faster when fetching only one row.

The downside is that if you have deleted a lot of rows and there are huge gaps in the ids, it will probably miss many times before hitting an existing row, making it inefficient.
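The retry loop described above can be sketched as a stored procedure. This is not the code from the linked article, just a minimal single-row version under stated assumptions: the table name `users`, the procedure name `get_random_user`, and the variable names are all hypothetical, and ids are assumed to be positive integers. The `DELIMITER` lines are a mysql command-line client feature, not SQL.

```sql
-- Hypothetical table `users` with a positive integer PRIMARY KEY `id`.
DELIMITER //

CREATE PROCEDURE get_random_user()
BEGIN
    DECLARE max_id INT;
    DECLARE candidate INT;
    DECLARE found INT DEFAULT 0;

    SELECT MAX(id) INTO max_id FROM users;

    -- On an empty table MAX(id) is NULL, so the loop is skipped.
    WHILE found = 0 AND max_id IS NOT NULL DO
        -- RAND() is in [0, 1), so candidate falls in 1..max_id.
        SET candidate = FLOOR(1 + RAND() * max_id);
        -- Primary key lookup: 1 if the id exists, 0 if it hit a gap.
        SELECT COUNT(*) INTO found FROM users WHERE id = candidate;
    END WHILE;

    SELECT * FROM users WHERE id = candidate;
END //

DELIMITER ;

CALL get_random_user();
```

Each miss costs one primary key lookup, so the expected number of iterations grows with the share of deleted ids, which is the downside mentioned above.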

Update: Different execution times

SELECT * FROM table ORDER BY RAND() LIMIT 1; /*30-40 seconds*/

SELECT id FROM table ORDER BY RAND() LIMIT 1; /*0.25 seconds*/

SELECT id, username FROM table ORDER BY RAND() LIMIT 1; /*90 seconds*/

I was sort of expecting to see approximately the same time for all three queries since I am always sorting on a single column. But for some reason this didn't happen. Please let me know if you have any ideas about this.

It may have to do with indexing. id is indexed and quick to access, whereas adding username to the result means it needs to read that column from each row and put it in the memory table. With the * it also has to read everything into memory, but it doesn't need to jump around the data file, meaning there's no time lost seeking.

This makes a difference only if there are variable length columns (varchar/text), which means it has to check the length, then skip that length, as opposed to just skipping a set length (or 0) between each row.

I can tell you why SELECT id FROM ... is much faster than the other two, but I am not sure why SELECT id, username is 2-3 times slower than SELECT *.

When you have an index (the primary key in your case) and the result includes only columns from that index, the MySQL optimizer can use the data from the index alone and does not even look into the table itself. The more expensive each row is to read, the bigger the effect you will observe, since you replace filesystem I/O operations with pure in-memory operations. If you add an additional index on (id, username), you should see similar performance in the third case as well.
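As a rough sketch of that last suggestion, assuming a hypothetical table name `users` and index name `idx_id_username` (only the column names come from the question):

```sql
-- Hypothetical table `users`; the index covers both selected columns.
CREATE INDEX idx_id_username ON users (id, username);

-- Check whether the optimizer can answer the query from the index alone.
EXPLAIN SELECT id, username FROM users ORDER BY RAND() LIMIT 1;
```

If the optimizer chooses the new index, the Extra column of the EXPLAIN output should include "Using index", meaning the rows are read from the index without touching the table data. Whether it does so can depend on the storage engine and MySQL version.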
