One constant source of database/mySQL-related frustration is the need to fetch a random row from a given table. We randomize a lot, especially for anything recommendation-related, so that users don’t see the exact same set of recommendations over and over again. This worked fine when our tables were small and everything fit into memory easily, but now that our tables have 6+ million rows in them, the old standby ORDER BY RAND() LIMIT 1 is no good.
ORDER BY RAND() in and of itself isn’t the worst thing in the universe. It does use a temp table to do all that sorting, which is less than ideal to begin with, but the real problem is that, by the very nature of ORDER and LIMIT, mySQL can’t apply the LIMIT 1 until it has ordered your entire result set. Good luck to you if that result set contains thousands of rows. That’s why we’ve taken to calling it ORDER BY SLOW(). You really have to think about how to apply ORDER BY RAND() if you want it to be functional and fast on massive data sets.
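For reference, here’s the pattern in question against a hypothetical foo table. EXPLAIN should show “Using temporary; Using filesort” in the Extra column, which is exactly the assign-every-row-a-random-value-then-sort-the-whole-thing work described above:

-- every row gets a RAND() value, the entire set gets sorted, and only then does LIMIT 1 apply
SELECT * FROM foo ORDER BY RAND() LIMIT 1;
EXPLAIN SELECT * FROM foo ORDER BY RAND() LIMIT 1;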
Now, a better solution would be:
SELECT count(*) FROM foo
and save the result as num_rows. Then
SELECT * FROM foo LIMIT [random number between 0 and num_rows - 1],1
You just selected a random row in two quick queries! Congrats!
There’s nothing really wrong with that approach, but it involves an unnecessary extra round-trip to the DB server and just feels inelegant. Also, don’t think you can avoid coming back into PHP between the two queries; mySQL will not allow you to use a mySQL variable (or a sub-select clause, or a function call) as a LIMIT. In PostgreSQL this would be relatively simple: SELECT * FROM table OFFSET RANDOM() LIMIT 1; (obviously you would need slightly more complex logic to make sure RANDOM() returns a legitimate value within the range of allowable OFFSETs, but all of that can be done inside of SQL)
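As a sketch of that parenthetical (assuming a PostgreSQL table named foo), the random offset can be clamped to a legitimate value entirely inside the query:

SELECT * FROM foo
OFFSET floor(random() * (SELECT count(*) FROM foo))::int
LIMIT 1;

-- random() returns a value in [0, 1), so the computed offset always lands in [0, count - 1]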
SELECT * FROM `table` WHERE id >= (SELECT FLOOR( MAX(id) * RAND()) FROM `table` ) ORDER BY id LIMIT 1;
Unfortunately, because RAND() is non-deterministic, mySQL executes the inner select again for every single row comparison, so this is at least as slow as the original.
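You can see this in the EXPLAIN output: the inner select should show up with a select_type of UNCACHEABLE SUBQUERY (exact output varies by version), meaning its result can’t be cached and must be re-evaluated per outer row:

EXPLAIN SELECT * FROM `table`
WHERE id >= (SELECT FLOOR(MAX(id) * RAND()) FROM `table`)
ORDER BY id LIMIT 1;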
SELECT * FROM Table T JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID ORDER BY T.ID LIMIT 1;
By joining on a nested (derived) select in this way, the inner select statement is executed just once. There are potential complications with this solution as well: if your IDs are non-sequential (e.g. some rows get deleted, or the ID is not an auto_increment field), rows that immediately follow gaps will be picked more often than their neighbors, so the result is not truly random. Even so, you might decide it’s better to lose true randomness in order to get your query to finish running in 0.02 seconds instead of 4 minutes.
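One small tweak worth sketching (it doesn’t fix the gap bias, just the range): seed the random ID from MIN(ID) as well as MAX(ID), so that tables whose IDs don’t start at 1 don’t pile all the low random values onto the first row:

SELECT T.*
FROM Table T
JOIN (SELECT FLOOR(MIN(ID) + (MAX(ID) - MIN(ID) + 1) * RAND()) AS ID FROM Table) AS x
ON T.ID >= x.ID
ORDER BY T.ID
LIMIT 1;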
How do you make that work if you want more than one row? I currently don’t know of a clean way to do it, although I’m sure it’s possible. One trick would be to find a way to make the inner query return Y rows, where Y is the number of rows you want the entire query to return; I haven’t put enough thought into how to accomplish that to have a solution yet, but one crude alternative is sketched below.
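The sketch (an idea, not a tested solution): run the single-row trick Y times with independent random thresholds and UNION the results. For Y = 3 it would look like the following; note that UNION deduplicates, so when two thresholds land near each other you can get back fewer than Y rows and would need to re-run to top up:

(SELECT T.* FROM Table T
JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID
ORDER BY T.ID LIMIT 1)
UNION
(SELECT T.* FROM Table T
JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID
ORDER BY T.ID LIMIT 1)
UNION
(SELECT T.* FROM Table T
JOIN (SELECT CEIL(MAX(ID)*RAND()) AS ID FROM Table) AS x ON T.ID >= x.ID
ORDER BY T.ID LIMIT 1);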
Another thing we are going to try out is seeding every row with a random value, doing our selects against those values, and periodically assigning new random values. My understanding is that this is how Wikipedia does it. I might also experiment with having a stored procedure build the SQL, setting the LIMIT dynamically before executing the query, thereby preventing mySQL from complaining about using a variable for LIMIT.
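Here’s a rough sketch of both ideas (the column, index, and statement names are my own placeholders). First the seeded column: add an indexed random column, reshuffle it periodically, and select against a threshold stashed in a variable first, since a bare RAND() in the WHERE clause would itself be re-evaluated per row:

ALTER TABLE foo
ADD COLUMN rand_seed DOUBLE NOT NULL DEFAULT 0,
ADD INDEX idx_rand_seed (rand_seed);

UPDATE foo SET rand_seed = RAND(); -- re-run periodically (e.g. from cron) to reshuffle

SET @r := RAND();
SELECT * FROM foo WHERE rand_seed >= @r ORDER BY rand_seed LIMIT 1;
-- if @r lands above the largest seed this returns nothing; wrap around or retry in that case

And the dynamic-LIMIT idea: PREPARE/EXECUTE accepts a placeholder in LIMIT (as of mySQL 5.0.7), which sidesteps the no-variables-in-LIMIT restriction, whether inside a stored procedure or not:

SELECT COUNT(*) INTO @n FROM foo;
SET @offset := CAST(FLOOR(RAND() * @n) AS UNSIGNED);
PREPARE rand_row FROM 'SELECT * FROM foo LIMIT ?, 1';
EXECUTE rand_row USING @offset;
DEALLOCATE PREPARE rand_row;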