It's possible to order data randomly with MySQL using ORDER BY RAND() or doing a count on the table and then using LIMIT to choose a random offset. However neither of these are suitable for dealing with large tables with many rows. In this post I present an alternative to these but it does require modifications to the table and additional business logic and is still not perfect.
Add a "random value" column
This solution requires adding an additional column on the table (and in turn additional business logic) so you should really only use this solution if you need to get random data frequently in a timely manner, such as on publicly accessible pages on a website.
So, if you are sure you really do frequently need some data ordered randomly from a large table, add this column:
Unfortunately it's not possible to make the default value for a column be RAND() , so the values must be initialised separately and business logic added every time an INSERT is done:
Whenever a new record is added to the table make sure the RandomValue column is set:
Selecting a record at random
This needs to be done in a combination of SQL and scripting language. In this example I will use PHP because that's what I code with. $pdo is an already established connection to the MySQL database using PHP's PDO Data Object.
Lines 4 to 6 get a random number from the database. I use a SQL query to get this from the database to ensure the type and number is compatible with what's stored in the RandomValue field. I could not achieve this using PHP's rand() or mt_rand() functions.
Lines 8 and 9 then fetch the record from the database by finding the record whose RandomValue is greater than the random number selected.
Because $rand could be less than MIN(RandomValue) both queries are wrapped within a while loop which keeps going until a row is found. I have seen similar solutions to mine which then look for the RandomValue < $rand but this weights the lowest RandomValue so it will be selected more frequently than the other numbers.
Why not just do "SELECT * FROM test WHERE RandomValue > RAND() ORDER BY RandomValue LIMIT 1" instead of doing the first query to get a random number?
I did try doing that but when I tested it by running the query several thousand times against a table with 9 equally spaced random values from 0.1 to 0.9, but for some reason the lower numbers were selected more frequently and 0.8 and 0.9 almost not at all.
When I changed the query to select a number first and then code that into the query string on the second query I got an almost even distribution each time. Note that this was of course on an ideally spaced RandomValue table, created entirely for testing the even-ness of distribution.
Pros and Cons
The pros for using this method is that it's almost instant to get the record from the database because the random value is indexed. Compare this with ORDER BY RAND() and using LIMIT offsets on larger tables which can get very slow.
The cons on the other hand make this not the most suitable solution, but it does work and is quick.
1. There needs to be an additional column in the database and business logic needs to be added so that every time a new record is inserted it sets a random value for this field. (Noting again that the default value for the column cannot be rand()).
2. It is potentially possible that two or more records could have the same random value resulting in only one of them ever being selected at "random".
3. If the number of records is not particularly large, and/or the random spread of numbers is not particuarly even, then some records are more likely to be selected than others.
4. It is potentially possible (although unlikely) that the loop will get stuck infinitely if the random numbers selected are always less than the lowest number stored in the database.
Conclusion
Fetching random records from a MySQL database is tricky. The easy and obvious solutions work fine until the table starts to have a decent number of records and then the queries take longer and longer to run as the tables get bigger.
I have seen many "solutions" while researching this which easily fail as a proper random solution; the most common of which (other than those I've already presented in earlier posts) is to work out the minimum and maximum values of a primary key, select a random number based on those and either select using = or >=. This particular solution pre-supposes that the primary key will always run in sequence with no holes because the = solution will fail if the record no longer exists and the >= solution has greater weight for records directly after a hole.
The solution I have presented here also has greater weight for records that have a greater gap between the RandomValue if the values are not more or less evenly populated. The smaller the table, the less even the distribution will be.
So in conclusion, there are many approaches to choosing a random record from a MySQL table but none is perfect: they either get too slow as tables increase in size, or are fast but not genuinely random as some records will have more weight than others.
0 comments:
Post a Comment