Performance analysis of MySQL's FULLTEXT indexes and LIKE queries for full text search
When searching for text in a MySQL table, you have two choices:
- The LIKE operator
- FULLTEXT indexes (which currently only work on MyISAM tables, but will one day work on InnoDB tables. The workaround right now is to extract your search text to a separate MyISAM table, so your main table can remain InnoDB.)
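To make the workaround concrete, here is a sketch of what such a split schema might look like. The table and column names (`articles`, `article_search_texts`) are hypothetical, not from the original setup; the DDL is kept in Python strings, as it might appear in a migration script:

```python
# Hypothetical schema for the MyISAM workaround: the main table stays
# InnoDB, while a separate MyISAM table holds only the searchable text
# and carries the FULLTEXT index.
MAIN_TABLE_DDL = """
CREATE TABLE articles (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(255) NOT NULL,
  body TEXT NOT NULL
) ENGINE=InnoDB
"""

SEARCH_TABLE_DDL = """
CREATE TABLE article_search_texts (
  article_id INT UNSIGNED NOT NULL PRIMARY KEY,
  search_text TEXT NOT NULL,
  FULLTEXT KEY (search_text)
) ENGINE=MyISAM
"""
```

The application then has to keep `article_search_texts` in sync whenever an article changes, e.g. from a model callback.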
I always wondered how those two methods would scale as the number of records grows, so I ran an experiment.
The experiment
Here is the setup:
- I created a table of records that store searchable text in a `TEXT` column called `search_text`.
- Each sample record in that table was a generated text of 50 random English words.
- I used different vocabulary sizes for the generated texts (we will see that this has a major impact on FULLTEXT performance).
- Each query against the table consisted of three random words drawn from the same vocabulary.
- I used sample sizes from 1,000 to 100,000 records, meaning 50,000 to 5,000,000 indexed words.
- For each sample size I measured the runtime of 50 queries, discarded the top 5 and bottom 5 measurements, and averaged the rest.
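The data generation and the trimmed averaging described above can be sketched as follows. This is a minimal illustration of the methodology, not the original benchmark code; the function names and the tiny example vocabulary are mine:

```python
import random
import statistics

def generate_text(vocabulary, words_per_record=50):
    """Build one sample record: a text of random words from the vocabulary."""
    return " ".join(random.choice(vocabulary) for _ in range(words_per_record))

def trimmed_mean(timings, discard=5):
    """Average the timings after dropping the top and bottom measurements."""
    kept = sorted(timings)[discard:-discard]
    return statistics.mean(kept)

# A tiny vocabulary and fake timings stand in for real query measurements.
vocabulary = ["alpha", "beta", "gamma", "delta", "epsilon"]
record = generate_text(vocabulary)   # one 50-word sample record

timings = list(range(1, 51))         # 50 fake per-query runtimes
average = trimmed_mean(timings)      # mean of 6..45 = 25.5
```

Discarding the extremes like this keeps one-off outliers (a cold cache, a background process) from skewing the averages.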
I compared two approaches:
- A LIKE query:
  `WHERE search_text LIKE "%word1%" AND search_text LIKE "%word2%" AND search_text LIKE "%word3%"`
- A boolean FULLTEXT query:
  `WHERE MATCH(search_text) AGAINST ("+word1 +word2 +word3" IN BOOLEAN MODE)`
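For reference, here is how the two WHERE clauses could be assembled from a list of query words. This is my own sketch; in real application code you would use bound parameters or your framework's escaping rather than naive string interpolation:

```python
def like_where(words):
    """Build the LIKE-based clause: one wildcard substring match per word."""
    return " AND ".join(f'search_text LIKE "%{w}%"' for w in words)

def fulltext_where(words):
    """Build the boolean-mode FULLTEXT clause: the + prefix makes every word required."""
    terms = " ".join(f"+{w}" for w in words)
    return f'MATCH(search_text) AGAINST ("{terms}" IN BOOLEAN MODE)'

words = ["word1", "word2", "word3"]
print(like_where(words))
print(fulltext_where(words))
```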
The results
The horizontal axis is the number of records; the vertical axis is the average time to process a query:
I took away several lessons from this:
- LIKE is a lot more efficient than I thought
- For a medium-sized data set of up to 10,000 records (500,000 words) or so, LIKE only takes a fraction of a second. This means that when optimizing a typical Rails action, you should probably look beyond the database. A view can easily take many times longer to render. So measure before blaming the database.
- FULLTEXT performs better when your text has low redundancy
- FULLTEXT performance differs by a factor of 78 between a vocabulary of 1,000 words and one of 100,000 words. My guess is that larger vocabularies result in a very wide but shallow inverted index: each word maps to a short list of matching records, so a query can quickly determine whether it has matches. An educated person has a passive vocabulary of 15,000 to 20,000 words, so FULLTEXT should work well for natural language texts.
- Both LIKE and FULLTEXT scale linearly
- While the FULLTEXT approach was many times faster than the LIKE approach in my tests, both approaches seem to scale linearly with the number of records. For a typical web project where you need to index well under 5 million words, FULLTEXT will be fast enough to serve your searches until the project reaches end-of-life. But if you expect your data to grow indefinitely, FULLTEXT can only postpone the scaling pain, and you will eventually need to deal with it, probably using a non-MySQL solution.
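The inverted-index intuition behind the vocabulary effect can be illustrated with a toy model. This is my sketch, not MySQL's actual index implementation: with a fixed number of indexed words, a larger vocabulary spreads them over more, shorter posting lists, so each query term filters out more records.

```python
import random
from collections import defaultdict

def build_inverted_index(records):
    """Map each word to the set of record ids that contain it."""
    index = defaultdict(set)
    for record_id, text in enumerate(records):
        for word in text.split():
            index[word].add(record_id)
    return index

def average_postings(index):
    """Average posting-list length: how many records a typical word appears in."""
    return sum(len(ids) for ids in index.values()) / len(index)

rng = random.Random(0)

def sample(vocab_size, n_records=1000, words_per_record=50):
    """Generate records of random words, mimicking the experiment's setup."""
    vocab = [f"w{i}" for i in range(vocab_size)]
    return [" ".join(rng.choice(vocab) for _ in range(words_per_record))
            for _ in range(n_records)]

small_vocab = build_inverted_index(sample(1_000))
large_vocab = build_inverted_index(sample(100_000))

# The small vocabulary yields far longer posting lists per word.
print(average_postings(small_vocab) > average_postings(large_vocab))  # True
```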