Wednesday, 11 July 2018

An Easy Way to Speed Up MySQL FullText Natural Search on Large Tables


An Easy Way to Speed Up MySQL FullText Natural Search on Large Tables


As your WordPress or bbPress install grows over the years, if you have a large site you will notice the search slows down quite a bit on large tables. Another problem is the quality of search results becomes poor as it has to muck through more and more posts.

It’s difficult to jump to a solution like sphinx right away, and things like the google api are not aware enough about the context of your content, so some people will code their own searches within MySQL in the meanwhile.

For example, a minimal search is done something like the following, taking advantage of MySQL’s natural search mode which takes into account the number of times the words appears and their proximity to return a score.
1
2
3
SELECT post_id,  MATCH (post_text) AGAINST ('some words') AS score
FROM posts WHERE MATCH (post_text) AGAINST ('some words')
ORDER BY score DESC


In my situation, using a much more complex version of that query that takes into account all posts in a topic was taking up to 12 seconds which was completely unacceptable.

So it occurred to me that MySQL also has a binary search mode that is far faster. It’s not what I wanted for the overall results because it’s relevancy can be poor, but it instead could just be used to eliminate posts that would have to be scanned. Anytime you can eliminate rows in mysql you will get a speed boost.

So simply do this instead:
1
2
3
SELECT post_id,  MATCH (post_text) AGAINST ('some words') AS score
FROM posts WHERE MATCH (post_text) AGAINST ('some words' IN BOOLEAN MODE)
ORDER BY score DESC


Basically that IN BOOLEAN MODE will eliminate rows very quickly that do not have either of those words in them, but the SELECT still calculates a valuable natural search score for the remaining results.

When using boolean mode you have to take some extra steps to sanitize the query because some characters have special meaning so keep that in mind but it’s trivial to do.

Eventually I took it a step further and preprocessed the search terms even further.See Boolean mode has a score of it’s own, though it’s far less sophisticated than natural search. It simply returns the count of words found. Two word searches seem to return 0 to 2 as a score, three words, 0 to 3, etc.

So I count the words over 3 characters, preprocess them to not count stop words that mysql would ignore anyway, and used that as a minimal boolean count. If a user searched for three meaningful words, the minimal acceptable score would be 2 in boolean mode. You then just append it to the WHERE like so.


1
WHERE MATCH (post_text) AGAINST ('some more words' IN BOOLEAN MODE) > 1


(greater than 1 would be 2 or more words matching)

That accelerated the search to less than a second, 800ms in most cases and the results were still completely accurate, virtually identical to using natural search in the WHERE.

0 comments:

Post a Comment