Friday, 10 August 2018

Find The Nearest Word Using The Levenshtein Function In PHP

The levenshtein() function is part of a set of functions that are used to look at the structure of a string depending on how the string sounds, using levenshtein() allows you to look at the total difference between two strings, defined as a distance value. The important feature of this is that you can compare one string to another and see if they are similar.
The levenshtein() function takes two parameters, these are the two strings that you want to compare with each other. If the two strings are the same then the distance is zero, the higher this value is the more distance there is between two strings. Here are some examples.
  1. echo levenshtein('word','word'); // 0
  2. echo levenshtein('stone','magnet'); // 4
  3. echo levenshtein('window','windmill'); // 4
  4. echo levenshtein('wibble','wobble'); // 1
  5. echo levenshtein('test','toast'); // 2
  6. echo levenshtein('I','antidisestablishmentarianism'); // 28
So does this have any practical uses? The short answer is yes. One of the best things you can do with this function is to create either a spell checker or a Google style "did you mean" along with your search queries.
In order to do this you must first load in a dictionary of words that you can use. I found a good set of word lists on the Internet and doctored this to reduce the word count a little. This new English word list has 6,752 words and is about 65KB in size, which means it can be loaded into memory with minimal fuss.
This dictionary is loaded into memory by the use of the following bit of code.
  1. $dictionary = 'english.txt';
  2. $handle = fopen($dictionary, "r");
  3.  
  4. while(!feof($handle)){
  5. $words .= fread($handle, 8192);
  6. }
  7.  
  8. fclose($handle);
  9. $words = explode("\n",$words);
We now have an array of words. If we were to input a misspelled word the script now needs to use the dictionary to compare this word and find the one that is closest to the inputted word.
$input = stripslashes(strip_tags($_GET['q']));
First we loop through every word in our dictionary and compare it against out current word.
  1. foreach($words as $word){
  2. $lev = levenshtein($input, $word);
  3. }
There are two situations that we should encounter on each iteration of the loop.
  1. The input word will find an exact match in the array of words. This is found by a levenshtein distance of zero.
  2. The levenshtein() of the current word will be less than that of any other word looked at so far.
To account for the first situation we just say that if the levenshtein value is zero then break out of the loop. There is no point in running all the way through the rest of the array if we know that we already have what we are looking for.
The second situation is resolved by storing the smallest distance and using this to see if the next word is any less, and so on. Once the loop has finished we should have a word that closely matches our inputted word. We should set this distance to -1 initially as the levenshtein() function will never return less than zero.
  1. $distance = -1;
  2.  
  3. foreach($words as $word){
  4. $lev = levenshtein($input, $word);
  5.  
  6. // exact match
  7. if($lev == 0){
  8. $closest = $word;
  9. $distance = 0;
  10. // no need to continue as we have found exact match
  11. break;
  12. }
  13.  
  14. // if distance is less than the currently stored distance and it is less than our initial value
  15. if($lev
  16.  
  17. The $closest variable now contains our closest (or exactly) matching word and the $distance value shows the distance between the two words, which might be zero.
  18.  
  19. We can now print out the results. If we are creating this for a search function then we don't need to print out an exact match, just the closest matching word. A simple if statement to check if the distance is greater than zero solves this.
  20.  
  21. if($distance > 0){
  22. echo "Did you mean: ".$closest."?";
  23. }
  24.  
  25. Here is an example page of the code in action, I have printed out the input variable for convenience.
  26.  
  27. More information about levenshtein can be found at the levenshtein wikipedia page.

0 comments:

Post a Comment