Using SOUNDEX and MySQL Full-Text Search for Fuzzy Matching

I recently received a question from a Xataface user about how to support searches for misspelled names. Out of the box, Xataface provides some fairly advanced search support that allows users to search on particular columns, across all columns in a single table, or across all tables (from a specified subset of the tables in the database). For the single-table searches, Xataface uses combinations of LIKE and = comparisons depending on the type of query and field type. E.g. If a user enters a search in the last_name column of the people table for “Williams”, it will perform an SQL query like the following, under the hood:

SELECT * FROM `people` where `last_name` LIKE 'Williams'

Note: This isn’t exactly how Xataface generates a query like this. This is simplified for readability.

This works great, but what if we wanted to be able to find results of misspellings of “Williams”. E.g. If the user entered a search for ‘Wlliams’, we still want the same result set to be found.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. It happens to provide a very simple way to search for misspellings. Both PHP and MySQL include a SOUNDEX hashing function that will take string input and produce the SOUNDEX code for that input. The SOUNDEX code for strings that are misspelled are often the same. E.g.

SOUNDEX('Williams') === 'W452'

and

SOUNDEX('Wlliams') === 'W452'

This means that we can allow users to search for misspellings on the last_name field by modifying the SQL query as follows:

SELECT * FROM `people` where SOUNDEX(`last_name`) = SOUNDEX('Wlliams')

This should match a superset of the records that were matched in the original query. You might verbalize this query as “Find all people with last names that sound like Williams.

Matching Records that Contain A Word

The above example works great if we are searching on a column that contains a single word and we are matching against the whole word, but it falls short in cases where we need to find values that contain a search term. E.g. Suppose the people table also has a bio field that contains a short profile of the person. Now suppose we want to find all people whose bio contains the word “cardiac”. Xataface might generate an SQL query like:

SELECT * FROM `people` where `bio` like '%cardiac%'

Now if we want to find misspellings, we can’t simply do:

SELECT * FROM `people` where SOUNDEX(`bio`) like SOUNDEX('cardiac')

MySQL doesn’t provide a simple way to perform string splitting and mapping inside a WHERE clause so we would need to do some work prior to the search to allow us to search for misspellings in the bio field. One possible solution create an index on the bio field that includes soundex codes for each word. E.g. we could add a field named bio_soundex to the people table. Then every time the bio is changed we could update the bio_soundex. In Xataface we could perform this in the beforeSave trigger:

function beforeSave(Dataface_Record $record){
   if ( $record->valueChanged('bio') ){
      $words = preg_split(
          '/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', 
          $record->val('bio'), 
          -1, 
          PREG_SPLIT_NO_EMPTY
      );
      $soundexWords = array();
      foreach ( $words as $word){
         $soundexWords[] = soundex($word);
      }
      $record->setValue('bio_soundex', implode(' ', $soundexWords));
   }
}

This example makes use of Xataface specific hooks and classes, but the concept of splitting the field content into individual words, producing the soundex code for each word, then saving the concatenation of these codes (separated by spaces) into another field can easily be done in any system.

Now, we are able to perform a query like this:

SELECT * FROM `people` WHERE `bio_soundex` LIKE CONCAT('%',SOUNDEX('crdiac'),'%')

And this will return all records where the bio contains the word “cardiac” (and many misspellings thereof).

Currently Xataface doesn’t automatically perform this type of soundex conversion for single-table searches. This example shows how you can do it yourself. However, as I describe next, Xataface does perform automatic soundex indexing for multi-table searches.

Fuzzy Matching in Multi-Table Search

Xataface has supported a multi-table search option for several years now. It works by maintaining a central index table with an entry for each record that can be searched. This index stores record details such as its table, id, URL, title, description, and searchable text which has a MySQL full-text search index that is used for the actual searching. Using MySQL’s full-text indexing feature allows Xataface to perform relevance sorting. This index is automatically updated each time a record is saved, and Xataface provides administrators with tools to manually clear and regenerate the index.

If a user enters a search for “Cardiac” in the multi-table search box, Xataface will generate an SQL query similar to:

SELECT 
  `record_id`, 
  `table`, 
  `record_url`, 
  `record_title`, 
  `record_description`, 
  `searchable_text`, 
  `lang`, 
  match(searchable_text) AGAINST ('Cardiac') as `relevance`
FROM dataface__index
WHERE `lang`='en' AND 
match(searchable_text) against ('Cardiac')
ORDER BY `relevance` DESC

In order to modify the index to support misspelled searches, I used a strategy similar to the one described above (for the bio field). At the time a record is indexed, I parse the content of the searchable_text into words, then produce a string of concatenated soundex codes corresponding to those words, and append them to searchable text field.

The actual code is as follows (From here:

$searchable_text = preg_replace('/<[^>]*?>/', ' ', $searchable_text);
$words = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $searchable_text, -1, PREG_SPLIT_NO_EMPTY);
$soundexAddons = array();
foreach ( $words as $word ){
    $soundexAddons[] = soundex(trim($word));
}

$searchable_text .= '[/////]: '.implode(' ', $soundexAddons);
$searchable_text = strip_tags($searchable_text);

$sql = "
    replace into dataface__index 
        (`record_id`,`table`,`record_url`,`record_title`,`record_description`,`lang`,`searchable_text`)
    values
    (
    '".addslashes($record->getId())."',
    '".addslashes($record->_table->tablename)."',
    '".addslashes($record->getPublicLink())."',
    '".addslashes($record->getTitle())."',
    '".addslashes(strip_tags($record->getDescription()))."',
    '".addslashes($lang)."',
    '".addslashes($searchable_text)."'
    )";
if ( !@xf_db_query($sql, df_db()) ){
    $this->createIndexTable();
    if ( !xf_db_query($sql, df_db()) ){
        trigger_error(xf_db_error(df_db()), E_USER_ERROR);
    }
}

At the time of the search, I perform a similar transformation to the search phrase to append the soundex codes of all words in the query to the end of the query. E.g. If the user searches for “Crdiac”, xataface will actually search for “Crdiac C632” (C632 is the soundex code for Cardiac).

This snippet shows how this is accomplished (From here:

$words = explode(' ', $query['-search']);
$soundexAddons = array();
foreach ( $words as $word){
    $soundexAddons[] = soundex($word);
}
$orig_search = $query['-search'];
$query['-search'] .= ' '.implode(' ', $soundexAddons);

So the resulting SQL query will be something like:

SELECT 
  `record_id`, 
  `table`, 
  `record_url`, 
  `record_title`, 
  `record_description`, 
  `searchable_text`, 
  `lang`, 
  match(searchable_text) AGAINST ('Cardiac C632') as `relevance`
FROM dataface__index
WHERE `lang`='en' AND 
match(searchable_text) against ('Cardiac C632')
ORDER BY `relevance` DESC

More Information

comments powered by Disqus