"Fuzzy search using the Double Metaphone algorithm" - Work in progress
---
type: blog
title: Fuzzy search using the Double Metaphone algorithm
description: Using the Double Metaphone algorithm to find words that sound alike.
layout: default
date: 2019-07-02
author: Brian Wilkins
image: assets/img/chris-barbalis-1217112-unsplash.jpg
authorLink: 
relcanonical: 
tags:
  - Double Metaphone
  - views
---

Introduction

In an earlier article I described how to do a fuzzy search for documents based on how the words they contain sound. The technique I described there uses a view that implements the Soundex algorithm. The aim of Soundex is to give words that sound alike the same encoding, so that they can be matched despite minor differences in spelling. Soundex predates the electronic computer and is fairly simple.
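To make the idea of "same encoding for similar-sounding words" concrete, here is a minimal sketch of the common American variant of Soundex. It is not the code from the earlier article, and the function name `soundex` and the handling of non-ASCII letters (they are simply dropped) are my own assumptions for illustration.

```javascript
// Digit codes for consonants; vowels, h, w and y have no code.
const SOUNDEX_CODES = {
  b: "1", f: "1", p: "1", v: "1",
  c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
  d: "3", t: "3",
  l: "4",
  m: "5", n: "5",
  r: "6",
};

// Sketch of American Soundex: first letter plus three digits.
function soundex(word) {
  // Assumption: anything outside a-z is ignored rather than transliterated.
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (letters.length === 0) return "";

  let result = letters[0].toUpperCase();
  let previous = SOUNDEX_CODES[letters[0]] || "";

  for (const ch of letters.slice(1)) {
    const code = SOUNDEX_CODES[ch] || "";
    // Append a digit unless it repeats the code of the preceding letter.
    if (code && code !== previous) result += code;
    // h and w do not separate letters with the same code; vowels do.
    if (ch !== "h" && ch !== "w") previous = code;
  }

  // Pad with zeros and truncate to four characters.
  return (result + "000").slice(0, 4);
}

console.log(soundex("Robert"), soundex("Rupert")); // R163 R163
```

Because this variant always keeps the first letter, spellings that start with different letters can never match, which is one of the limitations a more sophisticated algorithm tries to address.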

A more sophisticated algorithm with a similar purpose is Double Metaphone. Double Metaphone aims to yield more true matches and fewer false matches. It aims to work for non-English words as well as English words. For words that can be pronounced in more than one way it returns two different encodings. For example, for Wagner it returns FKNR for the German pronunciation in which W is pronounced as the v in vodka, and returns AKNR for the Anglicized pronunciation where W is pronounced as the w in water.
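One way to use the two encodings is to treat two names as a match when any encoding of one equals any encoding of the other. The sketch below assumes each name has already been turned into a pair of encodings. The helper name `encodingsOverlap` is hypothetical, the order of the two Wagner codes is an assumption, and the second pair stands for an imaginary name whose only pronunciation encodes as FKNR.

```javascript
// Returns true when the two encoding pairs share at least one code.
function encodingsOverlap(pairA, pairB) {
  const codesA = new Set(pairA);
  return pairB.some((code) => codesA.has(code));
}

// The two encodings quoted above for Wagner (order assumed).
const wagner = ["AKNR", "FKNR"];

// Hypothetical name with a single pronunciation, so both codes are equal.
const germanOnly = ["FKNR", "FKNR"];

console.log(encodingsOverlap(wagner, germanOnly)); // true
```

Emitting both encodings as keys of a view, as done further on, achieves the same effect at query time: a lookup by either code finds the document.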

An implementation of the Double Metaphone algorithm in JavaScript is here. It is a function called doubleMetaphone. It returns an array of two strings, each of which is an encoding that approximately represents the pronunciation of the input string. With a few minor changes the function can be used in a Cloudant view. To turn it into the Map function of a Cloudant view, I simply removed the line:

module.exports = doubleMetaphone

and appended these lines to make a function that emits values returned by the doubleMetaphone function:

function (doc) {
  // Encode the name once; doubleMetaphone returns two encodings.
  var encodings = doubleMetaphone(doc.name);
  emit(encodings[0], 1);
  emit(encodings[1], 1);
}

I have put the complete Cloudant view Map function here.
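If the duplicate rows for names with a single pronunciation are unwanted, the map function could emit the second encoding only when it differs from the first. The following is a sketch of that variant, not the function used in the rest of this article. So that it runs outside Cloudant, `emit` and `doubleMetaphone` are replaced by stand-ins that are my own assumptions; the stub only knows the encodings quoted in this article.

```javascript
// Stand-in for Cloudant's emit(): records what the map function emits.
var emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// Stand-in for the real doubleMetaphone function (assumption: it returns
// only the encodings quoted in this article, in an assumed order).
function doubleMetaphone(value) {
  return value === "Wagner" ? ["AKNR", "FKNR"] : ["XKFSK", "XKFSK"];
}

// Variant map function: skips documents without a string name and
// emits the second encoding only if it differs from the first.
var map = function (doc) {
  if (typeof doc.name !== "string") return;
  var encodings = doubleMetaphone(doc.name);
  emit(encodings[0], 1);
  if (encodings[1] !== encodings[0]) {
    emit(encodings[1], 1);
  }
};

map({ name: "Tchaikovsky" });
map({ name: "Wagner" });
map({ title: "no name field" });
console.log(emitted);
```

With this variant a document would appear at most once for any given key, so the number of times a name is listed in a query result would no longer indicate whether its two encodings were the same.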

Trying it out

Now let's see how well the view does at finding matches among the many different spelling variations of the name of Tchaikovsky (the Russian composer of such works as the 1812 Overture and Swan Lake).

In the steps below I use these conventions:

  • $USER stands for your Cloudant user name;
  • $PASS stands for the password of user $USER;
  • $ACCOUNT stands for the name of your Cloudant account;
  • $DB stands for the name of your database.

Creating the view

The design document _design/doubleMetaphone, which contains the doubleMetaphone view, is in ddoc.txt. Write it to the database:

curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d @ddoc.txt
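The contents of ddoc.txt are not reproduced in this article. As a rough guide, a design document for a view generally has the shape sketched below, where the map function is stored as a string. The `mapSource` placeholder is an assumption standing in for the full function text; the `_id` and view name are taken from the URLs used in the queries in this article.

```javascript
// Placeholder for the complete map function source (the doubleMetaphone
// code followed by the emitting function). The real text is much longer.
const mapSource = "function (doc) { /* ... */ }";

// Standard CouchDB/Cloudant design document layout for a single view.
const designDoc = {
  _id: "_design/doubleMetaphone",
  views: {
    doubleMetaphone: {
      map: mapSource,
    },
  },
};

// This JSON text is what would be saved to a file and POSTed to the database.
console.log(JSON.stringify(designDoc, null, 2));
```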

Writing some test documents

Write some documents with variant spellings of the name:

curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Cajkovskij"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Chaikovsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Chaĭkovski"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Chaĭkovskiĭ"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Ciaikovski"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Ciaikovskij"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Ciaikovskji"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Ciaikovsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Csajkovszkij"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Czajkowski"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tchaikofsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tchaikovski"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tchaikovskij"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tchaikovsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tchaikowsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tchaïkovski"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tchaïkovsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tciaikowski"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tjajkovskij"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tschaijkowskij"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tschaikousky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tschaikovsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tschaikowski"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tschaikowsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tschajkowskij"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tsjaikovsky"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Tsjaïkovskiej"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Čaikovskis"}'
curl -u $USER:$PASS -X POST https://$ACCOUNT.cloudant.com/$DB -H "Content-Type: application/json" -d '{"name": "Čajkovskij"}'
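As an alternative to one request per document, Cloudant also accepts many documents in a single POST to the database's _bulk_docs endpoint. The sketch below only builds the request body; the `names` array is abbreviated to a few of the spellings above, and the variable names are my own.

```javascript
// A few of the variant spellings from the commands above (abbreviated).
const names = ["Cajkovskij", "Chaikovsky", "Czajkowski", "Tchaikovsky", "Čajkovskij"];

// _bulk_docs expects an object with a "docs" array of documents.
const bulkBody = JSON.stringify({
  docs: names.map((name) => ({ name })),
});

// Save this text to a file and POST it to https://$ACCOUNT.cloudant.com/$DB/_bulk_docs
console.log(bulkBody);
```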

Querying the view to find the Double Metaphone encoding

Querying the view shows that the Double Metaphone encoding of the name Tchaikovsky is XKFSK. There are two rows in the result below because our view emits two keys for each document. For some documents the two keys differ, representing alternative ways of pronouncing the name in the document. In this case, however, both keys are the same because the algorithm finds just one way of pronouncing Tchaikovsky.

$ curl -s -u $USER:$PASS "https://$ACCOUNT.cloudant.com/$DB/_design/doubleMetaphone/_view/doubleMetaphone?include_docs=true" | jq '.rows[] | select(.doc.name=="Tchaikovsky")'
{
  "id": "b856704adaed6a725f6864727a1d09f8",
  "key": "XKFSK",
  "value": 1,
  "doc": {
    "_id": "b856704adaed6a725f6864727a1d09f8",
    "_rev": "1-bc039578b06ba135081a3b7a4baefb6c",
    "name": "Tchaikovsky"
  }
}
{
  "id": "b856704adaed6a725f6864727a1d09f8",
  "key": "XKFSK",
  "value": 1,
  "doc": {
    "_id": "b856704adaed6a725f6864727a1d09f8",
    "_rev": "1-bc039578b06ba135081a3b7a4baefb6c",
    "name": "Tchaikovsky"
  }
}
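The same filtering can be done in JavaScript once the view response has been parsed. This is a small sketch under the assumption that `rows` holds the `rows` array of the response; the helper name `encodingsFor` is hypothetical, and the sample rows are trimmed copies of the output above.

```javascript
// Trimmed copies of the two rows shown above.
const rows = [
  { id: "b856704adaed6a725f6864727a1d09f8", key: "XKFSK", value: 1, doc: { name: "Tchaikovsky" } },
  { id: "b856704adaed6a725f6864727a1d09f8", key: "XKFSK", value: 1, doc: { name: "Tchaikovsky" } },
];

// Returns the distinct keys (encodings) emitted for documents with this name.
function encodingsFor(viewRows, name) {
  const keys = viewRows
    .filter((row) => row.doc && row.doc.name === name)
    .map((row) => row.key);
  return [...new Set(keys)];
}

console.log(encodingsFor(rows, "Tchaikovsky")); // [ 'XKFSK' ]
```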

Querying the view to find similar sounding names

Now let's find other name values in our database for which the Double Metaphone encoding is the same as that of Tchaikovsky. Where the same name appears more than once in the result below, that means the two encodings emitted for that document by the view were the same. A name that appears only once matched on just one of its two encodings.

$ curl -s -u $USER:$PASS "https://$ACCOUNT.cloudant.com/$DB/_design/doubleMetaphone/_view/doubleMetaphone?key=\"XKFSK\"&include_docs=true" | jq '.rows[].doc.name'
"Czajkowski"
"Tchaikovskij"
"Tchaïkovski"
"Tchaïkovski"
"Tchaïkovsky"
"Tchaïkovsky"
"Chaĭkovskiĭ"
"Chaĭkovskiĭ"
"Tchaikowsky"
"Chaikovsky"
"Chaikovsky"
"Tchaikofsky"
"Tchaikofsky"
"Tchaikovsky"
"Tchaikovsky"
"Tchaikovski"
"Tchaikovski"
"Chaĭkovski"
"Chaĭkovski"
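When the query is made from code rather than from the shell, two details are worth handling: the `key` parameter has to be a JSON-encoded string (hence the quotes around XKFSK), and repeated names can be collapsed. The sketch below shows both. The helper names `viewUrl` and `distinctNames` are hypothetical, and the sample rows are a trimmed subset of the result above.

```javascript
// Builds the view URL; URLSearchParams percent-encodes the quoted key.
function viewUrl(account, db, key) {
  const base = `https://${account}.cloudant.com/${db}/_design/doubleMetaphone/_view/doubleMetaphone`;
  const params = new URLSearchParams({
    key: JSON.stringify(key), // view keys are passed as JSON
    include_docs: "true",
  });
  return `${base}?${params}`;
}

// Collapses repeated names while keeping the order of first appearance.
function distinctNames(viewRows) {
  return [...new Set(viewRows.map((row) => row.doc.name))];
}

// Trimmed subset of the rows returned above.
const sampleRows = [
  { key: "XKFSK", doc: { name: "Czajkowski" } },
  { key: "XKFSK", doc: { name: "Tchaïkovski" } },
  { key: "XKFSK", doc: { name: "Tchaïkovski" } },
];

console.log(viewUrl("myaccount", "mydb", "XKFSK")); // placeholder account and database names
console.log(distinctNames(sampleRows));
```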

Conclusion

The view has identified some documents in which the name is a spelling variant of Tchaikovsky: 11 of the 29 test spellings matched, counting Tchaikovsky itself. Not every variant spelling of the name has been recognized, so it's not perfect. As spellings of the same name can vary so much, it is hard to see how it could ever be. Nevertheless, I think it has made a good swing at it.
