Skip to content

Instantly share code, notes, and snippets.

@hfukada
Created September 11, 2017 18:33
Show Gist options
  • Save hfukada/26425b5b8d37b5a0279d94d03575a61e to your computer and use it in GitHub Desktop.
Save hfukada/26425b5b8d37b5a0279d94d03575a61e to your computer and use it in GitHub Desktop.
word similarity
Background
Simplifying somewhat, each word of English is composed of a string of sounds, called "phones"/"phonemes". For example, "cat" has the phones /k/, /a/, /t/.
Some phones are more "confusable" with other phones--it's easier to mishear /m/ (as in "ram") as /n/ (as in "ran") than /t/ (as in "rat"). I happen to have a phone-to-phone confusion matrix, which tells me the probability of confusing any phone as any other phone.
If I want to calculate how easy it is to confuse one string of phones with another of equal length (e.g., hearing "ram" as "rat"), I simply multiply the probability of hearing each phone in the first string as the corresponding phone in the second string. E.g., P(/r/|/r/) * P(/a/|/a/) * P(/t/|/m/). Like I said, I get these probabilities from my confusion matrix.
Problem
That works fine and dandy for words of the same length, but I want to compare strings of phones of different lengths. If I have the probability of inserting or deleting phones, I should be able to compare the probability of hearing "ram" and "rant".
Now Zach, you might say, that's as simple as calculating the weighted Levehnstein (edit) distance between two strings! That's like Compsci 101! But the problem is that Levehnstein distance is the minimum edit distance between two strings.
I would assume that the probability of hearing the sounds in "ram" as "rant" would be the marginalization over all possible ways of mishearing it. For example, you could mishear "ram" as "rant" by mishearing the /m/ as /n/ and inserting /t/, but you could ALSO mishear it by mishearing /m/ as /t/ and inserting /n/.
The closest I've seen to something that does what I want was some people using a weighted finite-state automata as a kernel function for calculating slightly different stuff. I think they ended up only calculating the k closet paths for each item, however, so what I'm asking about might be technically impossible.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment