Created
September 11, 2017 18:33
-
-
Save hfukada/26425b5b8d37b5a0279d94d03575a61e to your computer and use it in GitHub Desktop.
word similarity
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Background | |
Simplifying somewhat, each word of English is composed of a string of sounds, called "phones"/"phonemes". For example, "cat" has the phones /k/, /a/, /t/. | |
Some phones are more "confusable" with other phones--it's easier to mishear /m/ (as in "ram") as /n/ (as in "ran") than /t/ (as in "rat"). I happen to have a phone-to-phone confusion matrix, which tells me the probability of confusing any phone as any other phone. | |
If I want to calculate how easy it is to confuse one string of phones with another of equal length (e.g., hearing "ram" as "rat"), I simply multiply the probability of hearing each phone in the first string as the corresponding phone in the second string. E.g., P(/r/|/r/) * P(/a/|/a/) * P(/t/|/m/). Like I said, I get these probabilities from my confusion matrix. | |
Problem | |
That works fine and dandy for words of the same length, but I want to compare strings of phones of different lengths. If I have the probability of inserting or deleting phones, I should be able to compare the probability of hearing "ram" and "rant". | |
Now Zach, you might say, that's as simple as calculating the weighted Levehnstein (edit) distance between two strings! That's like Compsci 101! But the problem is that Levehnstein distance is the minimum edit distance between two strings. | |
I would assume that the probability of hearing the sounds in "ram" as "rant" would be the marginalization over all possible ways of mishearing it. For example, you could mishear "ram" as "rant" by mishearing the /m/ as /n/ and inserting /t/, but you could ALSO mishear it by mishearing /m/ as /t/ and inserting /n/. | |
The closest I've seen to something that does what I want was some people using a weighted finite-state automata as a kernel function for calculating slightly different stuff. I think they ended up only calculating the k closet paths for each item, however, so what I'm asking about might be technically impossible. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment