This gist describes various methods of measuring string similarity in Python, and the advantages and disadvantages of each.
Python's standard library (difflib) includes an implementation of the Ratcliff-Obershelp similarity algorithm that can be used to get a ratio of the similarity between two strings:
from difflib import SequenceMatcher

def similar_strings(to_compare: str, to_match: str) -> float:
    """Takes in two strings and returns their similarity as a ratio between 0.0 and 1.0

    Parameters
    ----------
    to_compare : str
        The string you want to compare
    to_match : str
        The string you want to compare against

    Returns
    -------
    float
        The ratio of the similarity between the two strings
    """
    # Remove excess whitespace
    to_compare = to_compare.strip()
    to_match = to_match.strip()
    return SequenceMatcher(None, to_compare, to_match).ratio()

similar_strings("biild", "build")  # 0.8
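To see where the ratio comes from, here is a small sketch that recomputes it by hand. difflib documents the ratio as 2.0*M/T, where M is the number of matched characters and T is the total length of both strings. The helper name `explain_ratio` is made up for this illustration and is not part of difflib.

```python
from difflib import SequenceMatcher

def explain_ratio(a: str, b: str) -> tuple[int, float]:
    """Return (matched characters, ratio) computed from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # Each matching block is a run of characters found in both strings;
    # the final block is a zero-length sentinel, so it adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # difflib returns 1.0 when both strings are empty, so mirror that here.
    ratio = 2 * matched / total if total else 1.0
    return matched, ratio

matched, ratio = explain_ratio("biiild", "build")
print(matched, ratio)  # 4 matched characters ("b" and "ild"), ratio 8/11
```

For "biiild" and "build" the matched runs are "b" and "ild", so M = 4 and T = 11, which gives 8/11, roughly 0.727.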
This can also be used to create a function that takes a word and a list of possibilities, then returns the most likely possibility. It uses the method above:
def suggest_word(input_word: str, word_list: list[str]) -> str:
    """Takes in a string and a list of words and returns the most likely word

    Parameters
    ----------
    input_word : str
        The word you want to check for similarity
    word_list : list[str]
        The list of words to test input_word against for similarity

    Returns
    -------
    str
        The most similar word, can also be empty string if none had more than 10% similarity
    """
    if not word_list:  # Nothing to suggest from
        return ""
    similarities = {}
    for current_word in word_list:
        similarities[current_word] = similar_strings(input_word, current_word)
    # Pick the word with the highest similarity ratio
    best_word = max(similarities, key=similarities.get)
    if similarities[best_word] <= 0.1:  # If the most likely suggestion has 10% likelihood or less
        return ""
    return best_word

print(suggest_word("biiild", ["build", "init", "preview"]))  # returns "build"
print(suggest_word("foo", ["build", "init", "preview"]))  # returns ""
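For comparison, difflib also ships `get_close_matches`, which ranks candidates by the same SequenceMatcher ratio. The sketch below is one possible way to get the same behaviour with less code. The wrapper name `suggest_word_stdlib` and its `cutoff` default are assumptions chosen to mirror the 10% threshold above. Note that `get_close_matches` keeps scores greater than or equal to the cutoff, so a score of exactly 0.1 is treated differently from the function above.

```python
from difflib import get_close_matches

def suggest_word_stdlib(input_word: str, word_list: list[str], cutoff: float = 0.1) -> str:
    """Return the closest word from word_list, or "" if nothing reaches the cutoff."""
    # n=1 asks for only the single best match; the result is a (possibly empty) list.
    matches = get_close_matches(input_word.strip(), word_list, n=1, cutoff=cutoff)
    return matches[0] if matches else ""

print(suggest_word_stdlib("biiild", ["build", "init", "preview"]))  # build
print(suggest_word_stdlib("foo", ["build", "init", "preview"]))  # prints an empty line
```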
There is also a third-party package called fuzzywuzzy that wraps the above methods and makes them simple to use right away.
By default it uses the same difflib matcher (though its results are less consistent, not sure why) and can be used for single-word ratios like this:
from fuzzywuzzy import fuzz
print(fuzz.ratio("biiiild", "build"))  # 67
It also has built-in bindings that make it easy to get multiple suggestions (change limit to set how many suggestion pairs you want):
from fuzzywuzzy import process
print(process.extract("biiild", ["build", "init", "preview"], limit=2)) # Returns [('build', 73), ('init', 45)]
fuzzywuzzy can also be used with python-Levenshtein to switch from the Ratcliff-Obershelp similarity algorithm to Levenshtein distance (faster).
Details can be found on their GitHub page. To use it, just use the above methods and install the package:
pip install python-Levenshtein
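For context on what Levenshtein distance measures, here is a minimal pure-Python sketch. It counts the fewest single-character insertions, deletions, and substitutions needed to turn one string into the other. This is only an illustration: it is not how python-Levenshtein implements it (that package is a C extension), and the function name `levenshtein_distance` is made up for this example.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    # Keep b as the shorter string so each row of the table stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # Distance from a[:i] to the empty string is i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("biild", "build"))  # 1 (replace "i" with "u")
print(levenshtein_distance("kitten", "sitting"))  # 3
```

Unlike the difflib ratio, this is a distance: 0 means the strings are identical and larger numbers mean they are less similar.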