@Descent098
Last active July 16, 2023 14:12
Word Similarity Methods in Python

This gist describes several ways of measuring word similarity in Python, and the advantages and disadvantages of each.

Pure Python

Python's standard library includes an implementation of the Ratcliff-Obershelp similarity algorithm (difflib.SequenceMatcher) that can be used to get a ratio of the similarity between two strings:

from difflib import SequenceMatcher

def similar_strings(to_compare:str, to_match:str) -> float:
    """Takes in two strings and returns a float between 0 and 1 indicating how similar they are to each other

    Parameters
    ----------
    to_compare : str
        The string you want to compare
    to_match : str
        The string you want to compare against

    Returns
    -------
    float
        The ratio of the similarity between the two strings
    """
    # Remove excess whitespace
    to_compare = to_compare.strip()
    to_match = to_match.strip()
    return SequenceMatcher(None, to_compare, to_match).ratio()
    
similar_strings("biild", "build") # 0.8
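
For reference, the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matching characters and T is the combined length of both strings. A minimal sketch that recomputes the value from the matching blocks (the helper name manual_ratio is my own, for illustration only):

```python
from difflib import SequenceMatcher

def manual_ratio(a: str, b: str) -> float:
    """Recompute SequenceMatcher.ratio() by hand (illustrative helper, not part of difflib)."""
    matcher = SequenceMatcher(None, a, b)
    # Each block is (start_in_a, start_in_b, size); the final block is a zero-size sentinel
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # difflib returns 1.0 when both sequences are empty
    return 2.0 * matches / total if total else 1.0

print(manual_ratio("biild", "build"))  # 0.8
```

Here "b" and "ild" match, so M is 4 and T is 10, which gives 2 * 4 / 10 = 0.8.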

The function above can also be used to build a function that takes in a word and a list of possibilities, then returns the most likely possibility:

def suggest_word(input_word:str, word_list:list) -> str:
    """Takes in a string and a list of words and returns the most likely word

    Parameters
    ----------
    input_word : str
        The word you want to check for similarity

    word_list : list
        The list of words (strings) to test input_word against for similarity

    Returns
    -------
    str
        The most similar word, or an empty string if none had more than 10% similarity
    """
    similarities = {}
    for current_word in word_list:
        similarities[current_word] = similar_strings(input_word, current_word)
    # Sort from most to least similar so the best match comes first
    similarities = dict(sorted(similarities.items(), key=lambda x: x[1], reverse=True))
    if not similarities or list(similarities.values())[0] <= 0.1: # If the best suggestion has 10% similarity or less
        return ""

    return next(iter(similarities)) # Return first (most similar) word in dictionary

print(suggest_word("biiild", ["build", "init", "preview"])) # returns "build"
print(suggest_word("foo", ["build", "init", "preview"])) # returns ""
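
The standard library also ships a ready-made version of this idea: difflib.get_close_matches ranks candidates with the same SequenceMatcher ratio. A minimal sketch, where the wrapper name suggest_word_stdlib and the 0.1 default cutoff (chosen to mirror the 10% threshold above) are my own assumptions:

```python
from difflib import get_close_matches

def suggest_word_stdlib(input_word: str, word_list: list, cutoff: float = 0.1) -> str:
    """Return the closest word from word_list, or "" if nothing scores at least cutoff."""
    # n=1 asks for only the single best match; the result is a list (possibly empty)
    matches = get_close_matches(input_word.strip(), word_list, n=1, cutoff=cutoff)
    return matches[0] if matches else ""

print(suggest_word_stdlib("biiild", ["build", "init", "preview"]))  # build
print(repr(suggest_word_stdlib("foo", ["build", "init", "preview"])))  # ''
```

One small difference: get_close_matches keeps candidates scoring at or above the cutoff, whereas the function above rejects a score of exactly 0.1.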

Third-party wrapper

There is also a third-party package called fuzzywuzzy that wraps the above methods and makes them simple to use right away.

By default it uses the same difflib matcher (though in my experience its results are less consistent, and I'm not sure why) and can be used for single-word ratios like this:

from fuzzywuzzy import fuzz

print(fuzz.ratio("biiiild", "build")) # 67
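
Note that the score is an integer from 0 to 100 rather than a float from 0 to 1. Assuming fuzz.ratio simply scales and rounds the difflib ratio (this is my assumption; it is consistent with the 67 above, but may not hold when python-Levenshtein is installed), a standard-library-only equivalent would look roughly like this. The name percent_ratio is hypothetical:

```python
from difflib import SequenceMatcher

def percent_ratio(a: str, b: str) -> int:
    """Similarity as an integer from 0 to 100 (assumed to approximate fuzz.ratio)."""
    # Scale the 0..1 difflib ratio to 0..100 and round to the nearest integer
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

print(percent_ratio("biiiild", "build"))  # 67
```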

It also has a built-in helper for getting multiple suggestions easily (change limit to control how many suggestion pairs you want):

from fuzzywuzzy import process
print(process.extract("biiild", ["build", "init", "preview"], limit=2)) # Returns [('build', 73), ('init', 45)]

Full third-party

fuzzywuzzy can also be used with python-Levenshtein to switch from the Ratcliff-Obershelp similarity algorithm to Levenshtein distance (which is faster).

Details can be found on their GitHub page. To use it, just use the above methods and install the package:

pip install python-Levenshtein
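
For intuition about what Levenshtein distance measures, here is a minimal pure-Python sketch. It is only an illustration of the idea (the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another), not how python-Levenshtein implements it, and it will be far slower than the compiled package:

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the already-processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (free if the characters match)
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("biild", "build"))     # 1
print(levenshtein_distance("kitten", "sitting"))  # 3
```

Unlike the ratios above, a lower number here means the strings are more similar, and 0 means they are identical.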
