Descent098/Word Similarity in Python.md

## Word Similarity in Python.md

      
This gist describes several ways to measure word similarity in Python, along with the advantages and disadvantages of each.

### Pure Python

Python's standard library ships `difflib.SequenceMatcher`, which is based on the Ratcliff-Obershelp similarity algorithm and can be used to get a ratio (between 0 and 1) of the similarity between two strings:
from difflib import SequenceMatcher

def similar_strings(to_compare:str, to_match:str) -> float:
    """Takes in two strings and returns how similar they are as a ratio between 0 and 1

    Parameters
    ----------
    to_compare : str
        The string you want to compare
    to_match : str
        The string you want to compare against

    Returns
    -------
    float
        The ratio of the similarity between the two strings
    """
    # Remove excess whitespace
    to_compare = to_compare.strip()
    to_match = to_match.strip()
    return SequenceMatcher(None, to_compare, to_match).ratio()
    
similar_strings("biiild", "build") # 0.7272727272727273
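
To see where that number comes from, the sketch below recomputes the ratio by hand as `2 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings. The helper name `ratio_by_hand` is made up for illustration; `get_matching_blocks()` is part of `difflib`.

```python
from difflib import SequenceMatcher

def ratio_by_hand(a: str, b: str) -> float:
    """Recompute SequenceMatcher.ratio() as 2 * M / T (illustrative helper)."""
    matcher = SequenceMatcher(None, a, b)
    # The final block is a zero-length sentinel, so summing every size is safe
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0

a, b = "biiild", "build"
for block in SequenceMatcher(None, a, b).get_matching_blocks():
    if block.size:  # Skip the sentinel block
        print(repr(a[block.a:block.a + block.size]))  # Each shared chunk
print(ratio_by_hand(a, b))
```

Here the shared chunks are `'b'` and `'ild'`, so 4 of the 11 total characters match on each side, giving 2 * 4 / 11.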
The function above can also be used to build a function that takes in a word and a list of possibilities, then returns the most likely possibility:
from typing import List

def suggest_word(input_word:str, word_list:List[str]) -> str:
    """Takes in a string and a list of words and returns the most likely word

    Parameters
    ----------
    input_word : str
        The word you want to check for similarity

    word_list : List[str]
        The list of words to test input_word against for similarity

    Returns
    -------
    str
        The most similar word, can also be empty string if none had more than 10% similarity
    """
    if not word_list: # Nothing to suggest from
        return ""

    # Score each candidate with similar_strings() from above
    similarities = {}
    for current_word in word_list:
        similarities[current_word] = similar_strings(input_word, current_word)

    # Pick the candidate with the highest similarity ratio
    best_word = max(similarities, key=similarities.get)
    if similarities[best_word] <= 0.1: # If the most likely suggestion has 10% similarity or less
        return ""

    return best_word

print(suggest_word("biiild", ["build", "init", "preview"])) # returns "build"
print(suggest_word("foo", ["build", "init", "preview"])) # returns ""
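
The standard library also has a ready-made helper for this kind of lookup, `difflib.get_close_matches`. The sketch below shows roughly the same behaviour; the `commands` list is just the example data from above, and the `cutoff=0.1` value is chosen to mirror the 10% threshold (note that `get_close_matches` keeps scores greater than or equal to the cutoff, and its default cutoff is 0.6).

```python
from difflib import get_close_matches

commands = ["build", "init", "preview"]

# n=1 asks for only the single best candidate; an empty list means no match
print(get_close_matches("biiild", commands, n=1, cutoff=0.1))  # ['build']
print(get_close_matches("foo", commands, n=1, cutoff=0.1))     # []
```

This avoids maintaining a custom function, at the cost of getting back a list instead of a single string.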

### Third-party wrapper

There is also a third-party package called fuzzywuzzy that wraps the methods above and makes them simple to use right away.
By default it uses the same difflib matcher (though its results are less consistent, not sure why) and can be used for single-word ratios like this:
from fuzzywuzzy import fuzz

print(fuzz.ratio("biiiild", "build")) #67
It also has built-in helpers that make it easy to get multiple suggestions (change limit to control how many suggestion pairs you get back):
from fuzzywuzzy import process
print(process.extract("biiild", ["build", "init", "preview"], limit=2)) # Returns [('build', 73), ('init', 45)]
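
As a rough illustration of what the `limit` argument and the 0-100 scores mean, here is a standard-library-only analogue. It is a sketch, not fuzzywuzzy's implementation: `top_matches` is a made-up name, and it scores with the plain `SequenceMatcher` ratio scaled to 0-100, whereas `process.extract` defaults to fuzzywuzzy's weighted `WRatio` scorer, so the numbers for weaker matches will not necessarily agree.

```python
import heapq
from difflib import SequenceMatcher

def top_matches(query, choices, limit=2):
    """Return the `limit` best (choice, score) pairs, scores scaled to 0-100."""
    scored = [
        (choice, round(100 * SequenceMatcher(None, query, choice).ratio()))
        for choice in choices
    ]
    # nlargest returns the pairs sorted from highest to lowest score
    return heapq.nlargest(limit, scored, key=lambda pair: pair[1])

print(top_matches("biiild", ["build", "init", "preview"], limit=2))
```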

### Full third-party

fuzzywuzzy can also be used with python-Levenshtein to change from the Ratcliff-Obershelp similarity algorithm to Levenshtein distance (faster).
Details can be found on their GitHub page. To use it, just use the above methods and install the package:
pip install python-Levenshtein
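
For intuition about the metric itself, below is a minimal pure-Python sketch of Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. It is purely illustrative; `levenshtein_distance` is a made-up name, and python-Levenshtein uses its own compiled implementation rather than this code.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # Keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # Distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # Distance from the first i characters of a to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # Deletion
                current[j - 1] + 1,      # Insertion
                previous[j - 1] + cost,  # Substitution (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein_distance("biiild", "build"))  # 2
```

Unlike the ratios above, this is a distance: 0 means the strings are identical and larger values mean less similar.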