@Christopher-Thornton
Last active August 28, 2020 17:47
import hmni
# Initialize a Matcher Object
matcher = hmni.Matcher(model='latin')
# Single Pair Similarity
matcher.similarity('Alan', 'Al')
# 0.6838303319889133
matcher.similarity('Alan', 'Al', prob=False)
# 1
matcher.similarity('Alan Turing', 'Al Turing', surname_first=False)
# 0.6838303319889133
# Record Linkage
import pandas as pd
df1 = pd.DataFrame({'name': ['Al', 'Mark', 'James', 'Harold']})
df2 = pd.DataFrame({'name': ['Mark', 'Alan', 'James', 'Harold']})
merged = matcher.fuzzymerge(df1, df2, how='left', on='name')
# Name Deduplication and Normalization
names_list = ['Alan', 'Al', 'Al', 'James']
matcher.dedupe(names_list, keep='longest')
# ['Alan', 'James']
matcher.dedupe(names_list, keep='frequent')
# ['Al', 'James']
matcher.dedupe(names_list, keep='longest', replace=True)
# ['Alan', 'Alan', 'Alan', 'James']
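
The `keep` and `replace` options above can be pictured with a small pure-Python sketch. It is not hmni's implementation: it assumes the names have already been grouped into matching clusters (the part hmni does with its model), and `pick_representative`, `dedupe_groups`, and `groups` are hypothetical names introduced only for illustration.

```python
from collections import Counter


def pick_representative(group, keep="longest"):
    """Choose one name to stand for a group of names already judged to match."""
    if keep == "longest":
        # max() returns the first of several equally long names
        return max(group, key=len)
    if keep == "frequent":
        # most_common(1) yields [(name, count)] for the most frequent name
        return Counter(group).most_common(1)[0][0]
    raise ValueError("keep must be 'longest' or 'frequent'")


def dedupe_groups(groups, keep="longest", replace=False):
    """Collapse each group to its representative, or with replace=True
    substitute the representative for every member of the group."""
    result = []
    for group in groups:
        representative = pick_representative(group, keep)
        if replace:
            result.extend([representative] * len(group))
        else:
            result.append(representative)
    return result


# Hypothetical pre-computed clusters for ['Alan', 'Al', 'Al', 'James']
groups = [["Alan", "Al", "Al"], ["James"]]

print(dedupe_groups(groups, keep="longest"))                # ['Alan', 'James']
print(dedupe_groups(groups, keep="frequent"))               # ['Al', 'James']
print(dedupe_groups(groups, keep="longest", replace=True))  # ['Alan', 'Alan', 'Alan', 'James']
```

The sketch only covers the selection policy; deciding which names belong in the same group is what the matcher itself is for.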