@dineshdharme
Created March 19, 2024 18:22
Clustering names using the simhash algorithm, so that each cluster can be processed further with the fuzzywuzzy library to find matches.
https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78188853#78188853
I have taken inspiration from this blogpost to write the following code.
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
The `cluster_names` function clusters the strings in the list based on the `cluster_threshold` value: a name joins the first cluster whose centroid hash is within that distance, otherwise it starts a new cluster. You can tweak this value to get good results. You can also play around with `shingling_width` in `name_to_features`: for example, create features of width 2, 3, 4, 5 and so on and concatenate them together.
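The idea of concatenating shingles of several widths could look roughly like the sketch below. The function name `name_to_multi_width_features` and the default `widths` are illustrative assumptions, not part of the original code.

```python
def name_to_multi_width_features(name, widths=(2, 3, 4)):
    """Concatenate character shingles of several widths into one feature list.

    Hypothetical helper: the name and the default widths are assumptions
    chosen for illustration only.
    """
    name = name.lower()
    features = []
    for width in widths:
        # range() is empty when the name is shorter than the width,
        # so short names simply contribute no shingles at that width.
        features.extend(name[i:i + width] for i in range(len(name) - width + 1))
    return features


print(name_to_multi_width_features("Bob", widths=(2, 3)))
```

To try it, the resulting list can be passed to `Simhash(...)` in place of `name_to_features(name)` inside `cluster_names`. Wider shingles make the hash more sensitive to character order, so `cluster_threshold` will probably need retuning.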
Once you are satisfied with your clusters, you can run `fuzzywuzzy` matching (this library has been renamed to `thefuzz`) within each cluster to find more exact matches.
https://github.com/seatgeek/thefuzz
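As a rough sketch of that second pass, the snippet below scores every pair of names inside one cluster and keeps the pairs above a cutoff. To stay dependency-free it uses the standard library's `difflib.SequenceMatcher` as a stand-in scorer rather than `thefuzz`. The helper name `similar_pairs_in_cluster` and the `min_score` default are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs_in_cluster(names, min_score=90):
    """Return (name_a, name_b, score) for pairs scoring at least min_score.

    Hypothetical helper. The score is a 0-100 similarity computed with
    difflib as a stand-in; a thefuzz scorer could be swapped in here.
    """
    pairs = []
    for a, b in combinations(names, 2):
        score = round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())
        if score >= min_score:
            pairs.append((a, b, score))
    # Best matches first.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


print(similar_pairs_in_cluster(["Bob", "Bobb", "Alice"], min_score=80))
```

Comparing only within a cluster keeps the number of pairwise comparisons small, which is the point of clustering first.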
First install the `simhash` Python library, then run the following code.
`pip install simhash`
from simhash import Simhash


def simhash_distance(hash1, hash2):
    # Distance between two Simhash objects (number of differing bits).
    return hash1.distance(hash2)


def name_to_features(name, shingling_width=2):
    # Split the lowercased name into overlapping character shingles.
    name = name.lower()
    return [name[i:i + shingling_width] for i in range(len(name) - shingling_width + 1)]


def cluster_names(names_list, cluster_threshold=20):
    clusters_internal = []
    name_hashes = [(name, Simhash(name_to_features(name))) for name in names_list]

    for name, hash_val in name_hashes:
        found_cluster = False
        # Add the name to the first cluster whose centroid is close enough.
        for cluster_ele in clusters_internal:
            if simhash_distance(cluster_ele['centroid'], hash_val) <= cluster_threshold:
                cluster_ele['names'].append(name)
                found_cluster = True
                break
        # Otherwise start a new cluster with this name's hash as its centroid.
        if not found_cluster:
            clusters_internal.append({'centroid': hash_val, 'names': [name]})

    return clusters_internal


# Example usage
names = ["Alice", "Alicia", "Alise", "Alyce", "Bob", "Bobb"]
clusters = cluster_names(names)
for i, cluster in enumerate(clusters, 1):
    print(f"Cluster {i}: {cluster['names']}")

data = [
    "Arvind Kathmandu",
    "Arvind Kathmands",
    "Arbind Kathmandu",
    "Arvinds Kathmandu",
    "Arveen Kathmandu",
    "Arvins Kathmandu",
    "Arvind Kathmandu Nepal",
    "Abhishek Pokhara",
    "Abhisheks Pokhara",
    "Abhishek1 Pokhara",
    "Abhishek2 Pokhara",
    "Abhishek3 Pokhara"
]

clusters_data = cluster_names(data)
for i, cluster in enumerate(clusters_data, 1):
    print(f"Cluster {i}: {cluster['names']}")
Output:
Cluster 1: ['Alice', 'Alicia', 'Alise', 'Alyce']
Cluster 2: ['Bob', 'Bobb']
Cluster 1: ['Arvind Kathmandu', 'Arvind Kathmands', 'Arbind Kathmandu', 'Arvinds Kathmandu', 'Arveen Kathmandu', 'Arvins Kathmandu', 'Arvind Kathmandu Nepal']
Cluster 2: ['Abhishek Pokhara', 'Abhisheks Pokhara', 'Abhishek1 Pokhara', 'Abhishek2 Pokhara', 'Abhishek3 Pokhara']
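To get a feel for what `cluster_threshold` means, it helps to see the distance as a count of differing bits between two fingerprints. The sketch below assumes 64-bit integer fingerprints (64 bits is, as far as I know, the `simhash` library's default width) and is not the library's own code. `hamming_distance` is a hypothetical helper name.

```python
def hamming_distance(fingerprint_a, fingerprint_b, bits=64):
    """Count the bit positions where two integer fingerprints differ.

    Hypothetical helper; bits=64 is an assumed fingerprint width.
    """
    mask = (1 << bits) - 1
    # XOR leaves a 1 wherever the two fingerprints disagree.
    return bin((fingerprint_a ^ fingerprint_b) & mask).count("1")


print(hamming_distance(0b1011, 0b0010))
```

Under that assumption, a threshold of 20 lets two names share a cluster when up to 20 of the 64 bits differ, which is fairly permissive. Lowering it should give tighter clusters.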