Created
March 19, 2024 18:22
Clustering names with the SimHash algorithm, so that the matches in each cluster can be processed further with the fuzzywuzzy library.
https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78188853#78188853
I have taken inspiration from this blogpost to write the following code.
https://leons.im/posts/a-python-implementation-of-simhash-algorithm/
The `cluster_names` function clusters the strings in the list based on the `cluster_threshold` value: a name joins the first cluster whose stored hash is within that distance, otherwise it starts a new cluster. You can tweak this value to get good results. You can also play around with `shingling_width` in `name_to_features`, for example by creating features of width 2, 3, 4, 5 and so on and concatenating them together.
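The idea of concatenating shingles of several widths can be sketched as follows. This is only an illustration: `multi_width_features` and its `widths` parameter are hypothetical names, not part of the code in this gist.

```python
def multi_width_features(name, widths=(2, 3, 4)):
    """Concatenate character shingles of several widths into one feature list."""
    name = name.lower()
    features = []
    for width in widths:
        # Names shorter than `width` simply contribute no shingles of that width
        features.extend(name[i:i + width] for i in range(len(name) - width + 1))
    return features


print(multi_width_features("Alice"))
# ['al', 'li', 'ic', 'ce', 'ali', 'lic', 'ice', 'alic', 'lice']
```

The resulting list could presumably be passed to `Simhash(...)` in place of the output of `name_to_features(name)`; whether it improves the clusters depends on your data.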
Once you are satisfied with your clusters, you can run `fuzzywuzzy` matching (this library has been renamed to `thefuzz`) within them to find more exact matches.
https://github.com/seatgeek/thefuzz
First install the `simhash` Python library, then run the following code.
`pip install simhash`
from simhash import Simhash

def simhash_distance(hash1, hash2):
    # Hamming distance between the two fingerprints
    return hash1.distance(hash2)

def name_to_features(name, shingling_width=2):
    # Split the lowercased name into overlapping character shingles
    name = name.lower()
    return [name[i:i + shingling_width] for i in range(len(name) - shingling_width + 1)]

def cluster_names(names_list, cluster_threshold=20):
    clusters_internal = []
    name_hashes = [(name, Simhash(name_to_features(name))) for name in names_list]
    for name, hash_val in name_hashes:
        found_cluster = False
        for cluster_ele in clusters_internal:
            # 'centroid' is the hash of the first name placed in the cluster
            if simhash_distance(cluster_ele['centroid'], hash_val) <= cluster_threshold:
                cluster_ele['names'].append(name)
                found_cluster = True
                break
        if not found_cluster:
            clusters_internal.append({'centroid': hash_val, 'names': [name]})
    return clusters_internal
# Example usage
names = ["Alice", "Alicia", "Alise", "Alyce", "Bob", "Bobb"]
clusters = cluster_names(names)
for i, cluster in enumerate(clusters, 1):
    print(f"Cluster {i}: {cluster['names']}")

data = [
    "Arvind Kathmandu",
    "Arvind Kathmands",
    "Arbind Kathmandu",
    "Arvinds Kathmandu",
    "Arveen Kathmandu",
    "Arvins Kathmandu",
    "Arvind Kathmandu Nepal",
    "Abhishek Pokhara",
    "Abhisheks Pokhara",
    "Abhishek1 Pokhara",
    "Abhishek2 Pokhara",
    "Abhishek3 Pokhara"
]
clusters_data = cluster_names(data)
for i, cluster in enumerate(clusters_data, 1):
    print(f"Cluster {i}: {cluster['names']}")
Output:
Cluster 1: ['Alice', 'Alicia', 'Alise', 'Alyce']
Cluster 2: ['Bob', 'Bobb']
Cluster 1: ['Arvind Kathmandu', 'Arvind Kathmands', 'Arbind Kathmandu', 'Arvinds Kathmandu', 'Arveen Kathmandu', 'Arvins Kathmandu', 'Arvind Kathmandu Nepal']
Cluster 2: ['Abhishek Pokhara', 'Abhisheks Pokhara', 'Abhishek1 Pokhara', 'Abhishek2 Pokhara', 'Abhishek3 Pokhara']
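
The follow-up matching step with `thefuzz` mentioned earlier could look roughly like the sketch below, which scores every pair of names inside one cluster with `fuzz.ratio`. The helper names `similarity` and `refine_cluster` and the `min_score=90` cutoff are assumptions made for illustration. If `thefuzz` is not installed, the sketch falls back to `difflib.SequenceMatcher` from the standard library, whose scores are similar but not necessarily identical.

```python
from itertools import combinations

try:
    from thefuzz import fuzz  # pip install thefuzz

    def similarity(a, b):
        # Similarity score between 0 and 100
        return fuzz.ratio(a, b)
except ImportError:
    # Assumption: stand-in scorer from the standard library, not thefuzz itself
    from difflib import SequenceMatcher

    def similarity(a, b):
        return round(100 * SequenceMatcher(None, a, b).ratio())


def refine_cluster(names, min_score=90):
    """Return (name_a, name_b, score) for pairs in one cluster scoring at least min_score."""
    pairs = []
    for a, b in combinations(names, 2):
        score = similarity(a.lower(), b.lower())
        if score >= min_score:
            pairs.append((a, b, score))
    return pairs


# Hypothetical usage on part of one cluster produced by cluster_names
cluster = ["Arvind Kathmandu", "Arvind Kathmands", "Arvind Kathmandu Nepal"]
for a, b, score in refine_cluster(cluster):
    print(a, "|", b, "|", score)
```

Comparing only names that share a cluster keeps the number of pairwise comparisons much smaller than scoring every name against every other name in the full list.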