dineshdharme/ClusteringNamesUsingSimHashing.py

## ClusteringNamesUsingSimHashing.py
https://stackoverflow.com/questions/78186018/fuzzy-logic-to-match-the-records-in-a-dataframe/78188853#78188853

The code below is inspired by this blog post:

https://leons.im/posts/a-python-implementation-of-simhash-algorithm/

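The `cluster_names` function groups the strings in the list greedily: each name joins the first existing cluster whose representative hash is within `cluster_threshold` bits (Hamming distance) of its own hash, otherwise it starts a new cluster. Tune `cluster_threshold` until the clusters look right for your data. You can also experiment with `shingling_width` in `name_to_features`, for example by generating shingles of width 2, 3, 4, 5 and so on and concatenating them into a single feature list.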
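As a rough illustration of concatenating shingles of several widths, here is a minimal sketch. The helper `name_to_multi_width_features` and its `widths` parameter are hypothetical names, not part of the original answer; `name_to_features` is repeated from the code in this post so the snippet runs on its own.

```python
def name_to_features(name, shingling_width=2):
    # Character shingling as in the code from this post
    name = name.lower()
    return [name[i:i + shingling_width] for i in range(len(name) - shingling_width + 1)]


def name_to_multi_width_features(name, widths=(2, 3, 4)):
    # Hypothetical helper: build shingles for each width and concatenate them
    features = []
    for width in widths:
        features.extend(name_to_features(name, width))
    return features


print(name_to_multi_width_features("Alice", widths=(2, 3)))
# ['al', 'li', 'ic', 'ce', 'ali', 'lic', 'ice']
```

The combined list could then be passed to `Simhash(...)` in place of the single-width features.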

Once you are satisfied with your clusters, you can run `fuzzywuzzy` matching (this library has been renamed to `thefuzz`) within each cluster to find closer matches.

https://github.com/seatgeek/thefuzz
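A second pass inside one cluster might look like the sketch below. To keep it runnable without extra installs it uses the standard library's `difflib.SequenceMatcher` as a stand-in for `thefuzz`; with `thefuzz` installed, `fuzz.ratio` (which scores on a 0 to 100 scale) could be swapped in. The helper names `similarity` and `close_pairs` and the `min_ratio` value are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Ratio in [0, 1]; a stdlib stand-in for thefuzz's fuzz.ratio
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def close_pairs(names_in_cluster, min_ratio=0.9):
    # Hypothetical helper: compare every pair inside one cluster
    # and keep the pairs whose similarity reaches min_ratio
    pairs = []
    for i, first in enumerate(names_in_cluster):
        for second in names_in_cluster[i + 1:]:
            score = similarity(first, second)
            if score >= min_ratio:
                pairs.append((first, second, round(score, 3)))
    return pairs


cluster = ["Arvind Kathmandu", "Arvind Kathmands", "Abhishek Pokhara"]
for first, second, score in close_pairs(cluster):
    print(first, "<->", second, score)
```

Because the pairwise comparison is quadratic, running it per cluster rather than over the whole list is what keeps it affordable.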


First install the `simhash` Python library, then run the following code.

`pip install simhash`


    from simhash import Simhash


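    def simhash_distance(hash1, hash2):
        # Hamming distance: number of bit positions where the two fingerprints differ
        return hash1.distance(hash2)


    def name_to_features(name, shingling_width=2):
        # Split the lowercased name into overlapping character shingles,
        # e.g. "bob" -> ["bo", "ob"] for shingling_width=2.
        # Names shorter than shingling_width yield an empty list.
        name = name.lower()
        return [name[i:i + shingling_width] for i in range(len(name) - shingling_width + 1)]


    def cluster_names(names_list, cluster_threshold=20):
        clusters_internal = []
        # Hash every name once up front
        name_hashes = [(name, Simhash(name_to_features(name))) for name in names_list]

        for name, hash_val in name_hashes:
            found_cluster = False
            # Greedy assignment: join the first cluster that is close enough
            for cluster_ele in clusters_internal:
                if simhash_distance(cluster_ele['centroid'], hash_val) <= cluster_threshold:
                    cluster_ele['names'].append(name)
                    found_cluster = True
                    break
            if not found_cluster:
                # 'centroid' is the hash of the first name in the cluster;
                # it is not recomputed as more names are added
                clusters_internal.append({'centroid': hash_val, 'names': [name]})
        return clusters_internal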


    # Example usage
    names = ["Alice", "Alicia", "Alise", "Alyce", "Bob", "Bobb"]
    clusters = cluster_names(names)
    for i, cluster in enumerate(clusters, 1):
        print(f"Cluster {i}: {cluster['names']}")

    data = [
        "Arvind Kathmandu",
        "Arvind Kathmands",
        "Arbind Kathmandu",
        "Arvinds Kathmandu",
        "Arveen Kathmandu",
        "Arvins Kathmandu",
        "Arvind Kathmandu Nepal",
        "Abhishek Pokhara",
        "Abhisheks Pokhara",
        "Abhishek1 Pokhara",
        "Abhishek2 Pokhara",
        "Abhishek3 Pokhara"
    ]

    clusters_data = cluster_names(data)
    for i, cluster in enumerate(clusters_data, 1):
        print(f"Cluster {i}: {cluster['names']}")

Output:

    Cluster 1: ['Alice', 'Alicia', 'Alise', 'Alyce']
    Cluster 2: ['Bob', 'Bobb']
    Cluster 1: ['Arvind Kathmandu', 'Arvind Kathmands', 'Arbind Kathmandu', 'Arvinds Kathmandu', 'Arveen Kathmandu', 'Arvins Kathmandu', 'Arvind Kathmandu Nepal']
    Cluster 2: ['Abhishek Pokhara', 'Abhisheks Pokhara', 'Abhishek1 Pokhara', 'Abhishek2 Pokhara', 'Abhishek3 Pokhara']
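To get a feel for what `cluster_threshold` measures, the sketch below builds a toy 64-bit SimHash fingerprint with the standard library only and compares fingerprints by Hamming distance. It is not the `simhash` package's implementation, and its fingerprints will not necessarily match that package's; the names `toy_simhash`, `hamming_distance` and `shingles` are hypothetical, and the use of MD5 truncated to 64 bits is an assumption made for the example.

```python
import hashlib


def toy_simhash(features, bits=64):
    # One signed counter per bit position
    counts = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        # Assumption: use the low `bits` bits of the MD5 digest as the feature hash
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for bit in range(bits):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    # A fingerprint bit is set where more features voted 1 than 0
    fingerprint = 0
    for bit in range(bits):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    # Number of differing bits between two fingerprints
    return bin(a ^ b).count("1")


def shingles(name, width=2):
    name = name.lower()
    return [name[i:i + width] for i in range(len(name) - width + 1)]


a = toy_simhash(shingles("Arvind Kathmandu"))
b = toy_simhash(shingles("Arvind Kathmands"))
c = toy_simhash(shingles("Abhishek Pokhara"))
print(hamming_distance(a, b), hamming_distance(a, c))
```

With 64-bit fingerprints the distance ranges from 0 (identical) to 64, so a threshold of 20 accepts names whose fingerprints differ in up to 20 bits.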