@arnehuang
Created April 11, 2018 13:45
Matching two addresses via cosine similarity of one-hot encoded character n-grams
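from nltk.util import ngrams
from sklearn.metrics.pairwise import cosine_similarity
import string
import itertools

# Every possible 3-gram over a-z and 0-9 (36**3 = 46656 entries).
# The position of a 3-gram in this list is its index in the vector.
vector_of_possibilities = [''.join(i) for i in itertools.product(string.ascii_lowercase + string.digits, repeat=3)]
# Map each 3-gram to its index so lookups avoid a linear scan of the list.
index_of_possibility = {agram: i for i, agram in enumerate(vector_of_possibilities)}

def get_3grams(astring):
    # Lowercase, keep only alphanumeric characters, then slide a window of 3.
    newstring = [achar for achar in astring.lower() if achar.isalnum()]
    return [''.join(agram) for agram in ngrams(newstring, 3)]

def string_to_vec(astring):
    # Binary vector: 1 where the 3-gram occurs in the string, 0 elsewhere.
    vec = [0] * len(vector_of_possibilities)
    for agram in get_3grams(astring):
        i = index_of_possibility.get(agram)  # None if the 3-gram has characters outside a-z/0-9
        if i is not None:
            vec[i] = 1
    return [vec]  # wrapped in a list because cosine_similarity expects 2D input

print(cosine_similarity(string_to_vec('48 Leabrooks Road Alfreton'),
                        string_to_vec(u'48 Leabrooks RdSomercotesAlfretonDE55 4HB')))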
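
Because the vectors above are binary, their cosine similarity reduces to |A ∩ B| / sqrt(|A| · |B|), where A and B are the sets of distinct 3-grams in each string. The following is a minimal sketch of that shortcut using only the standard library. It is not part of the original gist: `char_3grams`, `binary_cosine` and `ALPHABET` are hypothetical names introduced here, and returning 0.0 when either string has no 3-grams is an assumption made for this sketch.

```python
import math
import string

# Same 36-character alphabet the gist uses to enumerate its 3-gram vocabulary.
ALPHABET = set(string.ascii_lowercase + string.digits)


def char_3grams(text):
    """Return the set of distinct 3-grams over a-z/0-9 found in text."""
    # Lowercase and keep alphanumeric characters, as get_3grams does.
    chars = [c for c in text.lower() if c.isalnum()]
    grams = {''.join(chars[i:i + 3]) for i in range(len(chars) - 2)}
    # Drop 3-grams with characters outside the vocabulary; in the dense
    # version they never match a vocabulary entry, so they set no bit.
    return {g for g in grams if set(g) <= ALPHABET}


def binary_cosine(a, b):
    """Cosine similarity of the binary 3-gram vectors of a and b."""
    grams_a, grams_b = char_3grams(a), char_3grams(b)
    if not grams_a or not grams_b:
        return 0.0  # assumption: an empty 3-gram set has similarity 0
    return len(grams_a & grams_b) / math.sqrt(len(grams_a) * len(grams_b))


print(binary_cosine('48 Leabrooks Road Alfreton',
                    '48 Leabrooks RdSomercotesAlfretonDE55 4HB'))
```

This avoids building two 46656-element lists per comparison, which may matter when many address pairs are compared. It returns a plain float rather than the 1x1 array that `cosine_similarity` produces.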