Skip to content

Instantly share code, notes, and snippets.

@temoto
Created November 27, 2014 17:24
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save temoto/17fc80eec1967acc9da4 to your computer and use it in GitHub Desktop.
Save temoto/17fc80eec1967acc9da4 to your computer and use it in GitHub Desktop.
Shingle based fuzzy text comparison
def shinglize(s):
return frozenset(s[i:i + 3] for i in xrange(len(s) - 2))
def shingle_compare(sh1, string):
'''set(shingle), 'string' -> float 0..1
'''
sh2 = shinglize(string)
if not sh1 or not sh2:
return 0
inter = sh1 & sh2
return float(len(inter)) / max(len(sh1), len(sh2))
def example():
items = [] # open('file', 'rb') or cursor.fetchall() or else
s = raw_input()
needle_sh = shinglize(s)
matches = []
for item in items:
score = shingle_compare(needle_sh, item)
if score > 0.2:
matches.append((item, score))
matches.sort(key=lambda (item, score): (1 - weight, item))
print(matches[:10])
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment