@hassek
Forked from anonymous/test.py
Created April 13, 2012 20:22
comparing data by difference ratio
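from difflib import SequenceMatcher

from mturk.models import Hit
from message.models import Email

# Subjects of every email that has an associated Hit.
email_ids = [m.email_id for m in Hit.objects.all()]
emails = [e.subject for e in Email.objects.filter(id__in=email_ids)]

# res maps a representative subject to the list of subjects similar to it.
res = {}
for e in emails[:100]:
    f = False  # set to True once e has been placed in an existing group
    if not res:
        res[e] = [e]
        continue
    # list() keeps the iteration safe on Python 3 as well as Python 2.
    for k in list(res.keys()):
        pa = SequenceMatcher(None, e, k)
        if pa.ratio() >= 0.5:
            # k is always already a key of res, so a plain append is enough.
            res[k].append(e)
            f = True
            break
    if not f:
        # No existing group is similar enough: e starts its own group.
        res[e] = [e]

# Print one representative subject per group.
for l in res.values():
    print(l[0])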
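
The script above depends on the Django models `Hit` and `Email`, so it cannot be run outside that project. The sketch below shows the same greedy grouping on a plain list of strings. The function name `group_by_similarity`, the `threshold` parameter, and the sample subjects are made up for illustration and are not part of the original gist. It assumes Python 3.7+, where dicts preserve insertion order.

```python
from difflib import SequenceMatcher


def group_by_similarity(subjects, threshold=0.5):
    """Greedily group strings.

    Each string joins the first existing group whose representative
    (the first string that opened the group) has a SequenceMatcher
    ratio of at least `threshold`; otherwise it opens a new group.
    """
    groups = {}  # representative -> list of members
    for subject in subjects:
        for representative in groups:
            matcher = SequenceMatcher(None, subject, representative)
            if matcher.ratio() >= threshold:
                groups[representative].append(subject)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups[subject] = [subject]
    return groups


# Hypothetical sample data standing in for the email subjects.
sample = [
    "Your invoice for March",
    "Your invoice for April",
    "Password reset request",
    "Your invoice for March",
]
groups = group_by_similarity(sample)
for representative, members in groups.items():
    print(representative, len(members))
```

Because each string is compared only against group representatives, the result depends on input order, and the cost grows with the number of groups times the number of strings. The 0.5 threshold is taken from the gist; a higher value gives more, tighter groups.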