Created April 13, 2012 20:22
from difflib import SequenceMatcher

from mturk.models import Hit
from message.models import Email

# Subjects of every email that has an associated HIT.
email_ids = [m.email_id for m in Hit.objects.all()]
emails = [e.subject for e in Email.objects.filter(id__in=email_ids)]

# Greedy grouping: the first subject seen in each group becomes its key.
res = {}
for e in emails[:100]:
    for k in res:
        # Join the first existing group whose key is at least 50% similar.
        if SequenceMatcher(None, e, k).ratio() >= 0.5:
            res[k].append(e)
            break
    else:
        # No similar group was found (or res is still empty): start a new one.
        res[e] = [e]

# Print one representative subject for each group with more than one member.
lists = [r for r in res.values() if len(r) > 1]
for l in lists:
    print(l[0])
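
The script above needs the Django models `Hit` and `Email` to run. To try the grouping step on its own, here is a minimal sketch that applies the same greedy `SequenceMatcher` comparison to a plain list of strings. The function name `group_similar`, the `threshold` parameter, and the sample subjects are hypothetical and not part of the original gist.

```python
from difflib import SequenceMatcher


def group_similar(subjects, threshold=0.5):
    """Greedily group strings by similarity (hypothetical helper).

    The first string placed in a group becomes that group's key. Each later
    string joins the first group whose key it matches with a ratio of at
    least `threshold`; otherwise it starts a new group.
    """
    groups = {}
    for subject in subjects:
        for key in groups:
            if SequenceMatcher(None, subject, key).ratio() >= threshold:
                groups[key].append(subject)
                break
        else:
            # No existing key was similar enough.
            groups[subject] = [subject]
    return groups


# Made-up sample data, for illustration only.
sample = ["Invoice for March", "Invoice for April", "Password reset"]
for members in group_similar(sample).values():
    if len(members) > 1:
        print(members[0])
```

Because each string is compared only against group keys, the result depends on input order: a different ordering of the same subjects can produce different groups.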