@mhluongo
Created March 12, 2012 19:25
An implementation of a "soft Jaccard" set similarity measure
>>> import jellyfish
>>> from soft_jaccard import soft_jaccard
>>> c1 = set(['CL Isbell','C. L. Isbell'])
>>> c2 = set(['C Isbell','C Isbell, Jr.'])
>>> soft_jaccard(c1, c2, jellyfish.jaro_winkler)
0.75848950260673509
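def soft_jaccard(a, b, sim_func):
    """
    Return a measure of two sets' similarity, based on the similarity of
    their elements.

    Arguments:
    a - a set or otherwise de-duplicated list of strings
    b - a set or otherwise de-duplicated list of strings; must be non-empty,
        since the mean over b divides by len(b)
    sim_func - a function that takes two str args and returns a measure of
               their similarity, a float in [0,1]
    """
    # "Soft" intersection size: for each element of a, its mean similarity
    # to the elements of b, summed over a.
    intersection_length = sum(sum(sim_func(i, j) for j in b)/float(len(b)) for i in a)
    # Jaccard form: |intersection| / |union|, with |union| = |a| + |b| - |intersection|.
    return float(intersection_length)/(len(a) + len(b) - intersection_length)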
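To make the arithmetic above easier to check by hand, here is a minimal sketch that reruns the same computation with no third-party dependency. The helpers `ratio` and `exact` are hypothetical stand-ins introduced for illustration, not part of the gist: `ratio` wraps the standard library's `difflib.SequenceMatcher`, and `exact` is a 0/1 match. The function body is restated for Python 3, where `/` is already true division.

```python
from difflib import SequenceMatcher


def soft_jaccard(a, b, sim_func):
    """Same computation as the gist's function, restated for Python 3."""
    # Mean similarity of each element of a to the elements of b, summed over a.
    soft_intersection = sum(sum(sim_func(i, j) for j in b) / len(b) for i in a)
    return soft_intersection / (len(a) + len(b) - soft_intersection)


def ratio(x, y):
    # Hypothetical stand-in for jellyfish.jaro_winkler: difflib's ratio,
    # which is also a float in [0, 1].
    return SequenceMatcher(None, x, y).ratio()


def exact(x, y):
    # Hypothetical 0/1 similarity, used only to make the numbers checkable.
    return 1.0 if x == y else 0.0


c1 = {'CL Isbell', 'C. L. Isbell'}
c2 = {'C Isbell', 'C Isbell, Jr.'}
score = soft_jaccard(c1, c2, ratio)

# With exact matching the soft intersection is easy to compute by hand.
singleton = soft_jaccard({'a'}, {'a'}, exact)            # 1 / (1 + 1 - 1)
pair = soft_jaccard({'a', 'b'}, {'a', 'b'}, exact)       # 1 / (2 + 2 - 1)
forward = soft_jaccard({'a'}, {'a', 'b'}, exact)         # 0.5 / (1 + 2 - 0.5)
backward = soft_jaccard({'a', 'b'}, {'a'}, exact)        # 1 / (2 + 1 - 1)

print(score, singleton, pair, forward, backward)
```

Two properties follow from the averaging over `b` and may be worth keeping in mind: with exact matching, two identical two-element sets score 1/3 rather than 1, so the measure does not reduce to the classic Jaccard index, and swapping the arguments can change the result when the sets differ in size (0.2 versus 0.5 in the last two calls).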