@dustinboswell
Last active December 2, 2021 19:55
Rough code for comparing document similarity with MinHash
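# Assumption: the gist does not say where murmurhash comes from; the third-party mmh3 package is one option.
from mmh3 import hash as murmurhash

def minhash(text, window=25):  # assume len(text) > 50
    # Hash every `window`-character shingle, then keep the 20 smallest distinct hashes.
    hashes = {murmurhash(text[i:i+window]) for i in range(len(text)-window+1)}
    return set(sorted(hashes)[0:20])

def similarity(text1, text2):
    hashes1 = minhash(text1)
    hashes2 = minhash(text2)
    # Fraction of text1's sketch that also appears in text2's sketch.
    return len(hashes1 & hashes2) / len(hashes1)
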
A = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen"
B = "one two three four 5 6 7 8 9 ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen"
C = " 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18"
similarity(A, A) # 1.0
similarity(A, B) # 0.6
similarity(B, C) # 0.1
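
The score above is the fraction of one sketch found in the other, which only approximates Jaccard similarity. As a rough cross-check, the sketch below compares an exact Jaccard similarity over character shingles with a bottom-k MinHash estimate of it. It uses only the standard library: `stable_hash` is a hypothetical stand-in for `murmurhash` (built on `hashlib.md5`, which is not what the gist uses), and `shingles`, `exact_jaccard`, `bottom_k`, and `estimated_jaccard` are names made up for this illustration. It assumes both texts are at least `window` characters long.

```python
import hashlib

def stable_hash(s):
    # Hypothetical stand-in for murmurhash: first 8 bytes of MD5 as an integer.
    return int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")

def shingles(text, window=25):
    # All distinct `window`-character substrings of the text.
    return {text[i:i + window] for i in range(len(text) - window + 1)}

def exact_jaccard(text1, text2, window=25):
    # |intersection| / |union| of the two shingle sets.
    s1, s2 = shingles(text1, window), shingles(text2, window)
    return len(s1 & s2) / len(s1 | s2)

def bottom_k(text, k=20, window=25):
    # The k smallest distinct shingle hashes form the sketch.
    return set(sorted({stable_hash(s) for s in shingles(text, window)})[:k])

def estimated_jaccard(text1, text2, k=20, window=25):
    m1, m2 = bottom_k(text1, k, window), bottom_k(text2, k, window)
    # Take the k smallest hashes of the combined sketches and count
    # how many of them appear in both sketches.
    union_k = set(sorted(m1 | m2)[:k])
    return len(union_k & m1 & m2) / len(union_k)

A = "one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen"
B = "one two three four 5 6 7 8 9 ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen"

print(exact_jaccard(A, B), estimated_jaccard(A, B))
```

Identical texts score 1.0 under both functions. For partially overlapping texts, the estimate moves in steps of 1/k, so a larger `k` gives a finer-grained (and usually closer) approximation at the cost of a bigger sketch.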