@dustinboswell
Last active December 2, 2021 20:15
Computing shingleprints for a document
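def min_max_hashes(text, window=60):
    # Hash every `window`-character shingle and keep the smallest and largest hash.
    # murmurhash is not defined here; any string hash function can be substituted.
    # max(1, ...) gives a text shorter than `window` a single shingle instead of an empty list.
    hashes = [murmurhash(text[i:i+window]) for i in range(max(1, len(text) - window + 1))]
    return [min(hashes), max(hashes)]

def shingleprints(text):
    half = len(text) // 2  # integer division; len(text)/2 is a float in Python 3
    min1, max1 = min_max_hashes(text[:half])
    min2, max2 = min_max_hashes(text[half:])
    # combine pairs, using your favorite hash-value combiner
    return [hash_combine(min1, min2),
            hash_combine(min1, max2),
            hash_combine(max1, min2),
            hash_combine(max1, max2)]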
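
The snippet leaves murmurhash and hash_combine undefined, so it cannot run on its own. Below is a minimal runnable sketch under stated assumptions: murmurhash is replaced by a stand-in built on hashlib.blake2b (not real MurmurHash), hash_combine is a Boost-style combiner truncated to 64 bits, and shared_prints is a hypothetical helper added only to show how two documents might be compared. None of these choices come from the gist itself.

```python
import hashlib

MASK64 = (1 << 64) - 1

def murmurhash(s):
    # Stand-in (assumption): 8 bytes of BLAKE2b, NOT an actual MurmurHash.
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def hash_combine(a, b):
    # Boost-style combiner truncated to 64 bits (one possible choice, assumption).
    return (a ^ (b + 0x9E3779B97F4A7C15 + ((a << 6) & MASK64) + (a >> 2))) & MASK64

def min_max_hashes(text, window=60):
    # Hash every window-length shingle; short texts yield a single shingle.
    count = max(1, len(text) - window + 1)
    hashes = [murmurhash(text[i:i + window]) for i in range(count)]
    return [min(hashes), max(hashes)]

def shingleprints(text):
    half = len(text) // 2
    min1, max1 = min_max_hashes(text[:half])
    min2, max2 = min_max_hashes(text[half:])
    return [hash_combine(min1, min2), hash_combine(min1, max2),
            hash_combine(max1, min2), hash_combine(max1, max2)]

def shared_prints(doc_a, doc_b):
    # Hypothetical helper: number of shingleprints the two documents have in common.
    return len(set(shingleprints(doc_a)) & set(shingleprints(doc_b)))

doc = "the quick brown fox jumps over the lazy dog. " * 20
other = "0123456789 " * 80
print(shingleprints(doc))
print(shared_prints(doc, doc), shared_prints(doc, other))
```

Under these assumptions, the number of shared prints could serve as a rough signal that two documents overlap; how many matches to require is a design choice the gist does not specify.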