Skip to content

Instantly share code, notes, and snippets.

@bobpoekert
Last active December 31, 2015 21:49
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save bobpoekert/8049579 to your computer and use it in GitHub Desktop.
Save bobpoekert/8049579 to your computer and use it in GitHub Desktop.

Data structures:

Hash table mapping tokens -> <document-count, count-min-sketch(docuemnt id -> term count)>
Hash table mapping sketch indexes -> heap(<document id, term count dictionary> sorted by document id)

To search:

  1. sum sketches for all terms in the query
  2. find indexes of top k values in result sketch
  3. look up actual document ids and term counts for those indexes
  4. do regular tf-idf rank on that restricted set

The sketches should be small enough that they can fit on every node, and you can store the document heaps in a hash ring. Also, each sketch should be small enough that it fits in L2 cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment