bobpoekert/gist:8049579

## gistfile1.md

      
Data structures:
Hash table mapping tokens -> <document-count, count-min-sketch(document id -> term count)>
Hash table mapping sketch indexes -> heap(<document id, term count dictionary> sorted by document id)
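
A minimal Python sketch of these two tables, to make the shapes concrete. Everything named here (`WIDTH`, `DEPTH`, `cell_indexes`, `TokenEntry`, `add_document`) is hypothetical, and the choice of hashing document ids with BLAKE2 is an assumption, not something the note specifies. It also assumes document ids are unique, so the heap never has to compare two term-count dictionaries.

```python
import hashlib
import heapq
from collections import defaultdict

WIDTH = 1024  # columns per sketch row (assumed value)
DEPTH = 4     # hash rows per sketch (assumed value)


def cell_indexes(doc_id):
    """Return one flat sketch index per row for a document id."""
    indexes = []
    for row in range(DEPTH):
        digest = hashlib.blake2b(f"{row}:{doc_id}".encode(), digest_size=8).digest()
        col = int.from_bytes(digest, "big") % WIDTH
        indexes.append(row * WIDTH + col)
    return indexes


class TokenEntry:
    """Value of the first table: <document count, count-min sketch>."""

    def __init__(self):
        self.document_count = 0
        self.sketch = [0] * (WIDTH * DEPTH)

    def add(self, doc_id, term_count):
        self.document_count += 1
        for index in cell_indexes(doc_id):
            self.sketch[index] += term_count

    def estimate(self, doc_id):
        # Count-min estimate: never below the true count, may be above it.
        return min(self.sketch[i] for i in cell_indexes(doc_id))


token_table = defaultdict(TokenEntry)  # token -> TokenEntry
index_table = defaultdict(list)        # sketch index -> heap of (doc id, term counts)


def add_document(doc_id, term_counts):
    """Index one document given its {token: count} dictionary."""
    for token, count in term_counts.items():
        token_table[token].add(doc_id, count)
    for index in cell_indexes(doc_id):
        # Tuples order by doc id first, so the heap is keyed on document id.
        heapq.heappush(index_table[index], (doc_id, term_counts))
```

Because every token's sketch hashes the same document id to the same cells, a cell index identifies the same small group of documents in every sketch, which is what makes the second table usable as a reverse lookup.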

To search:

sum sketches for all terms in the query
find indexes of top k values in result sketch
look up actual document ids and term counts for those indexes
do regular tf-idf rank on that restricted set
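
The four steps above could look roughly like this in Python. This is a sketch under assumptions: the function name `search`, its parameters, and the plain tuple/list layout of the two tables are hypothetical, and `log(1 + N / df)` is just one common idf variant, since the note does not say which tf-idf formula to use.

```python
import heapq
import math


def search(query_tokens, token_table, index_table, total_docs, k=10):
    """Return document ids ranked by tf-idf over a sketch-restricted candidate set.

    token_table: token -> (document_count, sketch as a list of ints)
    index_table: sketch index -> list of (doc_id, term_counts dict)
    """
    entries = [token_table[t] for t in query_tokens if t in token_table]
    if not entries:
        return []

    # 1. Sum the sketches of all query terms, cell by cell.
    summed = [sum(cells) for cells in zip(*(sketch for _, sketch in entries))]

    # 2. Find the indexes of the top k values in the summed sketch.
    top = heapq.nlargest(k, range(len(summed)), key=summed.__getitem__)

    # 3. Look up the actual documents stored under those indexes.
    candidates = {}
    for index in top:
        if summed[index] == 0:
            continue  # empty cell: no document contains any query term
        for doc_id, term_counts in index_table.get(index, []):
            candidates[doc_id] = term_counts

    # 4. Regular tf-idf ranking on the restricted set.
    scored = []
    for doc_id, term_counts in candidates.items():
        score = 0.0
        for token in query_tokens:
            if token in token_table:
                doc_count = token_table[token][0]
                idf = math.log(1 + total_docs / doc_count)
                score += term_counts.get(token, 0) * idf
        scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]
```

Step 3 can pull in documents that merely collide with a good match in the sketch; step 4 scores them with their real term counts, so collisions cost some extra work but do not corrupt the ranking.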

The sketches should be small enough that a full copy fits on every node, while the document heaps can be distributed across nodes with a hash ring. Each sketch should also be small enough to fit in L2 cache.
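
To illustrate the placement idea, here is a small consistent-hash ring in Python that maps a sketch index to the node holding its document heap, plus a helper for checking a sketch's size. The names `HashRing`, `node_for`, `sketch_bytes`, the node labels, the 4-byte counters, and the 256 KiB L2 figure are all assumptions for illustration; real cache sizes and counter widths vary.

```python
import bisect
import hashlib


def sketch_bytes(width, depth, counter_bytes=4):
    """Memory footprint of one sketch with fixed-size counters."""
    return width * depth * counter_bytes


class HashRing:
    """Consistent-hash ring with virtual nodes (replicas) for smoother balance."""

    def __init__(self, nodes, replicas=64):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        digest = hashlib.blake2b(key.encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def node_for(self, sketch_index):
        """Return the node that owns the document heap for a sketch index."""
        pos = bisect.bisect_right(self._points, self._hash(str(sketch_index)))
        return self._ring[pos % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])  # hypothetical node names
owner = ring.node_for(42)

# A 1024 x 4 sketch of 4-byte counters is 16 KiB, well under an assumed 256 KiB L2.
fits_in_l2 = sketch_bytes(1024, 4) <= 256 * 1024
```

With this split, any node can run the sketch-summing steps locally and only needs to contact the owners of the top k indexes to fetch document heaps.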