Data structures:
Hash table mapping tokens -> <document-count, count-min-sketch(document id -> term count)>
Hash table mapping sketch indexes -> heap(<document id, term count dictionary> sorted by document id)
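A minimal sketch of how the two tables might be built, not a definitive implementation. The names (Index, cells, DEPTH, WIDTH), the blake2b-based hashing, and whitespace tokenization are all assumptions for illustration. It also assumes a "sketch index" means a (row, column) cell, and that every token's sketch shares the same shape and hash functions so that sketches can later be summed cell by cell.

```python
import hashlib
import heapq
from collections import Counter, defaultdict

DEPTH, WIDTH = 4, 64  # assumed sketch shape, shared by every token's sketch


def cells(doc_id):
    """Return the (row, column) cell that doc_id hashes to in each sketch row."""
    out = []
    for row in range(DEPTH):
        digest = hashlib.blake2b(f"{row}:{doc_id}".encode(), digest_size=8).digest()
        out.append((row, int.from_bytes(digest, "big") % WIDTH))
    return out


class Index:
    def __init__(self):
        # token -> [document count, count-min sketch of doc id -> term count]
        self.tokens = {}
        # sketch cell (row, column) -> heap of (doc id, term-count dict)
        self.buckets = defaultdict(list)

    def add(self, doc_id, text):
        """Index one document. Assumes doc ids are unique and orderable."""
        counts = Counter(text.lower().split())
        doc_cells = cells(doc_id)
        for token, n in counts.items():
            entry = self.tokens.setdefault(
                token, [0, [[0] * WIDTH for _ in range(DEPTH)]]
            )
            entry[0] += 1  # one more document contains this token
            for row, col in doc_cells:
                entry[1][row][col] += n
        for cell in doc_cells:
            # Heap ordered by doc id; unique ids mean the dicts are never compared.
            heapq.heappush(self.buckets[cell], (doc_id, counts))

    def estimate(self, token, doc_id):
        """Count-min estimate of the term count; never below the true count."""
        if token not in self.tokens:
            return 0
        sketch = self.tokens[token][1]
        return min(sketch[row][col] for row, col in cells(doc_id))


index = Index()
index.add(1, "the cat sat")
index.add(2, "the dog sat")
index.add(3, "cat cat dog")
print(index.tokens["cat"][0])    # document count for "cat"
print(index.estimate("cat", 3))  # at least the true count of 2
```

Because every document lands in one cell per row, each document appears in DEPTH buckets. That is the storage cost of being able to go from a hot cell back to candidate documents.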
To search:
- sum sketches for all terms in the query
- find indexes of top k values in result sketch
- look up actual document ids and term counts for those indexes
- do regular tf-idf rank on that restricted set
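The four steps above could look roughly like this. It is a sketch under assumptions: the function name search, the smoothed idf log(1 + N/df), and the tiny hand-built tables (2 rows by 3 columns, with cell assignments chosen by hand instead of hashed) are illustrative only. Sketches are plain nested lists and must all share one shape and one set of hash functions for the sum to mean anything.

```python
import heapq
import math


def search(query, tokens, buckets, num_docs, k=2):
    """tokens: token -> (doc count, sketch rows)
    buckets: (row, col) -> list of (doc id, term-count dict)"""
    terms = [t for t in query.lower().split() if t in tokens]
    if not terms:
        return []
    depth = len(tokens[terms[0]][1])
    width = len(tokens[terms[0]][1][0])
    # 1. Sum the sketches of all query terms, cell by cell.
    total = [[sum(tokens[t][1][r][c] for t in terms) for c in range(width)]
             for r in range(depth)]
    # 2. Find the indexes of the top k non-zero cells.
    flat = [(total[r][c], (r, c))
            for r in range(depth) for c in range(width) if total[r][c] > 0]
    top_cells = [cell for _, cell in heapq.nlargest(k, flat)]
    # 3. Look up the actual doc ids and term counts behind those cells.
    candidates = {}
    for cell in top_cells:
        for doc_id, counts in buckets.get(cell, []):
            candidates[doc_id] = counts

    # 4. Regular tf-idf on the restricted set (one common idf variant).
    def score(counts):
        return sum(counts.get(t, 0) * math.log(1 + num_docs / tokens[t][0])
                   for t in terms)

    scored = {d: score(c) for d, c in candidates.items()}
    return sorted((d for d in scored if scored[d] > 0),
                  key=lambda d: (-scored[d], d))


# Hand-built example: three documents, cells assigned by hand.
d1, d2, d3 = {"cat": 2, "sat": 1}, {"dog": 1, "sat": 1}, {"cat": 1, "dog": 3}
tokens = {
    "cat": (2, [[2, 0, 1], [1, 2, 0]]),
    "dog": (2, [[0, 1, 3], [3, 1, 0]]),
    "sat": (2, [[1, 1, 0], [0, 2, 0]]),
}
buckets = {
    (0, 0): [(1, d1)], (0, 1): [(2, d2)], (0, 2): [(3, d3)],
    (1, 0): [(3, d3)], (1, 1): [(1, d1), (2, d2)],
}
print(search("cat", tokens, buckets, num_docs=3, k=2))      # [1]
print(search("cat", tokens, buckets, num_docs=3, k=3))      # [1, 3]
print(search("cat dog", tokens, buckets, num_docs=3, k=2))  # [3]
```

With k=2 the query "cat" misses document 3 even though it contains the term. That is the approximation at work: k trades recall against the number of heaps that have to be fetched.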
The sketches should be small enough to replicate on every node, while the document heaps can be distributed across nodes with a hash ring. Each individual sketch should also be small enough to fit in L2 cache.
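One possible way to place the document heaps on a hash ring, sketched with consistent hashing and virtual nodes. HashRing, node_for, the replicas count, and the node names are hypothetical. Keying the ring by sketch cell (row, column) is an assumption about how the heaps are addressed.

```python
import bisect
import hashlib


def _point(key):
    """Map a string to a position on the ring (a 64-bit integer)."""
    return int.from_bytes(hashlib.blake2b(key.encode(), digest_size=8).digest(), "big")


class HashRing:
    def __init__(self, nodes, replicas=64):
        # Each node gets `replicas` virtual points to even out the load.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def node_for(self, cell):
        """Return the node that owns the heap for sketch cell (row, col)."""
        row, col = cell
        # First ring point clockwise from the key, wrapping around at the end.
        i = bisect.bisect_right(self._points, _point(f"{row}:{col}")) % len(self._ring)
        return self._ring[i][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for((0, 17)))  # the node that should hold the heap for cell (0, 17)
```

A search would then fetch only the heaps for its top k cells, each from the node the ring names. Removing a node remaps only the cells that node owned.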