Unifying Sketch Monoids
As I discussed in Algebra for Analytics, many sketch monoids, such as Bloom filters, HyperLogLog, and Count-min sketch, can be described as a hashing (projection) of items into a sparse space, then using two different commutative monoids to read and write respectively. Finally, the read monoids always have the property that (a + b) <= a, b and the write monoids has the property that (a + b) >= a, b.
- Note how similar CMS and Bloom filters are. The difference: bloom hashes k times onto the same space, CMS hashes k times onto a k orthogonal subspaces. Why the difference? Imagine a fixed space bloom that hashes onto k orthogonal spaces, or an overlapping CMS that hashes onto k * m length space. How do the error asymptotics change?
- CMS has many query modes (dot product, etc...) can those generalize to other sketchs (HLL, Bloom)?
- What other sketch or non-sketch algorithms can be expressed in this dual monoid representation? What other monoids have the right properties (non-increasing/non-decreasing). Set union/intersection look like candidates. What sketch monoids themselves meet this property, and how can they be used (HyperLogLog, looks like it could be the value in a CMS. What error bounds can we get)?
- Can we reuse the error bounds across different generalizations? Do we need some metric space and some triange inequality in the monoid, e.g. |a + b| <= |a|, |b| etc...