Unifying Sketch Monoids
As I discussed in Algebra for Analytics, many sketch monoids, such as Bloom filters, HyperLogLog, and Count-min sketch, can be described as a hashing (projection) of items into a sparse space, then using two different commutative monoids to read and write respectively. Finally, the read monoids always have the property that (a + b) <= a, b and the write monoids has the property that (a + b) >= a, b.
- Note how similar CMS and Bloom filters are. The difference: bloom hashes k times onto the same space, CMS hashes k times onto a k orthogonal subspaces. Why the difference? Imagine a fixed space bloom that hashes onto k orthogonal spaces, or an overlapping CMS that hashes onto k * m length space. How do the error asymptotics change?
- CMS has many query modes (dot product, etc...) can those generalize to other sketchs (HLL, Bloom)?
- What other sketch or non-sketch algorithms can be expressed in this dual monoid representation? What other monoids have the right properties (non-increasing/non-decreasing). Set union/intersection look like candidates. What sketch monoids themselves meet this property, and how can they be used (HyperLogLog, looks like it could be the value in a CMS. What error bounds can we get)?
- Can we reuse the error bounds across different generalizations? Do we need some metric space and some triange inequality in the monoid, e.g. |a + b| <= |a|, |b| etc...
With no math to support this but I believe hashing CMS K hashes into different spaces and not onto the same space has to do with the monoids used to write and subsequently read. The monoid in a CMS is a simple numeric addition while the read monoid is the Min. This is because we're looking for an upper bound on the occurrence of a value as opposed to simple existence like in the case of bloom filters. The point here is because of this, a hash collision for the same value in a CMS would be detrimental to the accuracy of the algorithm. Imagine a single item that just happens to hash to the same place k/2 times (just an example). That gives us an approximate upper bound of k/2 which is obviously very far off from the number 1, i.e., in isolation a single entry has negative impact on itself and increases the error rate for that entry. In comparison, a hash collision of multiple hashes for a single value in the case of a bloom filter does not affect the accuracy as a result I believe adding more key spaces to a bloom filter application would probably only add memory and not affect the accuracy. For example a bloom filter that uses 12 bits in a single space has 12 distinct locations where the hash value bit can land. Splitting that up by 3 rows, one for each hash gives us only 4 distinct bits (the modulo part of the algorithm). From the earlier point that collisions of different hashes for the same value are not detrimental to the accuracy of the BF for that value, splitting the memory space, at least the way I see it, only takes away from the accuracy with no added benefit. We're just wasting memory. Thoughts?