Keeps a sketch (a bitmap, a set of buckets, or registers) with the values, and combines them with a stochastic average (a harmonic mean).
The HLL algorithm stores many estimators instead of one and averages the results. We could use different hash functions for this, but it would be too expensive. Instead we use only one and split the result into different parts.
For example, if your hashed value looks like 110100000100 and k = 3, you would use the 3 rightmost bits, 100, to tell you which register to update (100 in binary = 4), and from the remaining bits, 110100000, you could take the longest run of zeroes (up to some maximum), which in this case is 5.
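The two steps above can be sketched in a few lines of Python (this is an illustration of the example, not any particular library's implementation; the function names are made up):

```python
# A sketch of the register-selection step described above, using the
# example hash 110100000100 and k = 3. Names here are illustrative.

def split_hash(hashed: int, k: int):
    """Return (register_index, remaining_bits) for a hashed value."""
    register = hashed & ((1 << k) - 1)   # the k rightmost bits pick the register
    remaining = hashed >> k              # the rest of the hash is used for counting
    return register, remaining

def longest_zero_run(bits: int, width: int) -> int:
    """Length of the longest run of zero bits in a width-bit value."""
    s = format(bits, f"0{width}b")
    return max(len(run) for run in s.split("1"))

h = 0b110100000100
reg, rest = split_hash(h, k=3)
print(reg)                        # -> 4 (binary 100)
print(longest_zero_run(rest, 9))  # -> 5 (110100000 has a run of five zeroes)
```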
A good visualisation of it
In order to compute the number of distinct values in the stream you would just take the average of the values in all of the m buckets:

E = α · m · 2^R̄, where R̄ = (1/m) · Σ_{j=1}^{m} R_j

Here R̄ is the average of the values R in all the buckets, and α is a bias-correction constant. But to get to the final HyperLogLog implementation, you still need...
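As a quick sketch of that combination step (a toy illustration, assuming the asymptotic bias-correction constant α ≈ 0.39701; not production code):

```python
# A sketch of the LogLog combination step: average the register values R_j,
# then scale. alpha is the bias-correction constant (~0.39701 asymptotically).

def loglog_estimate(registers, alpha=0.39701):
    m = len(registers)
    avg_r = sum(registers) / m
    return alpha * m * 2 ** avg_r
```

With this arithmetic average, a single unusually large register value can pull the whole estimate up, which is exactly the weakness the next refinements address.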
The previous explanation is actually from the LogLog algorithm, which was then improved in the form of SuperLogLog and later HyperLogLog.
While experimenting with it, Philippe Flajolet and Marianne Durand noticed that the distribution of register values was usually skewed to the right and contained outliers that would throw off the whole estimate.
They tried a few remedies, such as discarding the 30% largest register values, and thus SuperLogLog was born.
But that didn't seem optimal, so they switched to the harmonic mean of all register values, since it handles extreme values much more gracefully.
The final estimation formula, and what we call HyperLogLog, is thus:

E = α_m · m² · (Σ_{j=1}^{m} 2^{-M_j})^{-1}

where M_j is the value stored in register j and α_m is a bias-correction constant that depends on the number of registers m.
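The estimator can be sketched like this (a minimal illustration, assuming the commonly used approximation α_m ≈ 0.7213 / (1 + 1.079/m) for larger m; real implementations add small- and large-range corrections that are omitted here):

```python
# A sketch of the HyperLogLog estimator: a harmonic mean of 2^{M_j}
# replaces the arithmetic average used by LogLog.

def hll_estimate(registers):
    m = len(registers)
    alpha_m = 0.7213 / (1 + 1.079 / m)   # common approximation for larger m
    harmonic = m / sum(2.0 ** -r for r in registers)
    return alpha_m * m * harmonic
```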
This module is a fork of a module by Optimizely, with some issues fixed. It implements HyperLogLog in Node and runs on the latest versions.
Redis also has an implementation of it that works quite well while keeping a small memory footprint.