Cardinality Estimation

Why?

Counting the unique elements of a finite set usually requires linear time and space: the more unique elements there are, the more time and memory it takes to keep an exact count of them.

A good example of this is a HashSet - we add an element to a Set (to ensure uniqueness), it gets hashed and stored:

  • In case it's the first one, it is added and that's it;

  • In case it isn't (a hash collision, a duplicate value, etc.) various things can happen; most common implementations use Linear Probing
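
To make the baseline concrete, here is a minimal sketch of exact counting with a built-in Set (the names are illustrative): the set has to hold every distinct value it has seen, which is exactly where the linear space cost comes from.

```typescript
// Exact distinct count: the Set must store every unique value,
// so memory grows linearly with the number of unique elements.
function exactCardinality(values: Iterable<string>): number {
  const seen = new Set<string>();
  for (const value of values) {
    seen.add(value); // duplicates are ignored; uniques take up memory
  }
  return seen.size;
}

console.log(exactCardinality(["a", "b", "a", "c"])); // 3
```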

A Space Problem

If we want to get an exact cardinality count we need linear space.

But what if you can't fit all the values in memory?

You're out of luck... or are you? (Spoiler: It's possible if you can sacrifice some accuracy)

What?

50% of random values in base 2 start with 1

25% of random values in base 2 start with 11

12.5% of random values in base 2 start with 111

Real life example:

If you told me that you flipped a coin and your longest run of heads was 3, I'd say you probably didn't flip it that many times; whereas if you told me you got heads 20 times in a row, you probably flipped it a lot.

Mathematics to the rescue:

Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit-pattern 0^{\rho-1}1 is more or less a likely indication that the cardinality n of S is at least 2^\rho.

source

Definition. p(x) is the number of 1s in the binary representation of x.

Definition. r(x) is the number of trailing 1s in the binary representation of x; equivalently, the (0-indexed) position of the rightmost 0 (e.g. r(10111 in base 2) = 3).

Definition. R(x) = 2^r(x)

So we can keep a summary (sketch) of all the elements in our set by bitwise ORing R(x) into it for every element x, and then use the position of the rightmost 0 in the sketch to estimate N.

Easy?

Not quite.

And we still need to introduce a bias correction factor. I won't get into it, but this research paper by Kirschenhofer, Prodinger, and Szpankowski does.

This value is .77351, and by dividing R(sketch) = 2^r(sketch) by it, we get a rough estimate of the cardinality of the set.
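
As an illustration, here is a minimal sketch of this single-sketch estimator, following the definitions above. The hash function (FNV-1a) and all names are my own stand-ins; any reasonably uniform hash would do.

```typescript
const PHI = 0.77351; // the bias correction factor from above

// FNV-1a: an illustrative 32-bit hash, not part of the original text.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

// r(x): the number of trailing 1s, i.e. the position of the rightmost 0.
function r(x: number): number {
  let count = 0;
  while ((x & 1) === 1) {
    count++;
    x >>>= 1;
  }
  return count;
}

let sketch = 0;

// Fold an element into the sketch: bitwise OR with R(x) = 2^r(x).
function add(value: string): void {
  sketch |= 1 << r(fnv1a(value)); // r > 31 is astronomically rare; uncapped here
}

// Estimate N as R(sketch) / phi = 2^r(sketch) / 0.77351.
function estimate(): number {
  return Math.pow(2, r(sketch)) / PHI;
}
```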

How?

Keep a sketch (a bitmap or a set of buckets/registers) with the values, use stochastic averaging, and combine the registers with a harmonic mean.

Stochastic what?

The HLL algorithm stores many estimators instead of one and averages the results. We could use a different hash function for each estimator, but that would be too expensive. Instead we use only one hash and split its result into different parts.

Still not getting it

For example, if your hashed value looks like 110100000100 and k = 3, you would use the 3 rightmost bits, 100, to tell you which register to update (100 in base 2 = 4) and from the remaining bits, 110100000, you would take the run of trailing zeroes (up to some max), which in this case is 5.
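
A minimal sketch of that split, hard-coding the k = 3 example from the paragraph above (the function names are mine):

```typescript
// Split one hash into a register index (k rightmost bits) and an
// estimator value (run of trailing zeroes in the remaining bits).
const k = 3;
const m = 1 << k; // 2^3 = 8 registers
const registers = new Uint8Array(m);

function trailingZeroes(x: number): number {
  if (x === 0) return 0; // guard; a max-width cap would apply in practice
  let count = 0;
  while ((x & 1) === 0) {
    count++;
    x >>>= 1;
  }
  return count;
}

function update(hash: number): void {
  const index = hash & (m - 1); // 3 rightmost bits, e.g. 100 -> register 4
  const rest = hash >>> k;      // remaining bits, e.g. 110100000
  registers[index] = Math.max(registers[index], trailingZeroes(rest));
}

update(0b110100000100); // stores 5 in register 4, as in the example
```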

A good visualisation of it

In order to compute the number of distinct values in the stream you would just take the average of all of the m buckets:

N ≈ α_m · m · 2^avg(R)

Here avg(R) = (R_1 + … + R_m) / m is the average of the values R in all the buckets, and α_m is a bias correction constant in the spirit of the 0.77351 above. But to get to the final HyperLogLog implementation, you still need...
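
For illustration, a sketch of this arithmetic-mean (LogLog-style) estimate over registers like the ones in the previous snippet. The constant 0.39701 is the asymptotic LogLog bias correction for large m; it is my assumption here, not something given in the text.

```typescript
// LogLog-style estimate: arithmetic mean of the register values,
// then alpha * m * 2^mean.
function loglogEstimate(registers: Uint8Array): number {
  const m = registers.length;
  const mean = registers.reduce((sum, value) => sum + value, 0) / m;
  // 0.39701: assumed asymptotic LogLog bias correction constant.
  return 0.39701 * m * Math.pow(2, mean);
}
```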

Harmonic mean

The previous explanation is actually from the LogLog algorithm, which was then improved in the form of SuperLogLog and later HyperLogLog.

While experimenting with it, Philippe Flajolet and Marianne Durand noticed that the distribution of register values was usually skewed to the right and that outliers were present, which threw off the whole count.

They tried some approaches, like discarding the 30% largest register values before averaging, and thus SuperLogLog was born.

But that still didn't seem optimal, so the next iteration (HyperLogLog, by Flajolet, Fusy, Gandouet, and Meunier) switched to the harmonic mean of all register values, which behaves quite nicely in the presence of extreme values.

The final result of the estimation formula, and what we call HyperLogLog, thus is:

E = α_m · m² · ( Σ_{j=1..m} 2^(-M_j) )^(-1)

where M_j is the value stored in bucket j and α_m ≈ 0.7213 / (1 + 1.079/m) is the bias correction constant.
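
A sketch of that harmonic-mean estimator, under the same assumptions as the snippets above:

```typescript
// HyperLogLog estimate: harmonic mean of 2^M_j across the m registers,
// written out as E = alpha_m * m^2 / sum(2^-M_j).
function hyperloglogEstimate(registers: Uint8Array): number {
  const m = registers.length;
  const alpha = 0.7213 / (1 + 1.079 / m); // bias correction for large m
  let sum = 0;
  for (const value of registers) {
    sum += Math.pow(2, -value); // 2^-M_j
  }
  return (alpha * m * m) / sum;
}
```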

Implementation

This module is a fork of a module by Optimizely, with some issues fixed. It implements HyperLogLog in Node and runs on the latest versions.

Also Redis has an implementation of it that works quite well, while having a small memory footprint.

Testing it

Data sources

I'm using some old logs from a NASA web server that are available here and here.

Each line contains a logged request with the following format:

n868740.ksc.nasa.gov - - [01/Aug/1995:12:19:45 -0400] "GET /images/MOSAIC-logosmall.gif HTTP/1.0" 200 363

We'll be using the request path as the unique key of this set.
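
As a small sketch (the variable names are mine), extracting that key from the sample line with the same lazy regexp the shell pipeline below uses:

```typescript
// Extract the request (the unique key of our set) from one log line.
const line =
  'n868740.ksc.nasa.gov - - [01/Aug/1995:12:19:45 -0400] ' +
  '"GET /images/MOSAIC-logosmall.gif HTTP/1.0" 200 363';

const match = line.match(/"GET .*?"/);
if (match) {
  console.log(match[0]); // "GET /images/MOSAIC-logosmall.gif HTTP/1.0"
}
```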

Getting a precise count

To get a precise count of the unique requests present in the data source I'm using the following:

$ cat corpus/access_log_Jul95 | egrep -o "\"GET .*?\"" | sort -u | wc -l 

Which outputs the value 22165 and (on my machine) takes around 8 seconds to execute.

Getting an estimate using HyperLogLog

For this I'm using a HyperLogLog with 2^16 one-byte registers (64KB), reading the corpus file line by line and matching each line against the same regexp seen above. In the end I get the count and the expected relative error, for demonstration purposes:

Count: 22144
Relative Error: 0.40625%
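
To make the demonstration concrete, here is a hedged end-to-end sketch: it repeats the hash and register logic from the snippets above in a tiny class and streams the corpus with Node's readline. Every name here (TinyHLL, add, count) is illustrative and is not the API of the module mentioned above; the small-range ("linear counting") correction from the HLL paper is also omitted.

```typescript
import * as fs from "fs";
import * as readline from "readline";

// FNV-1a, repeated from the earlier sketch (illustrative 32-bit hash).
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class TinyHLL {
  private registers: Uint8Array;
  constructor(private k: number) {
    this.registers = new Uint8Array(1 << k); // m = 2^k one-byte registers
  }
  add(value: string): void {
    const h = fnv1a(value);
    const index = h & ((1 << this.k) - 1); // k rightmost bits pick a register
    const rest = h >>> this.k;
    // Store the rank (trailing-zero run + 1), which the formula expects;
    // the rest === 0 edge case is glossed over in this sketch.
    let rank = 1;
    let x = rest;
    while (x !== 0 && (x & 1) === 0) {
      rank++;
      x >>>= 1;
    }
    this.registers[index] = Math.max(this.registers[index], rank);
  }
  count(): number {
    // Raw estimate: alpha_m * m^2 / sum(2^-M_j), no small-range correction.
    const m = this.registers.length;
    const alpha = 0.7213 / (1 + 1.079 / m);
    let sum = 0;
    for (const value of this.registers) sum += Math.pow(2, -value);
    return Math.round((alpha * m * m) / sum);
  }
}

async function main(): Promise<void> {
  const hll = new TinyHLL(16); // 2^16 one-byte registers = 64KB
  const rl = readline.createInterface({
    input: fs.createReadStream("corpus/access_log_Jul95"),
  });
  for await (const line of rl) {
    const match = line.match(/"GET .*?"/);
    if (match) hll.add(match[0]);
  }
  console.log(`Count: ${hll.count()}`);
  // Expected standard error of HLL is 1.04 / sqrt(m): 0.40625% for m = 2^16.
  console.log(`Relative Error: ${(104 / Math.sqrt(1 << 16)).toFixed(5)}%`);
}

main();
```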

So we achieved an actual error of ≈0.095% (21 / 22165), while using only 64KB of storage.

If we were using a HashSet with 128-bit hashes, the above set would need at least 128 × 22165 bits (~355KB) just for the hashes.

This might not seem like much, but we're working with a really small set of values and the HashSet already uses about 5.5× more memory.
