@avibryant
Last active June 22, 2020 06:39

Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.

This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of mycolumn, using NDV(mycolumn) got me an approximate answer in 27 seconds, whereas the exact answer using count(distinct mycolumn) took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.
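
For reference, the two queries being compared are just the following (using the same mytable / mycolumn names as the rest of this gist):

-- approximate, HyperLogLog-based
select ndv(mycolumn) from mytable

-- exact, but slow on high-cardinality columns
select count(distinct mycolumn) from mytable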

It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the fnv_hash function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.

HyperLogLog hashes each value it sees, and assigns each hash to a bucket based on its low-order bits. It's common to use 1024 buckets, so we can get the bucket by using a bitwise & with 1023:

select
  (fnv_hash(mycolumn) & 1023) bucket
from
  mytable

Here's some sample output:

+--------+
| bucket |
+--------+
| 837    |
| 566    |
| 642    |
| 674    |
| 486    |
+--------+
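
Since 1023 is 2^10 - 1, i.e. ten 1-bits, the & keeps just the low ten bits of the hash, so the bucket always falls in the range 0..1023 and we get our 1024 buckets. You can check this on any literal value (the string here is just for illustration):

select fnv_hash('hello') & 1023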

HyperLogLog's estimate is based on the number of leading zero bits in each hash; specifically, it needs that count plus one, for each bucket. The algorithm assumes an unsigned 32-bit hash, whereas fnv_hash gives us a signed 64-bit value, so we'll first mask with 2^32-1 (that's the 4294967295 below) to keep only the low 32 bits. We can then use the base-2 logarithm to find the position of the highest set bit, and subtracting that from 32 gives us the leading-zero count plus one. Let's call that value z:

select
  (fnv_hash(mycolumn) & 1023) bucket,
  (32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
  mytable

+--------+---+
| bucket | z |
+--------+---+
| 599    | 1 |
| 574    | 2 |
| 43     | 1 |
| 360    | 3 |
| 644    | 3 |
+--------+---+
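
To make the z calculation concrete, take a hypothetical hash whose low 32 bits are 163 (binary 10100011): the highest set bit is bit 7, so floor(log2(163)) is 7 and z = 32 - 7 = 25, which is the 24 leading zeros of that 32-bit word plus one. You can check the arithmetic with a constant:

select 32 - floor(log2(163)) z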

Actually, all we care about is the maximum value of z for each bucket, so we can group by bucket and take the max:

select
  (fnv_hash(mycolumn) & 1023) bucket,
  max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
  mytable
group by
  bucket

+--------+----+
| bucket | z  |
+--------+----+
| 283    | 22 |
| 977    | 17 |
| 630    | 16 |
| 208    | 15 |
| 315    | 20 |
+--------+----+

The estimate itself is derived from the harmonic mean of 2^z across the buckets. We can move the bucket creation into a nested subquery, and then use the outer query to sum up the buckets and multiply by some constants (this may seem a bit magical and arbitrary; maybe at some point I'll edit this gist to explain, but for now I'll just wave my hands and say Because Math).
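
Briefly, and hand-waving past the derivation: the standard HyperLogLog estimate with m buckets is

  estimate = alpha_m * m^2 / (sum over all m buckets of 2^(-z))

where alpha_1024 is roughly 0.721, which is where the 0.721 * 1024 * 1024 in the query below comes from. A bucket that never saw a row has z = 0 and so contributes 2^0 = 1 to the sum; since the group by only produces rows for non-empty buckets, the query adds 1024 - count(*) to account for them. (The original paper also has corrections for very small and very large cardinalities, which I'm ignoring here.)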

select
  floor((0.721 * 1024 * 1024) / (sum(pow(2, z * -1)) + 1024 - count(*))) estimate
from
  (select
     (fnv_hash(mycolumn) & 1023) bucket,
     max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
   from mytable
   group by bucket) bucket_values

And... it gives a reasonable-looking answer! It took around 46 seconds, roughly twice as slow as the builtin NDV, but still perfectly respectable.

+----------+
| estimate |
+----------+
| 94485825 |
+----------+
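
For what it's worth, with 1024 buckets HyperLogLog's typical relative error is about 1.04 / sqrt(1024) = 1.04 / 32, a bit over 3%, so landing within a few percent of the true count is about what you'd expect from this setup.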

Pa2017 commented Jul 13, 2017

I've just come across this post recently. It would be interesting to have more information about the structure and file format of the table used in the initial test.

I've run a few tests myself with a table in Parquet format, and although performance was in some cases better with NDV, I could not reproduce the huge difference in run times that avibryant reported.
