
# avibryant/impala-hll.md

Last active Jun 22, 2020

Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.

This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of `mycolumn`, using `NDV(mycolumn)` got me an approximate answer in 27 seconds, whereas the exact answer using `count(distinct mycolumn)` took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.

It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the `fnv_hash` function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.

HyperLogLog hashes each value it sees, and then assigns it to a bucket based on the low-order bits of the hash. It's common to use 1024 buckets, so we can get the bucket with a bitwise `&` against 1023:

```sql
select
  (fnv_hash(mycolumn) & 1023) bucket
from
  mytable
```

Here's some sample output:

```
+--------+
| bucket |
+--------+
| 837    |
| 566    |
| 642    |
| 674    |
| 486    |
+--------+
```
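To see the same bucketing step outside of SQL, here's a Python sketch. The `fnv1a_64` below is a plain textbook 64-bit FNV-1a implementation for illustration; Impala's `fnv_hash` may use a different FNV variant and serializes its argument its own way, so don't expect identical hash values.

```python
def fnv1a_64(data: bytes) -> int:
    # Standard 64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
    h = 0xcbf29ce484222325  # FNV-1a 64-bit offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # keep 64 bits
    return h

# Bucket assignment: the low 10 bits of the hash, i.e. `& 1023` for 1024 buckets.
bucket = fnv1a_64(b"some value") & 1023
```

Any reasonably well-mixed hash works here; the only requirement is that the low bits are uniformly distributed across the 1024 buckets.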

HyperLogLog's estimate is based on the number of leading zero bits in each hash; it needs this count, plus one, for each bucket. It assumes an unsigned 32-bit hash, whereas `fnv_hash` gives us a signed 64-bit value, so we'll first mask with `2^32-1`. The base 2 logarithm finds the position of the highest set bit, and since a 32-bit value with its highest set bit at position `floor(log2(h))` has `31 - floor(log2(h))` leading zeros, subtracting from 32 gives us the leading-zero count plus one in a single step. Let's call that value `z`:

```sql
select
  (fnv_hash(mycolumn) & 1023) bucket,
  (32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
  mytable
```

```
+--------+---+
| bucket | z |
+--------+---+
| 599    | 1 |
| 574    | 2 |
| 43     | 1 |
| 360    | 3 |
| 644    | 3 |
+--------+---+
```
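As a sanity check on that arithmetic (a Python sketch, not Impala code): for a nonzero 32-bit value, `32 - floor(log2(h))` and a direct leading-zero count plus one should always agree.

```python
import math

def z_from_log(h64: int) -> int:
    low = h64 & 0xFFFFFFFF  # mask to 32 bits, as in the SQL's `& 4294967295`
    return 32 - math.floor(math.log2(low))  # leading zeros + 1

def z_from_bits(h64: int) -> int:
    low = h64 & 0xFFFFFFFF
    return (32 - low.bit_length()) + 1  # count leading zeros directly, add one

# The two agree for any hash whose low 32 bits are nonzero:
for h in (1, 2, 0x80000000, 0xDEADBEEF, 0x123456789ABCDEF0):
    assert z_from_log(h) == z_from_bits(h)
```

One edge case worth knowing about: if the low 32 bits are all zero, `log2` is undefined, and the SQL above would produce a NULL for that row. With a decent hash that happens with probability 2^-32 per row, so it's unlikely to matter in practice.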

Actually, all we care about is the maximum value of this for each bucket, so we can group by bucket and use max:

```sql
select
  (fnv_hash(mycolumn) & 1023) bucket,
  max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
  mytable
group by
  bucket
```

```
+--------+----+
| bucket | z  |
+--------+----+
| 283    | 22 |
| 977    | 17 |
| 630    | 16 |
| 208    | 15 |
| 315    | 20 |
+--------+----+
```

The estimate itself is derived from the harmonic mean of these bucket values. We can move the bucket creation to a nested subquery, and then use the outer query to sum over the buckets and multiply by some constants (this may seem a bit magical and arbitrary; maybe at some point I'll edit this gist to explain, but for now I'll just wave my hands and say Because Math — briefly, 0.721 is approximately HyperLogLog's bias-correction constant for 1024 buckets, and the `1024 - count(*)` term makes each empty bucket contribute `2^0 = 1` to the harmonic sum).

```sql
select
  floor((0.721 * 1024 * 1024) / (sum(pow(2, z * -1)) + 1024 - count(*))) estimate
from
  (select
     (fnv_hash(mycolumn) & 1023) bucket,
     max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
   from mytable
   group by bucket) bucket_values
```

And... it gives a reasonable looking answer! In around 46 seconds: roughly 1.7x slower than the builtin, but still perfectly respectable.

```
+----------+
| estimate |
+----------+
| 94485825 |
+----------+
```
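If you want to convince yourself the formula works without spinning up Impala, here's the whole pipeline condensed into a Python sketch (mine, not from the gist): it feeds random 64-bit values in as stand-in hashes and checks that the estimate lands near the true distinct count. With 1024 buckets, HyperLogLog's expected relative error is about `1.04 / sqrt(1024)`, roughly 3%.

```python
import random

def hll_estimate(hashes, m=1024):
    # Keep max(leading zeros + 1) per bucket, mirroring the SQL's group by + max.
    buckets = {}
    for h in hashes:
        b = h & (m - 1)            # bucket: low 10 bits of the hash
        low = h & 0xFFFFFFFF       # 32-bit value to count leading zeros in
        z = 33 - low.bit_length()  # leading zeros + 1 (33 if low == 0)
        if z > buckets.get(b, 0):
            buckets[b] = z
    empty = m - len(buckets)       # empty buckets contribute 2**0 == 1 each
    harmonic_sum = sum(2.0 ** -z for z in buckets.values()) + empty
    return int(0.721 * m * m / harmonic_sum)

random.seed(42)  # deterministic demo
true_count = 100_000
hashes = [random.getrandbits(64) for _ in range(true_count)]
print(hll_estimate(hashes))  # should land within a few percent of 100000
```

Random 64-bit values collide with negligible probability at this scale, so the true distinct count is effectively 100,000; the printed estimate should sit comfortably within the expected error band.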

### Tagar commented Feb 5, 2016

~11.5 billion records in an Impala nested-array (.amft) Parquet table.

Using NDV: `select ndv(concat_ws('|', part_code, cast(activity_dollars as string))) from fact.amft` got 9,460,217 in 5.4 minutes.

Using exact count distinct: `select count(distinct part_code, activity_dollars) from fact.amft` got 9,298,947 in 5.1 minutes.

So in this case Impala ran faster with the exact count distinct.

P.S. I reran Test 1 again (after Test 2 was done) to eliminate disk caching effects etc.; it took 5.6 minutes, still slower. I think it may have something to do with the fact that it is a Parquet table, so the exact count doesn't have to read the whole data set: it just needs to read metadata from the row chunks / column groups. To use `ndv()` I had to pass it one argument, converting to a string, so Impala actually had to read the whole dataset.

### Pa2017 commented Jul 13, 2017

I've just come across this posting recently. It would be interesting to know the structure and format of the table used in the initial test. I've run a few tests myself with a table in Parquet format, and even though performance was in some cases better with NDV, I could not reproduce the huge difference in run times that avibryant reported.