
# avibryant/impala-hll.md

Last active Jun 22, 2020

Recent versions of Cloudera's Impala added NDV, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this number, in parallel, in a fixed amount of space.

This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of `mycolumn`, using `NDV(mycolumn)` got me an approximate answer in 27 seconds, whereas the exact answer using `count(distinct mycolumn)` took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.

It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the `fnv_hash` function, you don't actually need to use NDV; instead, you can build HyperLogLog yourself from mathematical primitives.

HyperLogLog hashes each value it sees, and then assigns it to a bucket based on the low-order bits of the hash. It's common to use 1024 buckets, so we can get the bucket with a bitwise `&` against 1023:

```sql
select
  (fnv_hash(mycolumn) & 1023) bucket
from
  mytable
```

Here's some sample output:

```
+--------+
| bucket |
+--------+
| 837    |
| 566    |
| 642    |
| 674    |
| 486    |
+--------+
```
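To see the same bucketing step outside of SQL, here's a Python sketch. The `fnv1a_64` below is a plain textbook 64-bit FNV-1a implementation for illustration; Impala's `fnv_hash` may use a different FNV variant and serializes its argument its own way, so don't expect identical hash values.

```python
def fnv1a_64(data: bytes) -> int:
    # Standard 64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime.
    h = 0xcbf29ce484222325  # FNV-1a 64-bit offset basis
    for b in data:
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # keep 64 bits
    return h

# Bucket assignment: the low 10 bits of the hash, i.e. `& 1023` for 1024 buckets.
bucket = fnv1a_64(b"some value") & 1023
```

Any reasonably well-mixed hash works here; the only requirement is that the low bits are uniformly distributed across the 1024 buckets.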

HyperLogLog's estimate is based on the number of leading zero bits in each hash; it needs this count, plus one, for each bucket. It assumes an unsigned 32-bit hash, whereas `fnv_hash` gives us a signed 64-bit value, so we'll first mask with `2^32-1`. The base 2 logarithm finds the position of the highest set bit, and since a 32-bit value with its highest set bit at position `floor(log2(h))` has `31 - floor(log2(h))` leading zeros, subtracting from 32 gives us the leading-zero count plus one in a single step. Let's call that value `z`:

```sql
select
  (fnv_hash(mycolumn) & 1023) bucket,
  (32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
  mytable
```

```
+--------+---+
| bucket | z |
+--------+---+
| 599    | 1 |
| 574    | 2 |
| 43     | 1 |
| 360    | 3 |
| 644    | 3 |
+--------+---+
```
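As a sanity check on that arithmetic (a Python sketch, not Impala code): for a nonzero 32-bit value, `32 - floor(log2(h))` and a direct leading-zero count plus one should always agree.

```python
import math

def z_from_log(h64: int) -> int:
    low = h64 & 0xFFFFFFFF  # mask to 32 bits, as in the SQL's `& 4294967295`
    return 32 - math.floor(math.log2(low))  # leading zeros + 1

def z_from_bits(h64: int) -> int:
    low = h64 & 0xFFFFFFFF
    return (32 - low.bit_length()) + 1  # count leading zeros directly, add one

# The two agree for any hash whose low 32 bits are nonzero:
for h in (1, 2, 0x80000000, 0xDEADBEEF, 0x123456789ABCDEF0):
    assert z_from_log(h) == z_from_bits(h)
```

One edge case worth knowing about: if the low 32 bits are all zero, `log2` is undefined, and the SQL above would produce a NULL for that row. With a decent hash that happens with probability 2^-32 per row, so it's unlikely to matter in practice.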

Actually, all we care about is the maximum value of this for each bucket, so we can group by bucket and use max:

```sql
select
  (fnv_hash(mycolumn) & 1023) bucket,
  max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
  mytable
group by
  bucket
```

```
+--------+----+
| bucket | z  |
+--------+----+
| 283    | 22 |
| 977    | 17 |
| 630    | 16 |
| 208    | 15 |
| 315    | 20 |
+--------+----+
```

The estimate itself is derived from the harmonic mean of these bucket values. We can move the bucket creation to a nested subquery, and then use the outer query to sum over the buckets and multiply by some constants (this may seem a bit magical and arbitrary; maybe at some point I'll edit this gist to explain, but for now I'll just wave my hands and say Because Math — briefly, 0.721 is approximately HyperLogLog's bias-correction constant for 1024 buckets, and the `1024 - count(*)` term makes each empty bucket contribute `2^0 = 1` to the harmonic sum).

```sql
select
  floor((0.721 * 1024 * 1024) / (sum(pow(2, z * -1)) + 1024 - count(*))) estimate
from
  (select
     (fnv_hash(mycolumn) & 1023) bucket,
     max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
   from mytable
   group by bucket) bucket_values
```

And... it gives a reasonable looking answer! In around 46 seconds: roughly 1.7x slower than the builtin, but still perfectly respectable.

```
+----------+
| estimate |
+----------+
| 94485825 |
+----------+
```
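If you want to convince yourself the formula works without spinning up Impala, here's the whole pipeline condensed into a Python sketch (mine, not from the gist): it feeds random 64-bit values in as stand-in hashes and checks that the estimate lands near the true distinct count. With 1024 buckets, HyperLogLog's expected relative error is about `1.04 / sqrt(1024)`, roughly 3%.

```python
import random

def hll_estimate(hashes, m=1024):
    # Keep max(leading zeros + 1) per bucket, mirroring the SQL's group by + max.
    buckets = {}
    for h in hashes:
        b = h & (m - 1)            # bucket: low 10 bits of the hash
        low = h & 0xFFFFFFFF       # 32-bit value to count leading zeros in
        z = 33 - low.bit_length()  # leading zeros + 1 (33 if low == 0)
        if z > buckets.get(b, 0):
            buckets[b] = z
    empty = m - len(buckets)       # empty buckets contribute 2**0 == 1 each
    harmonic_sum = sum(2.0 ** -z for z in buckets.values()) + empty
    return int(0.721 * m * m / harmonic_sum)

random.seed(42)  # deterministic demo
true_count = 100_000
hashes = [random.getrandbits(64) for _ in range(true_count)]
print(hll_estimate(hashes))  # should land within a few percent of 100000
```

Random 64-bit values collide with negligible probability at this scale, so the true distinct count is effectively 100,000; the printed estimate should sit comfortably within the expected error band.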

### Tagar commented Feb 5, 2016

~11.5 billion records in an Impala nested-array (.amft) Parquet table.

Using NDV: `select ndv(concat_ws('|', part_code, cast(activity_dollars as string))) from fact.amft` got 9,460,217 in 5.4 minutes.

Using exact count distinct: `select count(distinct part_code, activity_dollars) from fact.amft` got 9,298,947 in 5.1 minutes.

So in this case Impala ran faster with the exact count distinct.

P.S. I reran Test 1 again (after Test 2 was done) to eliminate disk caching effects etc.; it took 5.6 minutes, still slower. I think it may have something to do with the fact that it is a Parquet table, so the exact count doesn't have to read the whole data set: it just needs to read metadata from the row chunks / column groups. To use `ndv()` I had to pass it one argument, converting to a string, so Impala actually had to read the whole dataset.

### Pa2017 commented Jul 13, 2017

I've just come across this posting recently. It would be interesting to know the structure and format of the table used in the initial test. I've run a few tests myself with a table in Parquet format, and even though performance was in some cases better with NDV, I could not reproduce the huge difference in run times that avibryant reported.