# avibryant/impala-hll.md

Last active Jun 22, 2020

Recent versions of Cloudera's Impala added `NDV`, a "number of distinct values" aggregate function that uses the HyperLogLog algorithm to estimate this count, in parallel, in a fixed amount of space.

This can make a really, really big difference: in a large table I tested this on, which had roughly 100M unique values of `mycolumn`, using `NDV(mycolumn)` got me an approximate answer in 27 seconds, whereas the exact answer using `count(distinct mycolumn)` took ... well, I don't know how long, because I got tired of waiting for it after 45 minutes.

It's fun to note, though, that because of another recent addition to Impala's dialect of SQL, the `fnv_hash` function, you don't actually need to use `NDV`; instead, you can build HyperLogLog yourself from mathematical primitives.

HyperLogLog hashes each value it sees, and then assigns it to a bucket based on the low-order bits of the hash. It's common to use 1024 buckets, so we can get the bucket with a bitwise `&` with 1023:

```sql
select
(fnv_hash(mycolumn) & 1023) bucket
from
mytable
```

Here's some sample output:

```
+--------+
| bucket |
+--------+
| 837    |
| 566    |
| 642    |
| 674    |
| 486    |
+--------+
```
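The bucketing step is easy to sketch outside of SQL. Here's a Python version, using a standard 64-bit FNV-1a hash as a stand-in for Impala's `fnv_hash` (the two are not guaranteed to agree bit-for-bit, so treat this as illustrative only; `bucket` is a made-up helper name):

```python
def fnv1a_64(data: bytes) -> int:
    # Illustrative stand-in for Impala's fnv_hash: standard 64-bit FNV-1a.
    h = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV 64-bit prime
    return h

def bucket(value: str, num_buckets: int = 1024) -> int:
    # Low-order bits of the hash pick the bucket: h & 1023 for 1024 buckets.
    # num_buckets must be a power of two for the mask trick to work.
    return fnv1a_64(value.encode("utf-8")) & (num_buckets - 1)

print(bucket("hello"))  # some bucket index in [0, 1023]
```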

HyperLogLog's estimate is based on the number of leading zero bits in the hash; it needs this value (plus one) for each bucket. It assumes an unsigned 32-bit hash, whereas `fnv_hash` gives us a signed 64-bit value, so we'll first mask with `2^32-1`. We can use the base-2 logarithm to find the position of the highest set bit, then subtract from 32 to get the number of leading zeros plus one. Let's call that value `z`:

```sql
select
(fnv_hash(mycolumn) & 1023) bucket,
(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
mytable
```

```
+--------+---+
| bucket | z |
+--------+---+
| 599    | 1 |
| 574    | 2 |
| 43     | 1 |
| 360    | 3 |
| 644    | 3 |
+--------+---+
```
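To see that `32 - floor(log2(h & 4294967295))` really is the leading-zero count plus one, here's the same computation in Python next to a bit-twiddling version (a sketch; `rank32` and `rank32_bits` are names I've made up, not anything from Impala):

```python
import math

def rank32(h: int) -> int:
    # The SQL expression: mask to the low 32 bits, then 32 - floor(log2(x)).
    x = h & 0xFFFFFFFF
    return 32 - math.floor(math.log2(x))  # like the SQL, undefined when x == 0

def rank32_bits(h: int) -> int:
    # Equivalent: leading zeros of the 32-bit value, plus one.
    x = h & 0xFFFFFFFF
    leading_zeros = 32 - x.bit_length()
    return leading_zeros + 1

assert rank32(0x80000000) == 1   # top bit set: no leading zeros, rank 1
assert rank32(0x00000001) == 32  # only bottom bit set: 31 leading zeros, rank 32
```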

Actually, all we care about is the maximum value of this for each bucket, so we can group by bucket and use `max`:

```sql
select
(fnv_hash(mycolumn) & 1023) bucket,
max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from
mytable
group by
bucket
```

```
+--------+----+
| bucket | z  |
+--------+----+
| 283    | 22 |
| 977    | 17 |
| 630    | 16 |
| 208    | 15 |
| 315    | 20 |
+--------+----+
```

The estimate itself is derived from the harmonic mean of these bucket values. We can move the bucket creation into a nested subquery, and then use the outer query to sum over the buckets and multiply by some constants (this may seem a bit magical and arbitrary; maybe at some point I'll edit this gist to explain, but for now I'll just wave my hands and say Because Math).

```sql
select
floor((0.721 * 1024 * 1024) / (sum(pow(2, z * -1)) + 1024 - count(*))) estimate
from
(select
(fnv_hash(mycolumn) & 1023) bucket,
max(32 - floor(log2(fnv_hash(mycolumn) & 4294967295))) z
from mytable
group by bucket) bucket_values
```

And... it gives a reasonable-looking answer! In around 46 seconds: not quite twice as slow as the builtin, but still perfectly respectable.

```
+----------+
| estimate |
+----------+
| 94485825 |
+----------+
```
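The whole pipeline is compact enough to check in Python. The sketch below mirrors the SQL: the `+ 1024 - count(*)` term is there because every empty bucket contributes `2^0 = 1` to the harmonic sum, and the outer query only sees non-empty buckets. Here `hll_estimate` is a made-up name, the constant 0.721 is the same bias-correction factor as in the SQL, and random 32-bit integers stand in for `fnv_hash` output:

```python
import math
import random

def hll_estimate(ranks_by_bucket: dict, m: int = 1024) -> int:
    # Harmonic sum over non-empty buckets; each of the (m - count) empty
    # buckets contributes 2^0 = 1, the "+ 1024 - count(*)" term in the SQL.
    s = sum(2.0 ** -z for z in ranks_by_bucket.values())
    s += m - len(ranks_by_bucket)
    return math.floor(0.721 * m * m / s)

# Sanity check: ~100,000 distinct random 32-bit "hashes".
random.seed(0)
buckets = {}
for _ in range(100_000):
    h = random.getrandbits(32)
    if h == 0:
        continue  # log2(0) is undefined in the SQL too; vanishingly rare
    b = h & 1023
    z = 33 - h.bit_length()  # 32 - floor(log2(h)), i.e. leading zeros + 1
    buckets[b] = max(buckets.get(b, 0), z)

print(hll_estimate(buckets))  # should land near 100,000 (a few percent off)
```

With 1024 buckets the expected relative error is around 3%, which is consistent with the gap between the estimate above (94,485,825) and the true count of roughly 100M.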

### Tagar commented Feb 5, 2016

~11.5 billion records in an Impala nested-array (`.amft`) Parquet table:

1. Using NDV:
   `select ndv(concat_ws('|', part_code, cast(activity_dollars as string))) from fact.amft`
   got: 9,460,217
   in: 5.4 minutes.
2. Using exact count distinct:
   `select count(distinct part_code, activity_dollars) from fact.amft`
   got: 9,298,947
   in: 5.1 minutes.

So for Impala, the above case ran faster with an exact count distinct.
P.S. Reran Test 1 again (after Test 2 was done) to eliminate disk-caching effects etc.; got 5.6 minutes - still slower.

I think it might have something to do with the fact that it is a Parquet table, so it doesn't have to read the whole data set.