Skip to content

Instantly share code, notes, and snippets.

@miguno
Last active November 9, 2022 14:58
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save miguno/f4882d0b40f6ef2368c9705a9a277793 to your computer and use it in GitHub Desktop.
Save miguno/f4882d0b40f6ef2368c9705a9a277793 to your computer and use it in GitHub Desktop.
HyperLogLog (HLL) experiments: register bits, register counts, estimation error, error bounds
# Runs HLL experiments, using randomly generated strings as input data.
#
# Requirements
# ============
#
# # https://github.com/ascv/HyperLogLog
# $ pip install HLL
#
#
# How to use
# ==========
#
# $ python3 hll.py
#
#
# References
# ==========
# https://research.neustar.biz/2012/12/17/hll-intersections-2/
# http://dsinpractice.com/2015/09/07/counting-unique-items-fast-unions-and-intersections/
# https://blog.process-one.net/cardinality-estimation/
# https://blog.demofox.org/2015/03/09/hyperloglog-estimate-unique-value-counts-like-the-pros/
# http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
# https://github.com/ascv/HyperLogLog/issues/9
from HLL import HyperLogLog
import math
import random
import string
expCardinality = 100000
# From: https://stackoverflow.com/a/2257449/1743580
def random_string(size=16, chars=string.ascii_uppercase + string.digits):
return ''.join(random.choice(chars) for _ in range(size))
# This is a cryptographically more secure variant of `random_string`.
# From: https://stackoverflow.com/a/2257449/1743580
def secure_random_string(size=16, chars=string.ascii_uppercase + string.digits):
return ''.join(random.SystemRandom().choice(chars) for _ in range(size))
print("HyperLogLog (HLL) experiments")
print("=============================")
print("True (expected) cardinality: %d" % expCardinality)
print(" 68% 95% 99% ")
print("+---------------+----------------+--------------------+--------------+--------+--------+--------+")
print("| Register bits | Register count | Estim. cardinality | Estim. error | σ | 2σ | 3σ | Within 2σ?")
print("+---------------+----------------+--------------------+--------------+--------+--------+--------+")
# log2m is the variable name used by the HLL implementation of https://github.com/addthis/stream-lib
# = k (HLL python module)
# = number of HLL register bits (aka bit indices)
for log2m in range(2, 17):
hll = HyperLogLog(log2m)
registers = math.pow(2, log2m)
for x in range(expCardinality):
hll.add(random_string())
estimationError = math.fabs((expCardinality - hll.cardinality()) / expCardinality * 1.0)
# standard error = sigma = σ = 68% of cases (assumption: normal distribution)
sigma = (1.04 / math.sqrt(registers))
# 2σ = 95% of cases
sigma2 = 2 * sigma
# 3σ = 99% of cases
sigma3 = 3 * sigma
withinErrorBounds = ""
if estimationError < sigma2:
withinErrorBounds = "✔"
print("| %2d | %8d | %9.1f | %3.4f | %3.4f | %3.4f | %3.4f | %s" % (log2m, registers, hll.cardinality(), estimationError, sigma, sigma2, sigma3, withinErrorBounds))
print("+---------------+----------------+--------------------+--------------+--------+--------+--------+")
@miguno
Copy link
Author

miguno commented Sep 3, 2018

Example output:

$ python3 hll.py
HyperLogLog (HLL) experiments
=============================
True (expected) cardinality: 100000
                                                                         65%      95%      99%
+---------------+----------------+--------------------+--------------+--------+--------+--------+
| Register bits | Register count | Estim. cardinality | Estim. error |      σ |     2σ |     3σ | Within 2σ?
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|             2 |              4 |            88245.8 |       0.1175 | 0.5200 | 1.0400 | 1.5600 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|             3 |              8 |           156811.9 |       0.5681 | 0.3677 | 0.7354 | 1.1031 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|             4 |             16 |           114267.6 |       0.1427 | 0.2600 | 0.5200 | 0.7800 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|             5 |             32 |            88714.8 |       0.1129 | 0.1838 | 0.3677 | 0.5515 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|             6 |             64 |           125665.7 |       0.2567 | 0.1300 | 0.2600 | 0.3900 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|             7 |            128 |           101963.7 |       0.0196 | 0.0919 | 0.1838 | 0.2758 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|             8 |            256 |           106862.4 |       0.0686 | 0.0650 | 0.1300 | 0.1950 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|             9 |            512 |           101763.8 |       0.0176 | 0.0460 | 0.0919 | 0.1379 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|            10 |           1024 |            97779.9 |       0.0222 | 0.0325 | 0.0650 | 0.0975 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|            11 |           2048 |            96757.6 |       0.0324 | 0.0230 | 0.0460 | 0.0689 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|            12 |           4096 |            99293.5 |       0.0071 | 0.0163 | 0.0325 | 0.0488 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|            13 |           8192 |            99920.5 |       0.0008 | 0.0115 | 0.0230 | 0.0345 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|            14 |          16384 |            99864.1 |       0.0014 | 0.0081 | 0.0163 | 0.0244 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|            15 |          32768 |            99744.0 |       0.0026 | 0.0057 | 0.0115 | 0.0172 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+
|            16 |          65536 |           100121.9 |       0.0012 | 0.0041 | 0.0081 | 0.0122 | ✔
+---------------+----------------+--------------------+--------------+--------+--------+--------+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment