Minimizing the redundancy in Golomb Coded Sets
A Golomb Coded Set (GCS) is a set of N distinct integers within the range [0..MN-1], whose order does not matter, stored by applying a Golomb-Rice coder with parameter B to the differences between subsequent elements after sorting. When the integers are hashes of elements from a set, this is an efficient encoding of a probabilistic data structure with false positive rate 1/M. It is asymptotically 1/log(2) (around 1.44) times more compact than Bloom filters, but harder to update or query.
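To make the construction concrete, here is a minimal Python sketch of such an encoder. The bit-list representation and the convention of coding the first element as its difference from zero are illustrative choices, not part of the definition:

```python
def gcs_encode(items, B):
    """Golomb-Rice encode a set of integers as a list of bits.

    Each difference d from the previous (sorted) element is coded as
    floor(d / 2^B) one-bits, a zero bit, and then the low B bits of d.
    The first element is coded as its difference from 0 (one common
    convention; the choice does not matter for the analysis below).
    """
    bits = []
    prev = 0
    for value in sorted(items):
        d = value - prev
        prev = value
        bits.extend([1] * (d >> B))    # unary part: floor(d / 2^B)
        bits.append(0)                 # terminator of the unary part
        bits.extend((d >> i) & 1 for i in range(B - 1, -1, -1))  # low B bits
    return bits

# Example: differences 5, 12, 25 with B = 3 cost 4 + 5 + 7 = 16 bits.
print(len(gcs_encode({5, 17, 42}, B=3)))  # 16
```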
The question we try to answer in this document is which combinations of B and M minimize the resulting size of the filter. We find that the common suggestion B = log2(M) is not optimal.
Size of a Golomb Coded Set
To determine the size of a Golomb coding of a set, we model the differences between subsequent elements after sorting as a geometric distribution with p = 1 / M. This is a good approximation if the size of the set is large enough.
The Golomb-Rice encoding of a single difference d consists of:
- A unary encoding of l = floor(d / 2^B): l one-bits followed by a zero bit.
- The lower B bits of d.
In other words, its total length is B + 1 + floor(d / 2^B). To compute the expected value of this expression, we start with B + 1, and add 1 for each k ≥ 1 for which d ≥ 2^B·k. In a geometric distribution with p = 1/M, P(d ≥ 2^B·k) = (1 - 1/M)^(2^B·k). Thus, the total expected size becomes B + 1 + ∑(k=1...∞) (1 - 1/M)^(2^B·k). The sum is a geometric series, and adding it to B + 1 gives a total of B + 1 / (1 - (1 - 1/M)^(2^B)). It can be further approximated by B + 1 / (1 - e^(-2^B/M)).
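Both the exact and the approximated formula are easy to evaluate directly; the function names below are for illustration only:

```python
import math

def expected_size(B, M):
    """Expected code length per difference under the geometric model."""
    return B + 1 / (1 - (1 - 1 / M) ** (2 ** B))

def expected_size_approx(B, M):
    """Same, with (1 - 1/M)^(2^B) replaced by exp(-2^B / M)."""
    return B + 1 / (1 - math.exp(-2 ** B / M))

print(expected_size(20, 2 ** 20))         # ≈ 21.58197
print(expected_size_approx(20, 2 ** 20))  # ≈ 21.58197
```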
For M = 2^20 and B = 20, this formula gives 21.58197, while a simulation of a GCS with N = 10000 gives us 21.58187. Most of the inaccuracy is due to the fact that the differences between subsequent elements in a sorted set are not exactly independent samples from a single distribution.
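A sketch of such a simulation, assuming it averages the Golomb-Rice code lengths of the N-1 gaps of a single random set (the exact setup behind the quoted figure may differ):

```python
import random

def simulate(N, M, B, seed=42):
    """Mean Golomb-Rice length of the N-1 gaps of one random N-subset
    of [0..M*N-1]."""
    rng = random.Random(seed)
    values = sorted(rng.sample(range(M * N), N))
    diffs = [b - a for a, b in zip(values, values[1:])]
    return sum((d >> B) + 1 + B for d in diffs) / len(diffs)

print(simulate(10000, 2 ** 20, 20))  # ≈ 21.58, varies slightly with the seed
```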
Minimizing the GCS size
For a given value M (so a given false positive rate), we want to minimize the size of the GCS.
In other words, we need to see where the derivative of the expression above with respect to B (treated as a continuous variable) is 0. That derivative is 1 - log(2)·2^B·e^(2^B/M) / (M·(e^(2^B/M) - 1)^2), and it is zero when log(2)·2^B·e^(2^B/M) = M·(e^(2^B/M) - 1)^2. If we substitute r = 2^B/M, we find that log(2)·r·e^r = (e^r - 1)^2 must hold. Dividing both sides by 2e^r turns this into 1 + log(√2)·r = cosh(r), which has the solution r = 2^B/M = 0.6679416.
In other words, we find that the set size is minimized when B = log2(M) - 0.5822061, or M = 1.497137·2^B. These numbers are only exact for the approximation made above, but simulating the size for actual random sets confirms that they are close to optimal.
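The transcendental equation above has no closed-form solution, but a simple bisection recovers all three constants:

```python
import math

def solve_r(lo=1e-9, hi=2.0):
    """Bisect f(r) = cosh(r) - 1 - log(sqrt(2))*r on (0, 2].

    f is negative just above 0 and positive at 2, with a single root
    between, so plain bisection converges to it.
    """
    f = lambda r: math.cosh(r) - 1 - math.log(math.sqrt(2)) * r
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (lo, mid) if f(mid) > 0 else (mid, hi)
    return (lo + hi) / 2

r = solve_r()
print(r)                 # 0.6679416..., the optimal 2^B / M
print(math.log2(1 / r))  # 0.5822061..., the offset below log2(M)
print(1 / r)             # 1.4971375..., the optimal M / 2^B
```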
Of course, B can only be chosen to be an integer. To find the range of M values for which a given B value is optimal, we need to find the switchover point: the M for which, for a given B, the parameters B and B+1 result in the same set size. If we solve B + 1 / (1 - e^(-2^B/M)) = (B + 1) + 1 / (1 - e^(-2^(B+1)/M)), we find M = 2^B / log((1 + √5)/2). This means a given B value is optimal in the range M = 1.039043·2^B ... 2.078087·2^B.
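A quick numerical check of the switchover, reusing the approximate size formula from above:

```python
import math

size = lambda B, M: B + 1 / (1 - math.exp(-2 ** B / M))

# At M = 2^B / log((1 + sqrt(5)) / 2), parameters B and B+1 should tie.
B = 20
M = 2 ** B / math.log((1 + math.sqrt(5)) / 2)
print(M / 2 ** B)                   # 2.0780869..., upper end of B's range
print(size(B, M), size(B + 1, M))   # both ≈ 22.618034 (they tie)
```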
Surprisingly, M = 2^B itself is outside that range. This means that if M is chosen as 2^Q with integer Q, the optimal value for B is not Q but Q-1.
Compared with entropy
The next question is how close this comes to the theoretical optimum.
To answer it, we must determine how much entropy there is in a set of N uniformly random integers in the range [0..MN-1].
The total number of such possible sets, taking into account that the order does not matter, is simply (MN choose N). This can be written as (MN-N+1)(MN-N+2)...(MN) / N!. When M is large enough, this can be approximated by (MN)^N / N!. Using Stirling's approximation for N! gives us (eM)^N / √(2πN).
As each of these sets is equally likely, information theory tells us an encoding for a randomly selected set must use at least log2((eM)^N / √(2πN)) bits, or log2((eM)^N / √(2πN))/N bits per element. This equals log2(eM) - log2(2πN)/(2N). For large N, this expression approaches log2(eM).
For practical numbers, this approximation is pretty good. When picking M = 2^20, it gives us 21.44270 bits per element, while the exact value of log2(MN choose N)/N for N = 10000 is 21.44190.
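Both figures can be reproduced directly; using lgamma to evaluate the binomial coefficient avoids huge integers, at the cost of a little floating point error (an implementation shortcut assumed here):

```python
import math

def entropy_exact(N, M):
    """log2(C(M*N, N)) / N, computed via log-gamma."""
    mn = M * N
    return (math.lgamma(mn + 1) - math.lgamma(N + 1)
            - math.lgamma(mn - N + 1)) / (N * math.log(2))

print(entropy_exact(10000, 2 ** 20))  # ≈ 21.44190
print(math.log2(math.e * 2 ** 20))    # ≈ 21.44270, the large-N limit
```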
If we look at the ratio between our size formula B + 1 / (1 - e^(-2^B/M)) and the entropy log2(eM), evaluated at M = 1.497137·2^B, we get (B + 2.052389)/(B + 2.024901). For B = 20 that value is 1.001248, which means the approximate GCS size is less than 0.125% larger than theoretically possible. In contrast, for M = 2^B the redundancy is around 0.65%.
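The same two redundancy figures, computed from the formulas above:

```python
import math

size = lambda B, M: B + 1 / (1 - math.exp(-2 ** B / M))
entropy = lambda M: math.log2(math.e * M)

B = 20
M_opt = 1.497137 * 2 ** B
print(size(B, M_opt) / entropy(M_opt))      # ≈ 1.001248 (optimal M)
print(size(B, 2 ** B) / entropy(2 ** B))    # ≈ 1.006495 (M = 2^B)
```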
When you have freedom to vary the false positive rate for your GCS, picking M = 1.497137·2^Q for an integer Q will give you the best bang for the buck. In this case, use B = Q and you end up with a set only slightly larger than theoretically possible for that false positive rate.
When you don't have that freedom, use B = floor(log2(M) - 0.055256).
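Put as code (the function name is illustrative):

```python
import math

def best_B(M):
    """Size-minimizing Rice parameter for false positive rate 1/M,
    using the constant log2(1.039043) ≈ 0.055256 derived above."""
    return math.floor(math.log2(M) - 0.055256)

print(best_B(2 ** 20))  # 19: for M = 2^Q the optimum is Q - 1, as noted
```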