Further minimizing the redundancy in Neutrino filters (BIP158 / Golomb sets)
Last year saw the introduction of BIP158 and the Neutrino wallet, which use Golomb-coded sets to encode very efficiently which items appear in Bitcoin blocks.
The basic idea in BIP158 is that we have N items of interest (such as distinct scriptPubKeys) in a block, which we hash to integers in the interval [0,F-1] for some range F that is much larger than N. These are basically short hashes, but with an uneven number of bits. This set of hashes can be queried to check whether a candidate scriptPubKey might be in a block, by computing the hash for the candidate and checking if that integer is in the set. The false positive rate is approximately N/F, and can be therefore tuned by making F larger or smaller.
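As a concrete illustration of this query model, here is a minimal Python sketch. It is not the actual BIP158 construction (which uses a keyed SipHash and a Golomb-coded wire format); `hash_to_range`, `build_filter`, and `might_contain` are made-up names, and SHA-256 simply stands in for any uniform hash:

```python
import bisect
import hashlib

def hash_to_range(item: bytes, F: int) -> int:
    # Stand-in for a uniform hash into [0, F-1]; BIP158 itself uses a
    # keyed SipHash, but any uniform hash gives the same statistics.
    h = hashlib.sha256(item).digest()
    return int.from_bytes(h[:8], "big") % F

def build_filter(items, F):
    # Conceptually the filter is just the sorted list of short hashes
    # (duplicates retained, as in BIP158).
    return sorted(hash_to_range(it, F) for it in items)

def might_contain(filt, item: bytes, F: int) -> bool:
    # No false negatives; false positives at a rate of roughly N/F.
    h = hash_to_range(item, F)
    i = bisect.bisect_left(filt, h)
    return i < len(filt) and filt[i] == h

# N = 100 items with M = 1000 gives F = 100000, fp rate ~ 0.1%.
items = [b"spk%d" % i for i in range(100)]
F = 100 * 1000
filt = build_filter(items, F)
assert all(might_contain(filt, it, F) for it in items)
```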
Golomb-Rice comes into play by providing a very size-efficient way of encoding the list of hashes (by sorting them and encoding just the differences with an entropy-optimized code), which is important for minimizing over-the-wire bandwidth and storage space. Some notation:
- N: the original number of items.
- F: the hash range; items are each independently hashed to an integer in the interval [0,F-1] using a uniform hash function.
- M: the multiplier used to determine the range F = MN; importantly, M directly determines the false positive rate.
- P: the bit parameter of the Golomb-Rice code.
Pieter Wuille wrote up an excellent guideline on how to choose the parameters in order to approach the theoretical entropy limit for a given N and F. I'm going to expand on that theme and work out some of the curious quirks of these filters when M and P become small.
Note: I'm not recommending anyone change their implementation, this is just to see how far we can 'push the envelope'.
BIP158 filters are not sets but rather sorted lists. In particular, distinct items can have their hashes collide! Including the duplicate elements is wasteful due to how the filters are used, and it throws off the analysis somewhat:
- This means the number of distinct hashes is not always N but on average slightly lower.
- As a consequence, the false positive rate is not actually 1/M (N/F), but slightly lower.
- Subsequent items can be separated by delta=0, and so the deltas between elements in such a sorted list are not exactly given by a geometric distribution, not even in the large-N limit.
If duplicate hashes are removed, so that we have a set, then:
- In the large-N limit, the deltas between elements will be given by a geometric distribution.
- There will on average be fewer items to encode.
- The Golomb coding can be adjusted as it will never be necessary to encode a zero difference between subsequent hashes.
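These claims are easy to check empirically. A quick simulation sketch, using uniform random draws in place of real item hashes, with M deliberately small so that collisions are common:

```python
import random

# Model: N uniform hashes in [0, F-1], F = M*N, with a small M = 2.
N, M = 100_000, 2
F = M * N
rng = random.Random(1)
hashes = sorted(rng.randrange(F) for _ in range(N))

# The sorted list contains zero deltas (duplicate hashes) ...
deltas = [b - a for a, b in zip(hashes, hashes[1:])]
assert 0 in deltas

# ... while after de-duplication there are fewer items and every
# delta is at least 1, so d = 0 never needs to be encoded.
dedup = sorted(set(hashes))
assert len(dedup) < N
assert all(b - a >= 1 for a, b in zip(dedup, dedup[1:]))
```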
These effects are practically unimportant beyond about M ≈ 1500 (P ≈ 10), so for the P ≈ 20 filters in the BIP158 specification they make essentially no difference (only about 1 ppm in the final size). However, let's say we're interested in optimizing things as much as possible. For small M, we can also take into account some other effects:
- For small M values, the theoretical limit of entropy falls noticeably below log2(eM).
- M doesn't need to be an integer -- only F and N need to be integers. For large N, any M value is valid and implementors only need to agree on how to round F = MN to an integer.
- When comparing the Golomb filter size to the entropy limit, we can optimize for the difference in size (measured in bits) or the ratio of sizes.
I have mentioned the "large-N" limit several times above. To be clear, this limit is taken with fixed M. I am going to assume this limit for the remainder of this document; however, I will obtain exact expressions for any M.
For a candidate item not in the original set, its hash has a 1-1/F chance of missing any given one of the N item hashes; since the N items are independent, the true negative rate is (1-1/F)^N, and the false positive rate is 1-(1-1/(MN))^N. In the large-N limit, the false positive rate (which we'll call x) is
x = 1 - exp(-1/M)
which is, as expected, about 1/M in the large-M limit, and about 1 in the small-M limit.
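A quick numerical check of both limits (`fp_rate` is a made-up helper name; pass N to get the exact finite-N expression):

```python
import math

def fp_rate(M, N=None):
    # Large-N false positive rate x = 1 - exp(-1/M); pass N for the
    # exact finite-N value 1 - (1 - 1/(M*N))**N.
    if N is None:
        return 1 - math.exp(-1 / M)
    return 1 - (1 - 1 / (M * N)) ** N

# Large M: x is very close to 1/M ...
assert abs(fp_rate(1000.0) - 1 / 1000) < 1e-6
# ... while for small M, x approaches 1.
assert fp_rate(0.05) > 0.999
```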
Likewise the expected number of items in the set (i.e., distinct hashes) is
N' = xF = xMN
The proper value for the entropy of a set of random numbers (obtained by sorting a list of random numbers and removing duplicates) is not quite as straightforward as the logarithm of a binomial coefficient, since the number of set items is itself random. In the large-N limit we can however obtain asymptotic expressions. Each possible hash value (in the range 0..F-1) can be set (probability x) or unset (probability 1-x). For large N, these F hash sites are effectively independent of each other, and so the set entropy is the sum of per-site entropies:
S = F(- x log2 x - (1-x) log2 (1-x)) + O(log terms in N,M,F)
Hence the entropy per item in the large-N limit is given by
S/N = M(- x log2 x - (1-x) log2 (1-x)) [Equation 1]
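Equation 1 evaluates in a couple of lines (`entropy_per_item` is a made-up name). As claimed above, it approaches log2(eM) for large M and sits noticeably below it for small M:

```python
import math

def entropy_per_item(M):
    # Equation 1: large-N set entropy per original item, in bits.
    x = 1 - math.exp(-1 / M)
    return M * (-x * math.log2(x) - (1 - x) * math.log2(1 - x))

# Within 0.01 bit of log2(e*M) at M = 2000 ...
assert abs(entropy_per_item(2000.0) - math.log2(math.e * 2000)) < 0.01
# ... but clearly below log2(e*M) at M = 1.
assert entropy_per_item(1.0) < math.log2(math.e * 1.0)
```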
The entropy of the sorted list (with duplicates retained) is a higher value, but not important for this discussion. (Exercise for the reader: find an expression for the entropy of the sorted list in the large-N limit.)
Expected size of Golomb-Rice filter for sorted list
In order to figure out the average Golomb encoding length, we need to know the distribution of differences. In the large N limit, the distribution of differences d for the sorted list is a tweaked geometric distribution (with a defect at d=0):
P(0) = 1 - Mx
P(d) = M x^2 (1-x)^(d-1), for d > 0
Using the above distribution we can calculate the filter size. In BIP158 we use a Golomb-Rice encoding of bit parameter P that starts encoding at d = 0:
- P+1 bits for 0 <= d < 2^P
- P+2 bits for 2^P <= d < 2*2^P
- P+3 bits for 2*2^P <= d < 3*2^P
- ...
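For reference, this codeword structure (quotient d/2^P in unary, a terminating zero bit, then P plain remainder bits) can be sketched as follows; `rice_bits` and `rice_encode` are made-up names, and the ones-then-zero unary convention is the usual one:

```python
def rice_bits(d, P):
    # Codeword length: (d >> P) + 1 unary bits plus P remainder bits.
    return (d >> P) + 1 + P

def rice_encode(d, P):
    # Quotient in unary (q ones, then a terminating zero), followed by
    # the remainder as P plain bits.
    q, r = d >> P, d & ((1 << P) - 1)
    return "1" * q + "0" + (format(r, "b").zfill(P) if P else "")

assert rice_encode(5, 2) == "1001"   # q = 1, r = 1
assert rice_bits(5, 2) == len(rice_encode(5, 2))
# With P = 0, the codeword for d is just d ones and a zero:
assert rice_encode(3, 0) == "1110"
```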
avg bits per list item = P + 1 + (Mx/(1-x)) * ( 1/(1 - (1-x)^(2^P)) - 1 ) [Equation 2]
Since the number of list items is N, this quantity is also the filter bitsize divided by N.
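Equation 2 can be evaluated directly (`list_bits_per_item` is a made-up helper name). For BIP158-scale parameters (M = 784931, P = 19), it lands just above the entropy limit:

```python
import math

def list_bits_per_item(M, P):
    # Equation 2: expected Golomb-Rice bits per sorted-list item.
    x = 1 - math.exp(-1 / M)
    y = (1 - x) ** (2 ** P)
    return P + 1 + (M * x / (1 - x)) * (1 / (1 - y) - 1)

# About 21.05 bits per item, versus an entropy limit of
# log2(e*M) ~ 21.02 bits.
assert 21.0 < list_bits_per_item(784931.0, 19) < 21.1
assert list_bits_per_item(784931.0, 19) > math.log2(math.e * 784931)
```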
Expected size of Golomb-Rice filter for set
For the set the distribution is purely geometric:
P(0) = 0
P(d) = x(1-x)^(d-1), for d > 0
For the set case, we use a Golomb-Rice encoding for which the first encodeable point is d=1. Taking the above geometric distribution,
avg bits per set item = P + 1/(1 - (1-x)^(2^P))
(filter bit size)/N = xM * ( P + 1/(1 - (1-x)^(2^P)) ) [Equation 3]
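Equation 3 in the same style (`set_bits_per_item` is a made-up name), for comparison with Equation 2:

```python
import math

def set_bits_per_item(M, P):
    # Equation 3: filter bits divided by N for the de-duplicated set,
    # using the shifted code whose first encodable point is d = 1.
    x = 1 - math.exp(-1 / M)
    y = (1 - x) ** (2 ** P)
    return x * M * (P + 1 / (1 - y))

# At BIP158-scale parameters the saving over the sorted list is tiny:
assert 21.0 < set_bits_per_item(784931.0, 19) < 21.1
# At P = 0 the formula collapses to exactly M bits per item: the set
# code degenerates to the raw F-bit occupancy bitmap, F/N = M.
assert abs(set_bits_per_item(2.0, 0) - 2.0) < 1e-9
```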
Putting it together
In the following figures I've plotted Equations 1,2,3 from above: 1) the entropy limit, 2) the size of a BIP158 sorted-list filter, and 3) the size of an optimized Golomb set filter. The first figure shows the raw bits per original element, i.e., the total number of bits in the filter divided by N.
The second figure shows the filter sizes in ratio to the theoretical limit. It's interesting to see that for every M value, there exists an encoding that is at most 4% longer than the theoretical entropic limit! And for every P value, we can find an M where the encoding is less than 1% above the theoretical limit. Another curiosity is that for M = 1/log(2), around 1.44, the Golomb-Rice encoding for P=0 becomes perfect. This is because this situation corresponds to x = 0.5, meaning that the set cannot be compressed; this is then perfectly represented by a Golomb-Rice filter of P=0, as the encoding for P=0 turns out to be exactly the same as the bit pattern.
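The M = 1/log(2) sweet spot is easy to verify numerically: the fill rate comes out at exactly x = 1/2, and Equation 3 with P = 0 then coincides with the Equation 1 entropy limit:

```python
import math

M = 1 / math.log(2)               # ~ 1.4427
x = 1 - math.exp(-1 / M)          # exactly 0.5: half of all hash sites filled
entropy = M * (-x * math.log2(x) - (1 - x) * math.log2(1 - x))  # Equation 1
golomb = x * M * (0 + 1 / (1 - (1 - x) ** (2 ** 0)))            # Equation 3, P = 0
assert abs(x - 0.5) < 1e-12
assert abs(golomb - entropy) < 1e-12   # incompressible: ratio exactly 1
```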
Finally, I've tabulated, for each P value, the M value that gives the optimal ratio of filter size to the theoretical entropy limit. In the infinite-M limit this converges to Wuille's result; however, it does so surprisingly slowly.
Another interesting thing is that for M < 1.44, the filter's hashspace has more filled spaces than empty spaces. The distances between empty spaces are now geometrically distributed, which means it's possible to efficiently encode these 'over-full' sets, using Golomb codes to encode the distances between empty spaces. In a future gist I'm going to explore how a hyper-efficient block filter could be built around this concept.
| P | Optimal M/2^P | Optimal filter size / entropy limit |
|---|---|---|