Minimizing the redundancy in Golomb Coded Sets
A Golomb Coded Set (GCS) is a set of N distinct integers within the range [0..MN-1], whose order does not matter, and stored by applying a Golomb-Rice coder with parameter B to the differences between subsequent elements after sorting. When the integers are hashes of elements from a set, this is an efficient encoding of a probabilistic data structure with false positive rate 1/M. It is asymptotically 1 / log(2) (around 1.44) times more compact than Bloom filters, but harder to update or query.
The question we try to answer in this document is what combinations of B and M minimize the resulting size of the filter, and find that the common suggestion B = log2(M) is not optimal.
Size of a Golomb Coded Set
To determine the size of a Golomb coding of a set, we model the differences between subsequent elements after sorting as a geometric distribution with p = 1 / M. This is a good approximation if the size of the set is large enou