markblundeberg/golomb_part2.md

## golomb_part2.md

      
    Raw
  

              golomb_part2.md
            
          
    Progressive knockout block filters

In my last gist, I described how the BIP158 block filters are technically not quite optimal, though in an inconsequential way since the false positive rate has been made so low.
In this document I'm going to describe how it's possible to significantly beat the bandwidth requirements of BIP158 while also achieving a lower overall false positive rate. The idea is to prepare multiple independent filters per block, each of which which has a higher false positive rate. Practically this can give roughly a factor of two improvement over BIP158, but at the cost of additional complexity.
The problem

The BIP158 false positive rate was chosen with a particular size of wallet in mind -- somewhere around 1000 items (addresses) to watch in each block. There is a trade off between the overall size of the filter (which must be downloaded for every block), vs the overall size of the false positive blocks (that get occasionally downloaded).
The BIP158 parameter M=784931 and P=19 means that every item in a block adds 21 bits of information to the block filter, and the false positive rate is about 1/784931 per wallet item. The density of items varies depending on how the filters are made, and what kinds of transactions are in the blocks. But very roughly, the filters end up being around 2% of blocksize, meaning the average block item represents ~1050 bits of block space.
With a wallet of around 1000 items, the cumulative false positive rate is about 1000/784931 = 0.13% per block, i.e., one in every 785 blocks is falsely downloaded, making an average download 2.13%. The false positive rate could be increased by a factor of two, which would drop filter sizes by 5% but double the rate of false positive blocks to 0.25%, giving an average download of 2.15%. Conversely the false positive rate could be decreased by a factor of 2, which would increase filter size by 5% but only decrease the rate of false positive blocks to 0.06%, giving an average download of 2.16%. Both directions increase the average amount of data that needs to be downloaded -- this is the tradeoff.
For a wallet of 100 items, the block false positive rate is much lower than with 1000 items, and the average download is 2.01% of blockchain. So the the optimal filter size should have been a bit lower -- using M=98113 with only 18 bits per item, or 1.81%.
On the flip side, with a wallet of 10000 items, the BIP158 settings are also not ideal as we end up downloading about one in every 79 blocks as false positive, which means overall we download around 3% of the blockchain!
The idea

Instead of choosing a 1-in-a-784931 false positive rate, why not just make two filters of a one-in-6132 false positive rate? Most users could just download the first filter, and only if one of their wallet items happens to match the block, they then download the second filter to double check the items that matched on the first filter. Such a filter is much smaller (only 13.5 bits per blockitem!) and it would be rare to download the second filter (doubling the size). And to top it all off, the combined false positive rate per wallet item would be one in 36 million!
But if two filters are good, why not three filters with a one-in-191 rate (9.5 bits per blockitem)? Or four, five ...? To answer this question we can look at the entropy of a block filter in comparison to its discrimination power which is the false positive rate, plotted below. (As my previous gist explained, we need to be careful when making block filters of small M: the false positive rate is not 1/M but rather x = 1 - exp(-1/M), and we can optimize the encoding by performing a proper Golomb set encoding (removing duplicate items and not wasting code space on zero-deltas; and for x>0.5 instead encoding distance between non-elements). As shown in the gist, Golomb codes for any false positive rate can get within 4% of the theoretical entropy limit, and always within 1% if the false positive rate is rounded properly. So we can use the entropy as a good proxy for the expected overall filter size.)
I've called this idea 'progressive knockout block filters' (though a better term is welcome). Basically you knock out a fraction of your wallet items for each filter downloaded. After enough filters, you have either knocked out all the items (and don't need to download the block), or you can gain enough confidence that the block contains items of interest, and so download the block. By only conditionally downloading subsequent filters, we get significant bandwidth savings.

How many filters do I need to download?

For the remaining analysis I'm going to assume that the wallet was never active, and we are entirely concerned with eliminating false positives with as little information as possible.
For a single wallet item, each filter has a false positive rate of x meaning that the 'depth' K (i.e., number of filters that must be downloaded to find the first excluding filter) is a geometric distribution starting from 1, with parameter (1-x). Its probability mass function and cumulative distribution function are:
P_1(K = k) = x^(k-1) . (1-x)
P_1(K <= k) = 1 - x^k

For W wallet items, each item follows its own geometric distribution however we need to continue downloading entire filters until the last item gets extinguished. Thus, the distribution for the entire collection is governed by the maximum 'depth' out of all the wallet items. It's a well known fact that cumulative distribution function of the maximum of independent random variables can be found by simply multiplying the individual cumulative distribution functions. Thus, the required depth K for W wallet items has a non-geometric distribution but which is easy enough to describe:
P_W(K <= k) = (1 - x^k)^W

Below I've plotted the mean depth K for various M values, and various wallet sizes:

So how much do I have to download?

We know the size per filter (basically the same as the entropy), and the average number of filters that need to be downloaded. Multiplying these quantities together^* gives us the average number of bits that we need to download before all wallet items have been extinguished. Plotted as a figure:

Somewhat surprisingly, for almost all the wallet sizes under consideration we can do better than the 20.5 bits per item found in the BIP158 filters, by using smaller M values. We only start to need more bits when above about 10000 items, which is anyway where the BIP158 filters start to break down with a giant false positive rate and requiring massively more bits due to false positive downloads.
It's especially worthwhile to consider the 10-100 range of wallet sizes, which will represent most new wallets (with a gaplimit of 20). As can be seen, we can practically expect to reach a 7-8 bits per blockitem level.
^{* -- In fact when multiplying two random numbers, the mean of the product will also depend on the statistical variations. However we've been assuming the infinite-N limit so that the filter size will have no significant variance.}
What's going on for M<1 ???

Something strange happens in the above graph as we move to M values below 1 or so. Apparently, we can get even better and better compactness -- better than any large M value! I don't think this is an false artifact of the analysis, but rather represents a real (but impractical) savings.
Note that for M < 1.44, the false-positive rate actually becomes higher than 0.5, i.e., the filter sets are more filled than empty! Despite the abysmal false positive rates, however, the filters have become very small, with only fractions of a bit required per block item, per filter (see first figure). Apparently the entropy decreases fast enough to overcome the awful false positive rate.
As hinted at the end of my previous gist, Golomb codes also work just as well for these over-full sets -- instead of encoding the distance between items, you just encode the distance between non-items. The filter hashspaces can also be concatenated together, to achieve a very efficient encoding with negligible overhead due to crossing filter boundaries.
Getting real

Lowering M value does unfortunately come at a cost -- dozens, hundreds, thousands, or even more filters need to be downloaded to knock out all wallet items, as can be seen in the second figure. As an extreme example, we could use M = 0.5332 (x = 0.8467). Each filter only needs about 0.33 bits per block item, and for a typical wallet of 100 items we would only need to download 31.7 filters on average (10.5 bits per blockitem) to exclude a given block completely. However, a node would want to precompute 125 filters to reach a cumulative false positive rate of 1e-9. This only means 41 bits per blockitem needs to be indexed, but computing 125 filters is a lot of hashing!
Besides the additional computation work, multiple filters also have overhead. Each filter query has a certain level of network overhead. Also, commitment schemes will likely need to have separable commitments of the filters, which adds additional costs for inclusion proofs.
Realistic optimum: M=192

With blocks having an item count on the order of one thousand, M=192 seems to be the best choice (x = 0.00519, Golomb-encoded at P=7 with 9.03 bits per item, at only 0.3% above entropy limit), as it strikes a good balance between filter size and overhead. A 100-item wallet only needs to download 12.7 bits per blockitem on average (~1.4 filters), and nodes should compute and store 4 filters deep (36 bits per item) to reach a ~1e-9 depth. These 4 filters can be merkelized for commitement purposes. For 1000 block items the filter size is ~1130 bytes, and if a sha256 merkle inclusion proof is needed then this adds an additional 64 bytes overhead.
Whereas with BIP158 filters the 100-item wallet took 2% of block size to filter, now we're looking at 1.2%. That's even better than the 1.8% we were looking at if BIP158 had been optimized for this wallet size!
Only with much larger blocks (>> 1000 items) it may be worth considering the choice of M=12 (x=0.07996, Golomb-encoded at P=3 with 4.85 bits per block item, at only 0.5% more than entropy limit), but the savings are not that great. A 100-item wallet only needs to download 12.25 bits per blockitem on average (~2.5 filters), and nodes should compute and index 8 filters deep (39 bits per item) to reach a ~1e-9 depth. The filters are smaller (606 bytes per 1000 items) but this is offset by a larger merkle proof (96 bytes). Some significant savings are had for the <20 wallet item range and 150-1000 wallet item range, but only if the merkle proofs have become insignificant enough in size.
When moving into the >> 10000 block item territory, then M=3 (x=0.2835, 2.60 bits per block item at P=1) with 16 filters becomes reasonable as well.
Dealing with true-positive blocks

Until now I've discussed the problem entirely as if all block matches are false positives, but that's only true for empty wallets.
In general, it only makes sense to keep downloading filters if this minimizes the expected resource usage. Depending on what the initial likelihood is for an wallet item to appear in a block, we should download more or fewer matching filters until giving up and simply downloading the full block. This will vary from item to item. Taking an HD wallet for example:

Old used addresses are very unlikely to be used again.
Current utxos are unlikely to be spent unless the wallet makes a transaction involving them -- in which case it's extremely likely that the affected wallet items will appear in the next block.
Some addresses are expected to receive a payment soon, like the first few unused addresses.

Correctly minimizing the amount of downloaded data means carefully weighing the cost of downloading new filters vs the possible benefit that further filters can provide providing enough evidence to not need to download a full block. For each new filter downloaded, the probability of still-remaining items being in the block needs to be updated with Bayesian inferences. After an item has passed through enough filters, it will be highly likely that it is actually in the block and hence the block must be downloaded, meaning that additional downloading of filters is a waste. The correct approach seems to be:

Before downloading filter, combine individual wallet items' probabilities to obtain calculate P(block) - the probability that the block needs to be downloaded.
Calculate C_F - the expected cost of downloading more filters until the items are knocked out, assuming the block contains no items.
Calculate C_B - the expected cost of downloading the entire block.
Before downloading a new filters, consider the cost relative to just aborting and downloading the whole block:

Download more filters: costs C_F + C_B×P(block). In other words, downloading more filters will cost C_F, but there is a chance that we need to download the full block anyway.
Download the block: costs C_B.


Every time a filter is downloaded and processed, use Bayes' rule to update the probabilities for each remaining wallet item.

More sophistication in this theme is also possible. For example if you broadcast a transaction involving four addresses, the network conditions might be such that you're 95% sure that it appears in next block. Downloading a filter makes sense. If all four addresses persist past the first filter then that may be entirely sufficient evidence to make it worthwhile to download the full block. In contrast if one of the four addresses gets knocked out, then the transaction is definitely not in the new block.
Going further

It may be possible to save even further bandwidth for initial wallet scans by preparing coarse filters that span over runs of blocks. The filters may be somewhat smaller than the sum of the individual per-block filters, due to address reuse (if every address gets used twice during the run of blocks, but never in the same block, then the filter size will be halved). Some research on realistic blockchain data, and address reuse, can help to figure out if this is worthwhile or not, and what sort of timespan makes sense.