@gavinandresen
Last active June 7, 2021 17:45
Using a cuckoo filter for fast 'unspent?' lookups

A "Cuckoo Filter" is a nifty data structure invented a few years ago; read the paper for details.

Like a Bloom filter, you insert items and then can later ask "does this item exist in the filter?" You'll get either a definite "no" or "yes, probably" with some false-positive error rate. Cuckoo filters have two advantages over Bloom filters:

  1. They are more space efficient at false positive rates less than about 0.03.
  2. You can delete items from them without affecting the false positive rate at all.
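
To make the mechanics concrete, here is a minimal cuckoo-filter sketch in Python. It is an illustration only, not code from the paper or from any Bitcoin implementation, and the table size, bucket size, fingerprint width, and kick limit are arbitrary choices picked for readability:

```python
# Minimal partial-key cuckoo filter (illustrative sketch; all parameters are
# arbitrary choices, not values from the paper or from any Bitcoin code).
import hashlib
import random

NUM_BUCKETS = 1 << 16   # power of two, so the XOR trick below stays in range
BUCKET_SIZE = 4         # fingerprints per bucket
FP_BITS = 24            # fingerprint width in bits
MAX_KICKS = 500         # evictions to attempt before declaring the filter full

buckets = [[] for _ in range(NUM_BUCKETS)]

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "little")

def _fingerprint(item: bytes) -> int:
    # Non-zero FP_BITS-bit fingerprint of the item.
    return (_h(b"fp" + item) % ((1 << FP_BITS) - 1)) + 1

def _alt(i: int, fp: int) -> int:
    # Partial-key cuckoo hashing: the alternate bucket is derived from the
    # current bucket and the fingerprint alone, so entries can be moved
    # without knowing the original item.
    return (i ^ _h(fp.to_bytes(4, "little"))) % NUM_BUCKETS

def insert(item: bytes) -> bool:
    fp = _fingerprint(item)
    i1 = _h(item) % NUM_BUCKETS
    i2 = _alt(i1, fp)
    for i in (i1, i2):
        if len(buckets[i]) < BUCKET_SIZE:
            buckets[i].append(fp)
            return True
    # Both buckets full: evict a random resident fingerprint to its alternate
    # bucket and repeat. (A real implementation would stash the final victim
    # somewhere instead of just reporting failure.)
    i = random.choice((i1, i2))
    for _ in range(MAX_KICKS):
        victim = random.randrange(BUCKET_SIZE)
        fp, buckets[i][victim] = buckets[i][victim], fp
        i = _alt(i, fp)
        if len(buckets[i]) < BUCKET_SIZE:
            buckets[i].append(fp)
            return True
    return False  # filter is effectively full

def contains(item: bytes) -> bool:
    fp = _fingerprint(item)
    i1 = _h(item) % NUM_BUCKETS
    return fp in buckets[i1] or fp in buckets[_alt(i1, fp)]

def delete(item: bytes) -> bool:
    fp = _fingerprint(item)
    i1 = _h(item) % NUM_BUCKETS
    for i in (i1, _alt(i1, fp)):
        if fp in buckets[i]:
            buckets[i].remove(fp)
            return True
    return False
```

One caveat worth noting: deletion is only safe for items that really were inserted, since deleting a never-inserted item could remove another item's colliding fingerprint.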

It seems to me that an in-memory cuckoo filter could work really well to keep track of Bitcoin's "unspent transaction output set" (UTXO set), to make transaction (or block) validation as fast as possible using minimal memory or disk.

Recall that Bitcoin transactions (other than coinbase transactions that create new coins) have inputs that refer to unspent outputs. An input refers to a previous output by giving the id of the transaction containing that output (a 32-byte hash) and the output's position within that transaction (a 4-byte integer index).

So the idea is to add those 36-byte (prevTxid,n) outpoints to a cuckoo filter as outputs are created, remove them as outputs are spent, and look them up in the filter when checking that a new transaction's inputs are unspent. Insertion, deletion, and lookup for a cuckoo filter are all constant-time operations.
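
As a sketch of how those pieces might fit together: the Tx/TxIn shapes, the outpoint_key helper, and the connect_block/inputs_probably_unspent functions below are hypothetical stand-ins (not Bitcoin Core code), and SetBackedFilter is an exact stand-in exposing the same insert/contains/delete interface as the cuckoo-filter sketch above so the snippet runs on its own:

```python
# Hypothetical sketch of the bookkeeping described above; the Tx/TxIn shapes
# and function names are stand-ins, not Bitcoin Core types or APIs.
from dataclasses import dataclass
from typing import List

@dataclass
class TxIn:
    prev_txid: bytes        # 32-byte txid of the transaction whose output is spent
    prev_n: int             # index of that output

@dataclass
class Tx:
    txid: bytes
    vin: List[TxIn]
    n_outputs: int
    is_coinbase: bool = False

class SetBackedFilter:
    """Exact stand-in with the same insert/contains/delete interface as the
    cuckoo-filter sketch above, so this snippet runs on its own."""
    def __init__(self):
        self._s = set()
    def insert(self, x): self._s.add(x)
    def contains(self, x): return x in self._s
    def delete(self, x): self._s.discard(x)

def outpoint_key(txid: bytes, n: int) -> bytes:
    # The 36-byte (prevTxid, n) key: 32-byte txid plus a 4-byte output index.
    return txid + n.to_bytes(4, "little")

def connect_block(utxo_filter, txs: List[Tx]) -> None:
    # Apply a block at the tip: spent outpoints leave the filter, new ones enter.
    for tx in txs:
        if not tx.is_coinbase:
            for txin in tx.vin:
                utxo_filter.delete(outpoint_key(txin.prev_txid, txin.prev_n))
        for n in range(tx.n_outputs):
            utxo_filter.insert(outpoint_key(tx.txid, n))

def inputs_probably_unspent(utxo_filter, tx: Tx) -> bool:
    # Definite "no" or "yes, probably" for every input of a new transaction.
    return all(utxo_filter.contains(outpoint_key(i.prev_txid, i.prev_n))
               for i in tx.vin)
```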

Chain re-orgs at the tip are easily supported: if block B replaces block A at the tip, just delete all of the outputs A created, re-add the outputs A spent, and then process B as normal.
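
A sketch of that re-org step, reusing outpoint_key, the Tx shape, and connect_block from the snippet above; where the list of outpoints block A spent comes from (per-block undo data) is assumed, since the filter itself can't reconstruct it:

```python
def disconnect_block(utxo_filter, txs):
    # Undo block A at the tip: remove the outputs A created and restore the
    # outputs A spent. The spent outpoints come from undo data kept for recent
    # blocks; that bookkeeping is assumed here, not shown.
    for tx in reversed(txs):
        for n in range(tx.n_outputs):
            utxo_filter.delete(outpoint_key(tx.txid, n))
        if not tx.is_coinbase:
            for txin in tx.vin:
                utxo_filter.insert(outpoint_key(txin.prev_txid, txin.prev_n))

# A re-org where block B replaces block A at the tip is then:
#   disconnect_block(utxo_filter, block_a_txs)
#   connect_block(utxo_filter, block_b_txs)
```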

Transaction validation needs more information than "are the inputs unspent?": it also needs the previous transactions themselves. The cuckoo filter doesn't store that information, so it has to come from somewhere. The most elegant solution would be a network protocol where transaction messages include all the information a cuckoo-filtering peer needs to validate the transaction; wallets already store all of that supporting transaction information.

The size of the cuckoo filter depends on the false positive rate and the number of items in it. The Bitcoin UTXO set currently holds about 60 million entries and has been doubling roughly every two years. A false positive rate of less than one-in-a-million is achievable with 24 bits (3 bytes) per item, which works out to a 180-megabyte cuckoo filter. Allocate a gigabyte of memory and the filter should work nicely for the next six or seven years.
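
A back-of-the-envelope check of those numbers; the false-positive figure uses the usual cuckoo-filter estimate of roughly 2b/2^f for f-bit fingerprints and b-entry buckets, and the bucket size of 4 is an assumption rather than something stated above:

```python
# Back-of-the-envelope check of the sizing claims. The false-positive estimate
# uses the usual cuckoo-filter figure of about 2*b / 2**f for f-bit fingerprints
# and buckets of b entries; bucket_size = 4 is an assumption.
fingerprint_bits = 24                            # 3 bytes per item, as in the text
bucket_size = 4                                  # assumed; a common cuckoo-filter choice
print(2 * bucket_size / 2 ** fingerprint_bits)   # ~4.8e-7, under one in a million

utxo_items = 60_000_000
print(utxo_items * 3 / 1e6)                      # ~180 (megabytes) for today's UTXO set

# A 1 GB filter at 3 bytes per entry has room for roughly 333 million outpoints,
# before allowing for the load factor a cuckoo table sustains in practice.
print(1e9 / 3)
```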

What if the filter fills up? The easiest solution is to create a new, empty filter and add/delete/check using both of them; but using n filters in sequence multiplies the false positive rate by roughly n (e.g. two filters, each with a one-in-a-million false positive rate, give a combined two-in-a-million rate). Another alternative is to rebuild the filter from the UTXO set as a single, bigger filter.
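
A quick check of the sequence-of-filters claim: if each filter answers independently with false-positive rate p, the combined rate is 1 - (1 - p)^n, which for small p is very nearly n times p:

```python
# If each of n filters answers independently with false-positive rate p, the
# combined rate is 1 - (1 - p)**n, which for small p is very nearly n * p.
p = 1e-6
for n in (2, 4, 8):
    print(n, 1 - (1 - p) ** n, n * p)    # e.g. n=2: ~2e-6 either way
```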

Getting probabilistic with transaction validation sounds like a really bad idea, and maybe it is; hardware is cheap, so just storing the entire UTXO set in memory indexed by (prevTxid,n) is the right thing to do if you've got a big, fast server with lots of RAM.

But let's run the thought experiment: what if miners were running with cuckoo filters with 1-in-a-million false positive rates? What's the worst that could happen?

First, assume the code is not stupid and randomizes the cuckoo filter, so every machine will have different false-positive behavior. An attacker could try to re-spend an already-spent transaction output: they'd create a transaction re-spending that output, send it to a miner, and hope the already-spent outpoint triggers a false positive in the miner's filter. Most of the time it won't, and the transaction will be rejected as invalid. But if it does, the miner will add it to their block...

... but if the miner manages to mine that block, the rest of the network will reject it (because false positives on one machine are independent of false positives on another, the chance that a supermajority of the network accepts the invalid transaction is essentially zero). Bad for the miner, but safe for the integrity of the network.

The only way this attack makes sense is if the attacker is another miner. The attacker would try to spam the miner with transactions that spend non-existent outputs, knowing that eventually one will collide with a real, unspent output. The attack can be mitigated by banning peers that send you invalid transactions, since an attacker would have to send roughly 700,000 of them to have an even chance of one slipping through the cuckoo filter, or by using a bigger cuckoo filter with a smaller false-positive rate. Miners should be able to figure out the risk/reward tradeoff (where the risk is their blocks getting rejected and the reward is using less expensive hardware or validating transactions and blocks faster).
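
The spam numbers follow from the same formula: with a one-in-a-million false-positive rate, an attacker sending n invalid-input transactions gets at least one past the filter with probability 1 - (1 - p)^n:

```python
import math

# Probability that at least one of n spam transactions with invalid inputs
# slips past a filter with per-lookup false-positive rate p.
p = 1e-6
for n in (100_000, 500_000, 700_000, 1_000_000):
    print(n, 1 - (1 - p) ** n)           # ~0.10, ~0.39, ~0.50, ~0.63
print(math.log(2) / p)                   # ~693,000 attempts for a 50% chance
```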

Another thought experiment: a UTXO cuckoo filter could be used to increase 0-confirmation-transaction security for a wallet that sees (but doesn't store) every block and transaction. Double-spends could be detected with probability dependent on the amount of memory dedicated to the cuckoo filter.


kf106 commented Nov 23, 2017

At what point exactly would a miner risk wasting its hash rate on a block containing an invalid transaction, though? There has to be a trade-off between the cost of mining and the cost of validating the block contents, and I'm not sure you've fully taken that into account.


jprupp commented Dec 1, 2017

I would recommend keeping the complete UTXO set on disk as well, and only looking an output up there when a received transaction's input passes the cuckoo filter. This reduces the number of expensive IO-bound lookups against the disk-based set to just the ones that pass the filter, preventing a denial-of-service attack caused by a flood of transactions spending invalid outputs. Of course, this is not really much of an issue for most modern servers, which can perfectly well store the entire UTXO set in memory.
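
A sketch of that two-tier lookup, reusing the outpoint_key helper and filter interface from the earlier snippets; disk_lookup_outpoint is a hypothetical stand-in for the on-disk UTXO store:

```python
def fetch_utxo_if_present(utxo_filter, disk_lookup_outpoint, prev_txid, prev_n):
    # Cheap in-memory pre-check: a definite "no" from the filter never touches
    # the disk, so a flood of transactions spending nonexistent outputs stays
    # CPU/RAM-bound instead of triggering IO-bound lookups.
    key = outpoint_key(prev_txid, prev_n)
    if not utxo_filter.contains(key):
        return None                      # definitely not an unspent output
    # "Yes, probably": confirm against the authoritative on-disk UTXO set,
    # which also catches the rare false positive.
    return disk_lookup_outpoint(prev_txid, prev_n)
```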
