gavinandresen/Backwards_IBD.md

## Backwards_IBD.md

      
    Raw
  

              Backwards_IBD.md
            
          
    This is a scheme for discovering the current "unspent transaction output set"; it could be used to bootstrap a node that stores the UTXO set using a Cuckoo Filter.
It has the following nice properties:

It starts with an empty UTXO set, but quickly discovers a large enough subset to validate a majority of new transactions (TODO: empirically measure how many blocks to be able to validate 50+% of new tx), and incrementally gets better over time.
It uses a simple data structure to keep track of progress (TODO: empirically calculate maximum size of the spent set processing tip to genesis).
It doesn't require any consensus-level change or require miners to maintain any extra data structures or indexes.

However, it does require downloading every single block.
The idea of downloading the chain of block headers, then "operate in SPV mode while downloading the full chain in the background" is almost as old as Bitcoin. This is a proposal for how, exactly, to do that. It is very possible somebody else has proposed this same algorithm in the past (and I'll be happy to link to previous work that I am unaware of).
The idea is to process blocks in reverse order. The outputs of the very last transaction in the block chain are obviously in the UTXO set, because they can't have been spent.
The inputs of the very last transaction are added to a "spent_set".
Then, for the second-to-last and subsequent transactions in the chain, if an output is not in the spent_set it is added to the utxo set. If it is in the spent_set, remove it from the spent_set (we've seen the output's creation and spend, so can safely forget about it). The transaction's inputs are all added to spent_set.
The algorithm in pseudo code:
Start with:
an empty utxo_set that is a set of (transaction_id, output_index) pairs
an empty spent_set that is a set of (transaction_id, output_index) pairs
a list of valid block headers from a genesis block to current block tip

For each block from the tip of the chain down to the genesis block:
    Fetch all the transactions in the block (the full block data)
    For each transaction in the block from the last to the second (not the coinbase transaction):
        For each input of the transaction:
            add the input's previous output (txid,n) to the spent_set
        
        For each output of the transaction:
            if the output is not in spent_set: add it to the utxo_set
            if the output is in spent_set: remove it from spent_set

    Coinbase transactions (that mature after 100 blocks) have to be handled slightly differently:
    If the block's height is greater than 101:
        For each output of coinbase transaction at block_height-100: # TODO: triple-check for off-by-one errors here!
            if the output is not in spent_set: add it to the utxo_set
            if it is in spent_set: remove it from spent_set

In practice, most transaction outputs are spent quickly (within a few dozen blocks), so the spent_set remains small compared to the set of all spent transaction outputs (TODO: figure out how small).
After processing all blocks of a valid chain the spent_set will be empty and the utxo_set will have a complete set of unspent transaction outputs.
Processing new blocks while catching up

Adding a new block to the tip of the chain while in the middle of catching up is almost the same algorithm:
    For each input of the transaction:
        if the input is in the utxo_set: remove it from the utxo_set
        if the input is not in the utxo_set: add the input's previous output (txid,n) to the spent_set
        
    For each output of the transaction:
        if the output is not in spent_set: add it to the utxo_set
        if the output is in spent_set: remove it from spent_set

That is, spent outputs are either removed from the utxo_set (which means we're caught up to the block that created them) or added to the spent_set (because we haven't caught up yet).  After catching up, an input not in the utxo_set is a validation error.
Removing a block at the tip of the chain due to a chain re-org is just the inverse (new outputs are removed from the UTXO set, and inputs are removed from the spent_set and added to the UTXO set).
Validating new transactions or blocks while catching up

Many transactions can be validated before the entire chain has been processed-- if all of a transaction's inputs were created after the earliest block downloaded so far, it can be fully validated. Transactions that cannot be fully validated can be ignored until the rest of the network validates them and they appear in a new block.
New blocks can be at least partially validated, but could contain transactions with inputs that refer to blocks early in the chain that have not yet been downloaded (TODO: what is the 'median oldest txout' in recent blocks?).