gavinandresen/UTXO_BitVector.md

## UTXO_BitVector.md

      
    Raw
  

              UTXO_BitVector.md
            
          
    Half-baked thoughts exploring a different way of implementing a fully-validating BCH node.
The idea is to shift the storage of full transaction data to wallets, and explore how little data a fully validating
node could store. This isn't a problem today (the UTXO set easily fits in the RAM of an inexpensive server-class machine),
but might eventually be at very large transaction volumes.
Initial block download is a problem today (it is annoying to have to wait several hours or days to sync up a new node),
and this scheme could make it orders of magnitude faster by shifting the time when full transaction data is broadcast from
initial block download to new transaction announcement.
Start with a network protocol where wallets broadcast transactions as:

The transaction's data (this is what current BCH protocol broadcasts in 'tx' messages)
PLUS, for every confirmed input of the transaction:

Data for the transaction that is being spent AND
A Merkle path for that previous transaction to the merkle root in the block header


... and the network protocol also requires that all new block messages are preceded by transaction
messages with previous inputs and merkle paths for every transaction in the block.
Let every node in this new network store a bit-vector for every block, where there are as many bits as there are
outputs in the block, and each bit store whether or not that output has been spent (1 == unspent, 0 == spent).
With the network protocol described above and bit-vectors for every block, the node has enough information to fully
validate every transaction.
Back-of-the-envelope calculations for gigabyte-sized blocks:  A gigabyte block full of 250-byte 2-input, 2-output
transactions would have 4 million outputs, which would be a 0.5 megabyte bit-vector. 10 years of gigabyte-sized blocks would fit
in 263 gigabytes of RAM, which is reasonable for a powerful server (servers with terabytes of RAM are common these days).
The bit-vectors for old blocks are HIGHLY compressible, because most outputs are spent (TODO: run some
BlockSci queries to see how quickly percentage unspent decays....), so the UTXO set can be much smaller on disk
(TODO: how much smaller?).
If lookup speed when processing new transactions is not a concern, the bit-vectors can be stored compressed in memory;
I'm sure there are clever data structures for efficiently querying and updating sparse bit-vectors.
Initial block download for a new node could be very efficient:

Query N peers for the current best block hash (to find tip of best-work chain)
Fetch all 80-byte block headers from peers
For each block, fetch compressed UTXO bit-vector from N peers; if they all match, great. If they disagree, then
maybe your peers are malicious, so... uhhh... do something........

The "something" could be:


Fail and complain loudly. This might be the best option if your IBD peers are hand-picked ("I mostly trust Coinbase and
MIT's DCI and my friend Bob, so I'll peer with them to bootstrap quickly")


Use a trust threshold: if 8 or 9 of 10 peers agree, then assume supermajority is correct and disconnect/ban the others.


Pick a different, random set of N peers and try again.


If you're really paranoid, fall back to downloading full blocks from archiving nodes.


I think the best option is to let node operators hand-pick one or more semi-trusted nodes to get fast boot-strapping;
that is simple and there are plenty of full-node operators who have the right incentives to always serve up correct
UTXO bit-vectors.
To ponder: is speeding up initial block download and saving memory at the cost of 2-3 times the bandwidth
for new transaction announcements the right tradeoff? A gigabit per second connection is 75 gigabytes
every ten minutes, so plenty of bandwidth for a few gigabytes of transaction data that translates into
a gigabyte-sized block.