Skip to content

Instantly share code, notes, and snippets.

@gavinandresen
Last active December 14, 2023 23:28
Show Gist options
  • Save gavinandresen/f209a02ee559905aa69bf56e3b41040c to your computer and use it in GitHub Desktop.
Save gavinandresen/f209a02ee559905aa69bf56e3b41040c to your computer and use it in GitHub Desktop.
Storing the UTXO as a bit-vector

Half-baked thoughts exploring a different way of implementing a fully-validating BCH node.

The idea is to shift the storage of full transaction data to wallets, and explore how little data a fully validating node could store. This isn't a problem today (the UTXO set easily fits in the RAM of an inexpensive server-class machine), but might eventually be at very large transaction volumes.

Initial block download is a problem today (it is annoying to have to wait several hours or days to sync up a new node), and this scheme could make it orders of magnitude faster by shifting the time when full transaction data is broadcast from initial block download to new transaction announcement.

Start with a network protocol where wallets broadcast transactions as:

  • The transaction's data (this is what current BCH protocol broadcasts in 'tx' messages)
  • PLUS, for every confirmed input of the transaction:
    • Data for the transaction that is being spent AND
    • A Merkle path for that previous transaction to the merkle root in the block header

... and the network protocol also requires that all new block messages are preceded by transaction messages with previous inputs and merkle paths for every transaction in the block.

Let every node in this new network store a bit-vector for every block, where there are as many bits as there are outputs in the block, and each bit store whether or not that output has been spent (1 == unspent, 0 == spent).

With the network protocol described above and bit-vectors for every block, the node has enough information to fully validate every transaction.

Back-of-the-envelope calculations for gigabyte-sized blocks: A gigabyte block full of 250-byte 2-input, 2-output transactions would have 4 million outputs, which would be a 0.5 megabyte bit-vector. 10 years of gigabyte-sized blocks would fit in 263 gigabytes of RAM, which is reasonable for a powerful server (servers with terabytes of RAM are common these days).

The bit-vectors for old blocks are HIGHLY compressible, because most outputs are spent (TODO: run some BlockSci queries to see how quickly percentage unspent decays....), so the UTXO set can be much smaller on disk (TODO: how much smaller?). If lookup speed when processing new transactions is not a concern, the bit-vectors can be stored compressed in memory; I'm sure there are clever data structures for efficiently querying and updating sparse bit-vectors.

Initial block download for a new node could be very efficient:

  1. Query N peers for the current best block hash (to find tip of best-work chain)
  2. Fetch all 80-byte block headers from peers
  3. For each block, fetch compressed UTXO bit-vector from N peers; if they all match, great. If they disagree, then maybe your peers are malicious, so... uhhh... do something........

The "something" could be:

  • Fail and complain loudly. This might be the best option if your IBD peers are hand-picked ("I mostly trust Coinbase and MIT's DCI and my friend Bob, so I'll peer with them to bootstrap quickly")

  • Use a trust threshold: if 8 or 9 of 10 peers agree, then assume supermajority is correct and disconnect/ban the others.

  • Pick a different, random set of N peers and try again.

  • If you're really paranoid, fall back to downloading full blocks from archiving nodes.

I think the best option is to let node operators hand-pick one or more semi-trusted nodes to get fast boot-strapping; that is simple and there are plenty of full-node operators who have the right incentives to always serve up correct UTXO bit-vectors.

To ponder: is speeding up initial block download and saving memory at the cost of 2-3 times the bandwidth for new transaction announcements the right tradeoff? A gigabit per second connection is 75 gigabytes every ten minutes, so plenty of bandwidth for a few gigabytes of transaction data that translates into a gigabyte-sized block.

@mariodian
Copy link

the latest XBox One X games seem to all require 50 to 100+ Gig download sync's per new game purchase these days

@1mbtxn is every XBox owner constantly broadcasting this data to other users? If not then the comparison is a bit off. It's a big difference downloading n * 50GB of data per year and downloading and most importantly uploading gigabytes of data per day 24/7.

@Danconnolly
Copy link

The UTXO bit-vector changes. Imagine the following scenario: you connect to a peer and start syncing. A block is found which spends a UTXO from a block for which you have not yet retrieved the UTXO bit-vector. Will you receive the updated UTXO bit-vector or the original?

It's not reasonable to expect a peer to keep track of all the peers that are syncing.

One option: send the height of the chain with each bit-vector, but this would increase transmission size considerably with a lot of redundant data.

Another option: make sure that the peer sends the block announce before sending anymore bit-vectors but this depends on good behaviour of the peer, what would happen if it misbehaved? Something to think about.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment