In Bitcoin we might want to have fractional storage of blocks and transaction state so that many more nodes can fully validate incoming blocks without sacrificing a lot of disk space. We propose a scheme that allows nodes to advertise which fraction of the data set they can provide and enables uniform distribution of such fractions among nodes. The scheme also allows reducing individual fraction at no extra cost to scale with the gbrowth of the network.
When started for the first time, node chooses a unique persistent 256-bit node identifier. This identifier will be used to determine which subset of data is being stored on that node. The node also chooses initial "fraction" parameter - a 32-bit unsigned integer that encodes a fraction between 0% and 100% (0x00000000 means 0% and 0xffffffff means 100%). For simplicity, the node combines these two parameters in one fraction identifier by replacing first 4 bytes of the node identifier with the fraction parameter in big-endian encoding.
A node advertises such fraction identifier to every node that connects to it. This way its peers can know which portion of the dataset is accessible on the node.
To decide whether to store a piece of data (or whether a peer with the given fraction identifier has the data), a node executes the following algorithm:
- Extract big-endian uint32 fraction from the fraction identifier.
- Compute SHA256(SHA256(identifier[4,28] || piece identifier)). Note that the first 4 bytes of the identifier are skipped (so that result does not depend on current fraction). For blocks, the piece identifier is simply a block height in big-endian uint32. For transactions and outputs, it is a transaction hash. For other kinds of data it can be any other meaningful identifier.
- Take the first 4 bytes of the resulting hash interpreted as position (big-endian uint32).
- Compare position with the earlier extracted fraction. If position is < fraction, then this piece of data should be stored for the given fraction identifier.
Nodes can decrease the amount of data they are willing to store without breaking their node identifier and re-downloading the entire blockchain simply by decrementing the fraction parameter. They can then clean up non-matching pieces at any later time or never. Note that incrementing fraction is not possible without re-downloading all pieces and rebuilding the entire state from scratch.
If several peers are able to provide a given piece, a node may choose the one which has the lowest position. This will provide a uniform load distribution across all nodes. Alternatively, node may use more complex scheme to determine which peer to choose (based on latency, bandwidth etc.) while using position as one of the factors.
As number of online peers grows and amount of data to be stored grows too, every peer may slowly decrease their fractions of data, keeping it within reasonable bounds, but still having enough peers to fetch missing pieces from.
This BIP does not discuss how exactly nodes should store or index their data set. It only discusses how to efficiently identify fractions and downsize them at will.
Using block height instead of block hash as a piece identifier is useful to allow nodes to download blocks before knowing their hashes. Also, forked blocks of the same height are guaranteed to be stored together for more efficient handling of chain reorganization.
We propose to store 4-byte fraction within 32-byte identifier (and reduce uniqueness to 28 bytes) for simplicity of handling a single 32-byte string instead of having a larger string or handling two values (fraction and node identifier) separately.
It is also prudent not to rely on a separate possible "node identifier" that may be used for other purposes. If one needs to change some other identifier, this scheme allows to keep the same storage fraction without re-downloading everything. Alternatively, if one needs to rebuild the storage from scratch, a storage identifier may change without affecting any other identifier.
Please see the attached ruby script.