Skip to content

Instantly share code, notes, and snippets.

@oleganza
Last active August 29, 2015 14:00
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save oleganza/11373971 to your computer and use it in GitHub Desktop.
Save oleganza/11373971 to your computer and use it in GitHub Desktop.
BIP: Fractional Storage Identifier

BIP: Fractional Storage Identifier

Abstract

In Bitcoin we might want to have fractional storage of blocks and transaction state so that many more nodes can fully validate incoming blocks without sacrificing a lot of disk space. We propose a scheme that allows nodes to advertise which fraction of the data set they can provide and enables uniform distribution of such fractions among nodes. The scheme also allows reducing individual fraction at no extra cost to scale with the gbrowth of the network.

Scheme

When started for the first time, node chooses a unique persistent 256-bit node identifier. This identifier will be used to determine which subset of data is being stored on that node. The node also chooses initial "fraction" parameter - a 32-bit unsigned integer that encodes a fraction between 0% and 100% (0x00000000 means 0% and 0xffffffff means 100%). For simplicity, the node combines these two parameters in one fraction identifier by replacing first 4 bytes of the node identifier with the fraction parameter in big-endian encoding.

A node advertises such fraction identifier to every node that connects to it. This way its peers can know which portion of the dataset is accessible on the node.

To decide whether to store a piece of data (or whether a peer with the given fraction identifier has the data), a node executes the following algorithm:

  1. Extract big-endian uint32 fraction from the fraction identifier.
  2. Compute SHA256(SHA256(identifier[4,28] || piece identifier)). Note that the first 4 bytes of the identifier are skipped (so that result does not depend on current fraction). For blocks, the piece identifier is simply a block height in big-endian uint32. For transactions and outputs, it is a transaction hash. For other kinds of data it can be any other meaningful identifier.
  3. Take the first 4 bytes of the resulting hash interpreted as position (big-endian uint32).
  4. Compare position with the earlier extracted fraction. If position is < fraction, then this piece of data should be stored for the given fraction identifier.

Scalability

Nodes can decrease the amount of data they are willing to store without breaking their node identifier and re-downloading the entire blockchain simply by decrementing the fraction parameter. They can then clean up non-matching pieces at any later time or never. Note that incrementing fraction is not possible without re-downloading all pieces and rebuilding the entire state from scratch.

If several peers are able to provide a given piece, a node may choose the one which has the lowest position. This will provide a uniform load distribution across all nodes. Alternatively, node may use more complex scheme to determine which peer to choose (based on latency, bandwidth etc.) while using position as one of the factors.

As number of online peers grows and amount of data to be stored grows too, every peer may slowly decrease their fractions of data, keeping it within reasonable bounds, but still having enough peers to fetch missing pieces from.

Notes

This BIP does not discuss how exactly nodes should store or index their data set. It only discusses how to efficiently identify fractions and downsize them at will.

Using block height instead of block hash as a piece identifier is useful to allow nodes to download blocks before knowing their hashes. Also, forked blocks of the same height are guaranteed to be stored together for more efficient handling of chain reorganization.

We propose to store 4-byte fraction within 32-byte identifier (and reduce uniqueness to 28 bytes) for simplicity of handling a single 32-byte string instead of having a larger string or handling two values (fraction and node identifier) separately.

It is also prudent not to rely on a separate possible "node identifier" that may be used for other purposes. If one needs to change some other identifier, this scheme allows to keep the same storage fraction without re-downloading everything. Alternatively, if one needs to rebuild the storage from scratch, a storage identifier may change without affecting any other identifier.

Reference Implementation

Please see the attached ruby script.

#!/usr/bin/env ruby
require 'digest/sha2'
require 'openssl'
# Creates random identifier with a given fraction
def new_identifier_with_fraction(fraction = 0.5)
random_identifier = OpenSSL::Random.random_bytes(32)
fraction_int = (0xffffffff * fraction).to_i
return [fraction_int].pack("N") + random_identifier[4,28] # N stands for big-endian ('network') byte order
end
# Changes the fraction in this identifier. Returns existing identifier if new fraction is greater than the existing one.
def update_identifier_with_fraction(identifier, new_fraction)
new_fraction_int = (0xffffffff * new_fraction).to_i
fraction_int = identifier[0,4].unpack("N").first
if new_fraction_int > fraction_int
return identifier # cannot increase fraction without rescanning the entire dataset
end
return [new_fraction_int].pack("N") + identifier[4,28]
end
# Returns true if block with the given height is stored by a peer with this identifier.
def block_height_matches_identifier(block_height, identifier)
return piece_matches_identifier([block_height].pack("N"), identifier)
end
# Returns true if piece of data with the given hash is stored by a peer with this identifier.
def piece_matches_identifier(hash, identifier)
fraction_int = identifier[0,4].unpack("N").first
position = Digest::SHA256.digest(Digest::SHA256.digest(identifier[4,28] + hash))[0,4].unpack("N").first
return position < fraction_int
end
if $0 == __FILE__
id = new_identifier_with_fraction(1.0)
[1.0, 0.5, 0.01].each do |prob|
id = update_identifier_with_fraction(id, prob)
blocks_stored = 0
total_blocks = 10_000
total_blocks.times do |i|
blocks_stored += 1 if block_height_matches_identifier(i, id)
end
puts "Blocks for identifier #{id.unpack("H*").first} (#{(prob*100)}%): #{blocks_stored} / #{total_blocks}"
end
# Blocks for identifier ffffffffa836f5d8230133417652ad34889dcccfc9c83e7b2ebbeb0bf31ed80c (100.0%): 10000 / 10000
# Blocks for identifier 7fffffffa836f5d8230133417652ad34889dcccfc9c83e7b2ebbeb0bf31ed80c (50.0%): 5000 / 10000
# Blocks for identifier 028f5c28a836f5d8230133417652ad34889dcccfc9c83e7b2ebbeb0bf31ed80c (1.0%): 97 / 10000
end
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment