gavinandresen/balanced_merkle_path_update.md

## balanced_merkle_path_update.md

      
    Raw
  

              balanced_merkle_path_update.md
            
          
    Introduction

It would be spiffy to use the balanced merkle forest idea for ethereum tokens or to store unspent transaction outputs.
Tadge Dryja has been working on 'utreexo'
(presentation) for storing unspent transaction outputs in log(n) space;
this gist is inspired by, and is very similar to, that work.
So my previous gist describes really simple
algorithms for adding and removing items from a balanced merkle forest. This gist extends those operations to create
a map of 'adjustments' that can be used to update old merkle paths so they are valid after more items have been
added to, or removed from, the accumulator.
Addition

Adding items to the accumulator can make old, valid merkle paths a sub-path of a new, valid merkle path.
For example, starting with three items in a merkle forest (roots are h01 and h2):
    h01     h2
   /  \
  h0  h1

Merkle paths (proofs) for the three items are:
 (h0, [h1])
 (h1, [h0])
 (h2, [])

Adding one more item results in a merkle forest with a single non-empty root (h0123);
first, h3 is combined with h2 to produce the intermediate hash h23, and then
that is combined with h01 to produce the new top root h0123:
       h0123
     /      \
    h01     h23
   /  \    /  \
  h0  h1  h2  h3
  
Merkle paths are now:
 (h0, [h1, h23])
 (h1, [h0, h23])
 (h2, [h3, h01])
 (h3, [h2, h01])

We can maintain a map of 'adjustments' as follows:
Every time an old and new hash are combined during the add operation:
   set adjustments[old] = (EXTEND, new)
   set adjustments[new] = (EXTEND, old)

So after adding h3 to the three-element merkle forest, the adjustments map will
contain:
   h2 -> (EXTEND, h3)
   h3 -> (EXTEND, h2)
   h01 -> (EXTEND, h23)
   h23 -> (EXTEND, h01)

Updating the old merkle paths for addition is easy; compute the old root hash, and if
that root hash needs to be adjusted make the adjustment (and repeat until no more adjusting is needed):
E.g.  for merkle proof:  (h0, [h1])
Hash h0 and h1 to get root h01
adjustments[h01] is EXTEND h23, so proof becomes   (h0, [h1, h23])
Compute the root of that path: h0123
There is no adjustments[h0123] so nothing more to do: (h0, [h1, h23]) is the updated proof.

Updating the proof for item h2 is two adjustments:
old merkle proof: (h2, [])
root is h2 because path is empty
adjustments[h2] says EXTEND h3, so proof becomes (h2, [h3])
Compute h23
adjustments[h23] says EXTEND h01, so proof becomes (h2, [h3, h01])
Compute h0123; there is no adjustments[h0123] so we're done.

Removal

Updating merkle paths when items are removed is only slightly trickier than addition. First, lets consider just merkle forests with
single roots because they contains a power-of-two number of items.  E.g. start with a four-element merkle forest with root h0123:
       h0123
     /      \
    h01     h23
   /  \    /  \
  h0  h1  h2  h3

Remove h3, given proof path [h2, h01]). h3, h23, and h0123 all disappear, and h01 and h2
become root hashes in the forest:
    h01     h2
   /  \
  h0  h1

Merkle proofs that were valid before the Remove operation must be shortened to be valid
for the new forest; e.g. (h2, [h3, h01]) becomes just (h2, []).
The remove operation remembers the intermediate non-leaf-node hashes in the adjustments array:
  h23 -> REMOVED
  h01 -> REMOVED

Old merkle paths that went 'through' either h23 or h01 get truncated.
So given an old (item_hash, [merkle path]) proof, updating for deletions is:
for i in 0 to length of merkle_path:
   if adjustments[merkle_path[i]] is REMOVED: truncate the path

If the merkle forest contains a non-power-of-two number of items, then the hashes in the proof path of an item being deleted
may not become new roots but might be combined to form new roots. In that case, the adjustments array is also updated with EXTEND
entries, exactly like adding a new item. And updating an old proof is accomplished by running the 'update for deletion' code followed by
the 'update for additions' code.
Thumbnail sketch: space-efficient token contract

Ethereum ERC-20 tokens store everybody's token balance on the ethereum blockchain. For example, as I write this
Binance coin is the most popular ERC-20 token.
It has about 300,000 token holders and has had about million total on-chain transactions. On-chain transfers are uncommon,
averaging fewer than a dozen transfers per hour.
We could store merkle forest roots in a smart contract to kep track of token-holder records. The leaves of the merkle trees would be
(public_key, token_balance) pairs, recording who owns how many tokens. Transferring tokens is just validating that the
sender owns the tokens (validating merkle proofs and sender transaction signatures), then updating balances by removing and adding
leaves to the on-chain merkle forest.
The benefit is:
300,000 ownership records could be stored with only 19 on-chain root hashes: 608 bytes versus eleven megabytes of on-chain storage.
The cost is:
More complexity in the contract code; more chances for bugs. The merkle forest code is fairly simple,
but not as simple as _balances[_from] -= amount; balances[_to] += amount;
More complexity for token-holding wallets; they would have to be able to create an up-to-date merkle proof to one of
the roots stored in the contract, given the token-holder's public key.
More network bandwidth, because transfer transactions must include merkle proofs. Transactions could be batched
and merkle proofs near each other in the full merkle tree could be combined to save bandwidth, at the cost of even more code
complexity.
Thumbnail sketch: log(n)-space unspent transaction output storage or fast UTXO syncing

Bitcoin nodes validate transactions by ensuring that they only spend previously-unspent transaction outputs. Currently
they accomplish this by maintaing a database of every "UTXO".
An alternative design for a storage-constrained node would just keep a merkle forest accumulator of all of the unspent
transaction outputs, and rely on either wallets or heavy-weight nodes to include merkle path proofs that inputs
are part of the accumulator (along with the rest of the transaction data).
The node would also have to keep at least a few megabytes worth of 'adjustments', because it might receive a
transaction with merkle paths that refer to an old accumulator state.
It probably doesn't make sense to implement such a node; bandwidth tends to be the scaling bottleneck, not storage, and
transmitting merkle paths with transactions makes them significantly larger.
It might make sense to use a merkle forest accumulator for fast bootstrapping of new full nodes. A new node might fetch all
the block headers and get the merkle forest accumulator from some trusted source (either the blockchain if miners agree
to a soft or hard fork to include it in the block data or maybe from one or more trusted or semi-trusted peers).
It could then fetch the full UTXO set from peers and validate it against the accumulator.
But a Merklix tree
is better; it gets you proof-of-absence and only requires one root hash to be stored. The only advantage a balanced merkle forest
has over a Merklix tree is being able to add new values without a merkle path showing where in the tree the new
value should be inserted.