gavinandresen/Balanced_Merkle_Forest.md

## Balanced_Merkle_Forest.md

      
    Raw
  

              Balanced_Merkle_Forest.md
            
          
    I want a cryptographic accumulator¹ that has the following properties:

It can be build incrementally using a small amount of persistent memory
Witness proofs are reasonably small.
Witness proofs are computationally easy to verify.
Must have at least 100-bit security.

The idea is to minimize on-chain storage; the accumulator will live on-chain, maintained by a smart contract.
Transactions submitted to the smart contract can add to the accumulator. And then later transactions can be created
that include proofs that a value is part of the accumulator.
For example, a lottery contract might work as follows:
Transactions are submitted to a lottery contract, each paying for one (or more) lottery 'tickets'.
The smart contract adds each 'ticket' to the accumulator, but does not remember every entry individually -- they
all get mushed together.
At some point, one or more winners are chosen (this is non-trivial, but I won't go into details for this gist).
The winner then has a certain amount of time to claim their winnings, by submitting a proof that their 'ticket' is
part of the accumulator. If the ticket is a winner and the proof is valid, the contract pays out.
The reason to use an accumulator and not just store every entry is scalability; blockchain storage is expensive,
and a lottery that cost less than a cent to enter is better than one that costs a dollar (or more if
transaction fees are high).
There are many academic papers on cryptographic accumulators. The simplest accumulator is a merkle tree, but a simple
merkle tree doesn't work for our application because it can't be built incrementally².
There are wicked-cool accumulator schemes that rely on pairing based cryptography, and one of them would probably
meet my criteria, but after spending a lot of time researching and thinking I decided not to go that route for two
reasons:  1) I don't think I'm smart enough to get the code right.  2) CPU costs for a pairing operations (in ethereum)
are just as bad as costs for on-chain storage.
Merkle trees are almost perfect, I just need to be able to build them incrementally. Here is how:
Instead of storing one merkle root as the accumulator, store log2(N) merkle roots, where N is the number of items
added. Merkle root 'i' will contain either nothing or will be a balanced merkle tree containing 2ⁱ items.
This structure can be built incrementally; the algorithm to add an entry is just a few lines of code:
if there is nothing at merkle_roots[0]:
  set merkle root 0 to the item being added (remember the new item)
otherwise:
  build a new 2-item balanced merkle tree using the new item and merkle_roots[0]
  set merkle_roots[0] to 'nothing'
  repeat this algorithm with the new 2-item tree and merkle_roots[1] 

merkle_roots[] is essentially counting in binary; after the first item is added, merkle_roots[0] will contain it.
After eleven items are added (binary 1011), merkle_roots[3] will be the root of an 8-element balanced merkle tree,
merkle_roots[2] will be empty, merkle_roots[1] will have 2 items, and merkle_roots[0] will have the last item added.
Anybody watching all of the accumulated values can easily create a full merkle tree, and create merkle paths
up to one of the merkle_roots.
Continuing our lottery example: a lottery with four billion entries (one for almost every person in the world) would
need just 32 entries in the merkle_roots[] array (32 hash values, about 900 bytes). A winner's proof would also be 32
hash values, and validating the proof is just 32 hash operations.
Removing a value from the accumulator, given a merkle path proving that the value is part of the
accumulator, is also easy-- all the information needed to update the accumulator is in the merkle path. The remove algorithm is:
 Use the path to verify that the value is in the accumulator; if it is not, return 'fail'
 set i = length of the path
 set roots[i] to 'nothing'
 for j in 0 to length of the path:
     combine path[j] with roots[j] using the add_item algorithm
 return 'success'

For example, start with eleven items in the accumulator, 0 through b. The merkle forest will have three trees--
one with eight items, one with two, and one with one. The accumulator will look like:
      h12345678        h9a        hb

... where hb is the hash of the last item, h9a is the root of the 2-element hash tree containing items 9 and ten, etc.
The merkle path proof for item number 3 can be just the array of hashes: [h4, h12, h5678] ... if we use
a hash-combining function that doesn't care about the order in which siblings are hashed (if the order
does matter then the merkle path has to keep track of left/right siblings). I use
if hash1 < hash2: return hash(hash1+hash2) else return hash(hash2+hash1). Hash together item number three
and [h4, h12, h45678] and you get the root hash h12345678, proving that item 3 is in the accumulator.
Add a twelfth item 'c', and it first gets combined with 'b' and then '9a', and the new accumulator state is:
    h12345678        h9abc

The merkle path proof for item number 3 is the same as before: [h4, h12, h5678]. Removing it means the h12345678
root disappears, and the sub-trees h4, h12, and h5678 get re-added:
    h12345678        h9abc   <-- accumulator state before removal
    h9abc                    <-- remove the root containing item 3
    h9abc    h4              <-- add the first item in the merkle path, h4
    h9abc    h12    h4       <-- add h12
    h56789abc    h12    h4   <-- add h5678, it combines with h9abc and we're done with removal.

Accumulator roots can change with every add or remove, invalidating any merkle
paths that might be part of transactions waiting to interact
with the smart contract (not a problem for a lottery, with distinct 'enter' and 'payout' time periods). I have created
another gist describing how old paths can
be updated as items are added or removed.
References

Efficient Asynchronous Accumulators for Distributed PKI : 2016 paper I found
that also describes the 'balanced merkle forest' data structure (but they don't use that name).
Footnotes

Footnotes


A cryptographic accumulator is a weird representation of set of values. You can add lots of values to the accumulator,
which mixes them all together in a clever way so they take up many fewer bits than just storing a list of all the values.
And you can create a "witness" proof for a value that is in the set that proves to somebody who only has the accumulator
that the value was, indeed, added to the accumulator. ↩


That's not actually true; you can build an extremely unbalanced merkle tree incrementally, but then one
of our other criteria (proof sizes must be reasonable) is violated. ↩