@ajtowns
Last active February 7, 2023 04:08

Idea: it would be nice for every node to maintain a rolling hash of utxo information that can be used to download an assumeutxo set. But you want to be able to do incremental downloads and check against the hash as you go; otherwise you might download many GB of data, only to have to discard it all because of a single bit error. Can we make it so muhash works for this, somehow?

How about:

  • rather than having a single muhash, maintain a set of muhash hashes each covering ~50MB of utxo data.
  • this is maybe 100kB of hash data given 5GB of utxo data, which seems trivial
  • if you store this state permanently every 10,080 blocks (10 weeks / 2.3 months) that's perhaps an extra 7.7MB, or 500kB per year
  • maintain the following PartialUtxoSetInfo for each hash:
    • muhash3072 (the hash!)
    • block_range (the range of utxos this covers, by the block height they were created in)
    • utxo_count (number of utxos covered)
    • utxo_serialized_size (serialized size of utxos covered)
    • utxo_value (total value of utxos covered)
  • when you get a new block, add all the new utxos to the most recent PartialUtxoSetInfo, and remove all the spent utxos from their corresponding PartialUtxoSetInfo (a rough sketch of this bookkeeping follows the list)
  • after processing the block, if the most recent PartialUtxoSetInfo has utxo_serialized_size > 50MB, create a new one
  • after processing a block with a height that's a multiple of 10,080, scan through all the PartialUtxoSetInfos, and if the combined utxo_serialized_size of two adjacent PartialUtxoSetInfos is less than 75MB, combine them (add the muhash3072 values, combine the block ranges, sum the count/serialized_size/value)
  • when reorging, if you're reorging to a block with height 10080*x, restore your PartialUtxoSetInfos from the stored copy (you can't undo a merge); otherwise just undo what you did when adding the block
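
A minimal sketch of this bookkeeping, assuming a toy stand-in for the hash: the MuHash class below just multiplies sha256-derived integers modulo a large constant, whereas the real MuHash3072 derives full 3072-bit group elements. The 50MB/75MB/10,080 constants and the field names come from the list above; everything else (helper names, the shape of the block data) is made up for illustration.

```python
import hashlib
from dataclasses import dataclass

MODULUS = 2**3072 - 1103717          # large odd modulus of the size MuHash3072 uses (constant assumed)
CHUNK_TARGET = 50 * 1000 * 1000      # start a new partial set once the current one exceeds ~50MB
MERGE_LIMIT = 75 * 1000 * 1000       # merge adjacent partial sets whose combined size is < 75MB
SNAPSHOT_INTERVAL = 10_080           # persist the descriptors every 10,080 blocks

def _elem(data: bytes) -> int:
    # toy element derivation; the real thing expands each element to a full 3072-bit value
    return int.from_bytes(hashlib.sha256(data).digest(), "little") or 1

class MuHash:
    def __init__(self, value: int = 1):
        self.value = value
    def insert(self, data: bytes):
        self.value = self.value * _elem(data) % MODULUS
    def remove(self, data: bytes):
        self.value = self.value * pow(_elem(data), -1, MODULUS) % MODULUS
    def combine(self, other: "MuHash"):
        self.value = self.value * other.value % MODULUS

@dataclass
class PartialUtxoSetInfo:
    muhash3072: MuHash
    block_range: tuple               # (first, last) creation heights covered
    utxo_count: int = 0
    utxo_serialized_size: int = 0
    utxo_value: int = 0

def connect_block(partials, new_utxos, spent_utxos, height):
    """new_utxos/spent_utxos: lists of (serialized_bytes, value, creation_height)."""
    cur = partials[-1]
    cur.block_range = (cur.block_range[0], height)
    # new utxos go into the most recent partial set
    for data, value, _created in new_utxos:
        cur.muhash3072.insert(data)
        cur.utxo_count += 1
        cur.utxo_serialized_size += len(data)
        cur.utxo_value += value
    # spent utxos come out of whichever partial set covers their creation height
    for data, value, created in spent_utxos:
        info = next(p for p in partials if p.block_range[0] <= created <= p.block_range[1])
        info.muhash3072.remove(data)
        info.utxo_count -= 1
        info.utxo_serialized_size -= len(data)
        info.utxo_value -= value
    # roll over to a fresh partial set once the current one passes ~50MB
    if cur.utxo_serialized_size > CHUNK_TARGET:
        partials.append(PartialUtxoSetInfo(MuHash(), (height + 1, height + 1)))
    if height % SNAPSHOT_INTERVAL == 0:
        # persist a copy of `partials` here (the every-2.3-months state), then compact
        merge_small_neighbours(partials)

def merge_small_neighbours(partials):
    i = 0
    while i + 1 < len(partials):
        a, b = partials[i], partials[i + 1]
        if a.utxo_serialized_size + b.utxo_serialized_size < MERGE_LIMIT:
            a.muhash3072.combine(b.muhash3072)   # muhashes combine by multiplication
            a.block_range = (a.block_range[0], b.block_range[1])
            a.utxo_count += b.utxo_count
            a.utxo_serialized_size += b.utxo_serialized_size
            a.utxo_value += b.utxo_value
            del partials[i + 1]
        else:
            i += 1
```

The property being relied on is that muhash doesn't care about insertion/removal order, so adjacent partial sets can be combined just by multiplying their hashes.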

Then, when downloading a utxo set, first choose a height that's a multiple of 10080, obtain the ~100kB set of PartialUtxoSetInfo descriptors, and download the 75MB-100MB of serialized utxo data for each pair of PartialUtxoSetInfos, verifying the muhash/utxo_count/utxo_value at each step.
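
Verification on the download side might look roughly like this (reusing the toy MuHash/PartialUtxoSetInfo types from the sketch above; the format for a chunk is assumed to just be the serialized utxos plus their values):

```python
def verify_chunk(info: PartialUtxoSetInfo, chunk):
    """Check one downloaded chunk -- a list of (serialized_bytes, value) pairs whose
    creation heights fall inside info.block_range -- against its descriptor."""
    check = MuHash()
    count = size = value = 0
    for data, val in chunk:
        check.insert(data)
        count += 1
        size += len(data)
        value += val
    if (check.value != info.muhash3072.value
            or count != info.utxo_count
            or size != info.utxo_serialized_size
            or value != info.utxo_value):
        raise ValueError("chunk doesn't match its PartialUtxoSetInfo -- re-download just this chunk")
```

So a corrupt (or deliberately bad) chunk only costs re-downloading that one 75MB-100MB piece rather than the whole multi-GB set.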

Then you could perhaps have blocks at heights 10080*x + 100 + [0,10080) commit to the hash of the PartialUtxoSetInfos at height 10080*x as an additional validation rule. (Maybe 1008 rather than 100? Long enough that the hash can easily be calculated in the background, and that reorgs past it are unlikely.)
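
For reference, that window maps a block height onto the snapshot it would commit to like this (a sketch using the 1008-block delay variant; the function name is just for illustration):

```python
SNAPSHOT_INTERVAL = 10_080
COMMIT_DELAY = 1_008     # or 100, per the text above

def committed_snapshot_height(block_height):
    """Snapshot height (a multiple of 10,080) whose PartialUtxoSetInfo hash a block
    at block_height would commit to; None before the first snapshot is old enough."""
    if block_height < SNAPSHOT_INTERVAL + COMMIT_DELAY:
        return None
    return (block_height - COMMIT_DELAY) // SNAPSHOT_INTERVAL * SNAPSHOT_INTERVAL
```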

I think you could reasonably construct the utxo set for block X=10080*x despite the tip (and your utxo set) being at some later height Y=X+k, as follows (a rough sketch in code follows the list):

  • have a separate file for each block_range
  • copy any current utxos from blocks <=X into those files (make sure you don't miss any utxos that aren't touched while you're doing this, but it's okay to miss any that were added/removed)
  • once you're done, note the new block height Z=Y+j
  • then scan through each rev*.dat undo file for heights X+1 to Z inclusive, and add any utxos created in blocks <=X to the files, deduping as necessary
  • order the utxos by block, txid, vout perhaps? useful for deduping, but otherwise not super important, since muhash doesn't depend on order of insertion/deletion
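
A rough sketch of that procedure, with the chainstate scan and the rev*.dat access abstracted behind assumed helpers (neither is a real Bitcoin Core API), reusing verify_chunk from above to check the result against the descriptors stored at height X:

```python
def reconstruct_utxo_set_at(X, partials_at_X, current_utxos, undo_data_for_height, get_tip_height):
    """current_utxos: iterable of (outpoint, creation_height, serialized_bytes, value)
    over the live utxo set; undo_data_for_height(h): the outputs spent by block h,
    as recorded in its rev*.dat undo data. Both are assumed helpers."""
    chunks = {info.block_range: {} for info in partials_at_X}

    def chunk_for(created):
        return next(c for (lo, hi), c in chunks.items() if lo <= created <= hi)

    # 1. copy everything still unspent that was created at or below X into per-range files
    for outpoint, created, data, value in current_utxos:
        if created <= X:
            chunk_for(created)[outpoint] = (data, value)
    Z = get_tip_height()   # the tip may have advanced (Y -> Z) while we were copying

    # 2. utxos created <= X but spent after X are missing from the live set; recover
    #    them from the undo data for heights X+1..Z (dict keys take care of deduping)
    for h in range(X + 1, Z + 1):
        for outpoint, created, data, value in undo_data_for_height(h):
            if created <= X:
                chunk_for(created).setdefault(outpoint, (data, value))

    # 3. check each reconstructed chunk against the descriptors stored at height X
    for info in partials_at_X:
        verify_chunk(info, list(chunks[info.block_range].values()))
    return chunks
```

Sorting within a chunk (by block, txid, vout) would only matter for serving the data deterministically; as the last list item says, the muhash itself doesn't depend on order.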

Bugs?

If you're doing 10080 blocks, then storing the set, and only then compressing them, maybe that ends up being significantly >100kB of description, which would be annoying. Worst case, blocks could create 50MB of utxos in ~50 blocks, so the above could be as bad as about 200 additional hashes in 10080 blocks, which might be an extra 84kB of data. So probably fine?
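
The arithmetic behind those numbers, with an assumed descriptor size of roughly 384 bytes of muhash plus a few tens of bytes for the range/count/size/value fields:

```python
blocks_per_chunk = 50                       # worst case: ~50 full blocks to create 50MB of utxos
extra_chunks = 10_080 // blocks_per_chunk   # ~200 extra partial sets per snapshot period
descriptor_bytes = 384 + 36                 # 3072-bit muhash + metadata (assumed)
print(extra_chunks * descriptor_bytes)      # 84420 bytes, i.e. roughly an extra 84kB
```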

37.5MB-50MB is significantly larger than the recommended piece size for torrents (256kB-1MB) and the block/tx size for bitcoin (2-4MB, 200B-400kB). Could have snapshots be larger (a factor of 10x, targeting ~500kB w/ 3-5MB per piece maybe?) and less frequent (a factor of 2.5 for once every 6 months, reducing it to a 4x increase overall?), though being less frequent would increase how much the previous bug hurts... Could keep the frequency at 10080 blocks, but only retain the last m of them, preventing reorgs greater than m*10080 blocks; but that's probably fine? (Well, it wouldn't prevent the reorg, it would just force you to recalculate this data all the way back to genesis, since you can't undo a merge. Though if you tracked hashes of the data, you could just download it from a peer and check the hash...)
