paulkernfeld/vespucci.md

## vespucci.md

      
    Raw
  

              vespucci.md
            
          
    Vespucci

Vespucci is a proposed protocol that uses the Bitcoin blockchain for decentralized application discovery.
Although there have been other proposals on this topic, Vespucci is to my knowledge the first one that does not require the entire blockchain to be downloaded. This results in a massive performance improvement.
The Problem

In order for an application to join a P2P network, the application must somehow find the IP address of one or more peers to connect to. This is called bootstrapping, and it can be difficult. BitTorrent and Bitcoin clients typically include hard-coded addresses of bootstrapping nodes, long-lived server nodes. This works well for established P2P networks with multiple bootstrap nodes like Bitcoin and BitTorrent. It doesn't work well for new and/or small P2P networks, because a small network might only have a single bootstrap node, creating a single point of failure.
Vespucci is designed to provide discovery for applications with the following requirements:

The P2P application needs to download a global list of peers when it's first opened.
Anyone should be able to register bootstrapping peers for discovery.
Read performance, i.e. the time to read peers and join the network, is very important.
Write performance, i.e. the time to register a new peer, is very unimportant.
The ability to discover the Bitcoin network is assumed.

Blockchain Storage Politics

The issue of whether the Bitcoin protocol should be used for data storage is contentious, and I apologize to anyone who thinks this design is in poor taste.
The Protocol

Addresses of peers will be stored in Bitcoin transactions using OP_RETURN transaction outputs. In order to discover peers for the application, a client will look through the blockchain, returning information from relevant transactions to the application. To provide fast reads, the protocol only allows information to be written in blocks of particular height.
What is stored?

The data pushed after OP_RETURN will consist of:

Two bytes, V0 in ASCII, identifying the message as belonging to the Vespucci protocol.
A few additional bytes identifying the application.
A zero byte to mark the end of the application ID.
A list of compressed addresses.

Since the Blockchain is a shared resource, we want to be sure to use it wisely. We can use space efficiently by compressing addresses and allowing multiple addresses to be batched into the same transaction.
Addresses

Uncompressed addresses will be zero-terminated generic URIs. This allows us to support IP addresses as well as hostnames.
scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Information to include or not include:

We may or may not need to store scheme (e.g. http or magnet), because that information can may be inferred by the application.
It doesn't make much sense to include a user and password in Bitcoin, a publicly-viewable store!
The host field will always be populated
The port field may be populated, or it may be inferred by the application
The path, query, and fragment fields should not be populated

Compression

In order to compress URLs, we'll want to use a small-string compressing library specifically trained on URL-looking data. shoco allows users to do just this.
Downloading Data

Download Order

The Bitcoin blockchain is a log-structured data store, optimized for great write performance, but not designed for reading. The blockchain currently grows by about 25 GB/year, and is not indexed only by time. This presents a dilemma: a P2P application that's five years old would have to download 125 GB of data if searching linearly, even in the unlikely event that the block size limit is not increased. That's a lot of data to download just to get some addresses! So, how can we turn the write-optimized blockchain into a read-optimized discovery store?
To solve this, we download and look through blocks in an order that maximizes read performance: in Vespucci, information can only be stored in blocks of particular height. Of course, this means that one might need to wait hours, days, or weeks to make a write! The algorithm will be as follows:

WLOG, label the first block that we care about block 0. Label subsequent blocks in height order: 1, 2, 3, ...
Define the current maximum block height as M.


Set integer K such that K := floor(log2(M+1)).
Download and look at blocks where block height B is a multiple of 2 ^ K, skipping those that we have already looked at. Should we look at these blocks in ascending or descending order?
If K > 0, set K := K - 1. Otherwise (if K = 0), we have looked at all blocks.

This can also be thought of as writing block heights in binary and ordering by the number of zeros at the end. An example:
Blocks:    0     1     2     3     4     5     6     7     8     9     10    11    12    13
Binary:    0000  0001  0010  0011  0100  0101  0110  0111  1000  1001  1010  1011  1100  1101
End zeros: 4     0     1     0     2     0     1     0     3     0     1     0     2     0

DL order:  0  8  12  4  10  6  2  13  11  9  7  5  3  1

This means that it may be necessary to wait days, weeks or months to make a write! For discovery, this shouldn't be a showstopper.
In general, each application will pick an initial block height, B0, from which to start looking for discovery information. Just as when picking port numbers, applications should try to avoid synchronizing initial block heights, in order to avoid crowding at important blocks.
To optimize further, applications could stop looking when K reaches a certain value. For example, if we only look until K = 10, the maximum write time is a week, and the application is guaranteed to never have to look through more than 1/1024 of the blockchain, which would be 25 MB/year currently.
Verifying Downloaded Data

The downloaded data can be verified with various degrees of rigor, depending on the desired tradeoff between performance and security.
Download the entire blockchain

The most secure method of verifying the data is to download the entire blockchain in order, verifying and discarding each block as we go. This is the same strategy used by full Bitcoin clients. Since we have to look at the entire blockchain in order, this defeats the purpose of the skipping-around download order.
Download block headers only

A client could download all block headers [https://gist.github.com/gavinandresen/1059233], then use those to verify the retrieved full blocks. This is Satoshi's simplified payment verification scheme, implemented for example with headers-first mode in btcd. This could be optimized by including block header information in the application itself.
Don't verify blocks

Since most nodes in the Bitcoin network are trustworthy, it might work just fine to accept all downloaded blocks as-is. But... is it possible to request blocks by height? I think not.
Issues

Protocol interference

It's possible that non-Vespucci messages might begin with the Vespucci and application prefixes, either by chance or by malice. Given this, Vespucci clients should tolerate:

Malformed addresses
Addresses pointing to malicious bootstrap nodes
Decompression attacks taking advantage of the compression protocol

Writing data

How can we time transactions to be inserted into the blockchain at just the right block height? Should we allow a fudge factor of one block?
Ultra-fast operation

Is it possible to request blocks by height? I think not.
Ascending or descending?

Do we want to download blocks that are multiples of 2 ^ K in ascending or descending order? It seems like we might want to privilege newer information for the use case of discovery.
Related Work

Blockname is a similar project, a Bitcoin blockchain DNS cache. Blockname is trying to solve a slightly harder problem, that of creating a 1:1 mapping from domain name to server address. In Vespucci, each application may return a set of addresses.
GitTorrent includes a proposal for username registration, based on blockname.
Ultimate blockchain compression is a proposal for a hard fork of the Bitcoin network that would allow trust-free lite nodes. Ultimate blockchain compression would make Vespucci obsolete by allowing clients to do very fast lookups of address balances.