Vespucci is a proposed protocol that uses the Bitcoin blockchain for decentralized application discovery.
Although there have been other proposals on this topic, Vespucci is to my knowledge the first one that does not require the entire blockchain to be downloaded. This results in a massive performance improvement.
In order for an application to join a P2P network, the application must somehow find the IP address of one or more peers to connect to. This is called bootstrapping, and it can be difficult. BitTorrent and Bitcoin clients typically include hard-coded addresses of bootstrapping nodes, long-lived server nodes. This works well for established P2P networks with multiple bootstrap nodes like Bitcoin and BitTorrent. It doesn't work well for new and/or small P2P networks, because a small network might only have a single bootstrap node, creating a single point of failure.
Vespucci is designed to provide discovery for applications with the following requirements:
- The P2P application needs to download a global list of peers when it's first opened.
- Anyone should be able to register bootstrapping peers for discovery.
- Read performance, i.e. the time to read peers and join the network, is very important.
- Write performance, i.e. the time to register a new peer, is very unimportant.
- The ability to discover the Bitcoin network is assumed.
The issue of whether the Bitcoin protocol should be used for data storage is contentious, and I apologize to anyone who thinks this design is in poor taste.
Addresses of peers will be stored in Bitcoin transactions using OP_RETURN transaction outputs. In order to discover peers for the application, a client will look through the blockchain, returning information from relevant transactions to the application. To provide fast reads, the protocol only allows information to be written in blocks of particular height.
The data pushed after OP_RETURN will consist of:
- Two bytes,
V0
in ASCII, identifying the message as belonging to the Vespucci protocol. - A few additional bytes identifying the application.
- A zero byte to mark the end of the application ID.
- A list of compressed addresses.
Since the Blockchain is a shared resource, we want to be sure to use it wisely. We can use space efficiently by compressing addresses and allowing multiple addresses to be batched into the same transaction.
Uncompressed addresses will be zero-terminated generic URIs. This allows us to support IP addresses as well as hostnames.
scheme:[//[user:password@]host[:port]][/]path[?query][#fragment]
Information to include or not include:
- We may or may not need to store
scheme
(e.g.http
ormagnet
), because that information can may be inferred by the application. - It doesn't make much sense to include a
user
andpassword
in Bitcoin, a publicly-viewable store! - The
host
field will always be populated - The
port
field may be populated, or it may be inferred by the application - The
path
,query
, andfragment
fields should not be populated
In order to compress URLs, we'll want to use a small-string compressing library specifically trained on URL-looking data. shoco allows users to do just this.
The Bitcoin blockchain is a log-structured data store, optimized for great write performance, but not designed for reading. The blockchain currently grows by about 25 GB/year, and is not indexed only by time. This presents a dilemma: a P2P application that's five years old would have to download 125 GB of data if searching linearly, even in the unlikely event that the block size limit is not increased. That's a lot of data to download just to get some addresses! So, how can we turn the write-optimized blockchain into a read-optimized discovery store?
To solve this, we download and look through blocks in an order that maximizes read performance: in Vespucci, information can only be stored in blocks of particular height. Of course, this means that one might need to wait hours, days, or weeks to make a write! The algorithm will be as follows:
- WLOG, label the first block that we care about block 0. Label subsequent blocks in height order: 1, 2, 3, ...
- Define the current maximum block height as
M
.
- Set integer
K
such thatK := floor(log2(M+1))
. - Download and look at blocks where block height
B
is a multiple of2 ^ K
, skipping those that we have already looked at. Should we look at these blocks in ascending or descending order? - If
K > 0
, setK := K - 1
. Otherwise (ifK = 0
), we have looked at all blocks.
This can also be thought of as writing block heights in binary and ordering by the number of zeros at the end. An example:
Blocks: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
Binary: 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101
End zeros: 4 0 1 0 2 0 1 0 3 0 1 0 2 0
DL order: 0 8 12 4 10 6 2 13 11 9 7 5 3 1
This means that it may be necessary to wait days, weeks or months to make a write! For discovery, this shouldn't be a showstopper.
In general, each application will pick an initial block height, B0
, from which to start looking for discovery information. Just as when picking port numbers, applications should try to avoid synchronizing initial block heights, in order to avoid crowding at important blocks.
To optimize further, applications could stop looking when K
reaches a certain value. For example, if we only look until K = 10
, the maximum write time is a week, and the application is guaranteed to never have to look through more than 1/1024 of the blockchain, which would be 25 MB/year currently.
The downloaded data can be verified with various degrees of rigor, depending on the desired tradeoff between performance and security.
The most secure method of verifying the data is to download the entire blockchain in order, verifying and discarding each block as we go. This is the same strategy used by full Bitcoin clients. Since we have to look at the entire blockchain in order, this defeats the purpose of the skipping-around download order.
A client could download all block headers [https://gist.github.com/gavinandresen/1059233], then use those to verify the retrieved full blocks. This is Satoshi's simplified payment verification scheme, implemented for example with headers-first mode in btcd. This could be optimized by including block header information in the application itself.
Since most nodes in the Bitcoin network are trustworthy, it might work just fine to accept all downloaded blocks as-is. But... is it possible to request blocks by height? I think not.
It's possible that non-Vespucci messages might begin with the Vespucci and application prefixes, either by chance or by malice. Given this, Vespucci clients should tolerate:
- Malformed addresses
- Addresses pointing to malicious bootstrap nodes
- Decompression attacks taking advantage of the compression protocol
How can we time transactions to be inserted into the blockchain at just the right block height? Should we allow a fudge factor of one block?
Is it possible to request blocks by height? I think not.
Do we want to download blocks that are multiples of 2 ^ K
in ascending or descending order? It seems like we might want to privilege newer information for the use case of discovery.
Blockname is a similar project, a Bitcoin blockchain DNS cache. Blockname is trying to solve a slightly harder problem, that of creating a 1:1 mapping from domain name to server address. In Vespucci, each application may return a set of addresses.
GitTorrent includes a proposal for username registration, based on blockname.
Ultimate blockchain compression is a proposal for a hard fork of the Bitcoin network that would allow trust-free lite nodes. Ultimate blockchain compression would make Vespucci obsolete by allowing clients to do very fast lookups of address balances.