Skip to content

Instantly share code, notes, and snippets.

@paulkernfeld
Last active September 5, 2016 19:29
Show Gist options
  • Save paulkernfeld/4278533bf83887f6f0ee67765c66d54d to your computer and use it in GitHub Desktop.
Save paulkernfeld/4278533bf83887f6f0ee67765c66d54d to your computer and use it in GitHub Desktop.
Exandria: A decentralized public file library

Exandria: A decentralized public file library

There are many great file repositories online: WikiLeaks and The Internet Archive are two prominent examples. Even these two notable sites are centralized, however. This means that censorship-happy governments can and do find ways to attack these sites by going after their founders, servers, or DNS records.

This article describes Exandria, a design for a file library with decentralized ownership. Any owner can vote to add or remove content, making Exandria extremely difficult to censor.

Properties:

  • Searching: users can search filenames
  • Downloading: the ability to download files
  • Censorship-resistant: it's hard to censor files
  • Spam-resistant: it's hard to spam the system
  • Zero-configuration: a new user can join the network instantly, without an invite

Many systems have some of these properties, but not all of them. If you know of another system that does or tries to do this, I would love to hear about it!

Overview

  • The Bitcoin blockchain is used as the canonical data source, containing links to file metadata.
  • One writes to the library by burning bitcoins.
  • Anyone can read from the library.
  • All data is transferred via P2P networks.

Data Model

The file library is represented as a mutable weighted set of files.

  • Set: an unordered collection of files.
  • Mutable: the library can change over time (i.e. files can be added and removed).
  • Weighted: each file has a weight, to represent that some files are more important than others.

Each file has a title, an extension, and a reference to the content of the file.

Here's an example of what this set might look like. Note that the "magnet" field contains a hash of the content of the file, which doubles as a way to locate the file.

{
    "name": "Moby Dick",
    "ext": "epub",
    "magnet": "magnet:?xt=urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C",
},
{
    "name": "Scott Joplin - The Entertainer",
    "ext": "mp3",
    "magnet": "magnet:?xt=urn:sha1:NAE52SJUQCZO5CYNCKHTQCWBTRNJIV4W",
},
...

Delta Encoding

Delta encoding is a way to represent a set using an append-only log. Changes to the set are represented as additions and removals.

A metadata stream "add" entry might look like this:

{
    "op": "add",
    "value": {
        "etx": "epub",
        "magnet": "magnet:?xt=urn:sha1:YNCKHTQCWBTRNJIV4WNAE52SJUQCZO5C",
        "title": "Moby Dick"
    }
}   

Each add entry will have a hash, which will be a hash of the serialized JSON of value, specifically ripemd160(sha256(value)). Multiple identical add entries will have the same hash, although it's possible for two add entries to be functionally identical but to have different hashes, e.g. if the JSON has different whitespace.

A metadata "remove" entry simply refers to the hash of an entry that has been added:

{
    "op": "remove",
    "hash": "c3f0fe269c05c3438d57c40526bd016d"
}

Weights

Since this is a weighted multi-set, we need some way to make weights work with delta encoding. We can say that each "add" operation increments the weight of an element and each "remove" operation decrements the weight of the element. An element is only a member of the final set if it has a positive weight.

This delta encoding:

{
    "op": "add",
    "value": "cat",
    "weight": 5
},
{
    "op": "add",
    "value": "cat",
    "weight": 8
},
{
    "op": "add",
    "value": "hat",
    "weight": 4
},
{
    "op": "remove",
    "hash": "80c7384a6339a053baee278cb13e578c",  // the hash of "hat"
    "weight": 6
}

...results in this set ("hat" has weight -2, so it's excluded):

{
    "value": "cat",
    "weight": 13
}

Architecture

[diagram 1]

The protocol consists of three layers:

  1. A layer which stores identities and commitments in the Bitcoin blockchain
  2. A layer for storing streams of file metadata
  3. A layer that can retrieve a file given a magnet link, using BitTorrent

Identities and commitments

Unlike in many other systems, identities are not stored in a centralized database. Instead, they are stored on the Bitcoin blockchain.

Each identity is just a cryptographic key pair. Identity public keys are recorded onto the Bitcoin blockchain.

You only need an identity to write to Exandria; you can read from it without an identity.

There are no moderators or administrators; all users are of the same type.

Identity weights

In order to become a contributor to the library, the user must burn some bitcoins to prove their commitment. This prevents spamming.

Each identity has an associated "weight," which is the total amount of bitcoins burned by that identity. This is used to relatively prioritize the contributions of each identity; the files posted by an identity with a higher weight will be ranked higher in search results.

Writing

In order to create a new identity or increase the commitment of an existing identity, the user burns some bitcoins.

Reading

In order to download a list of all identities, a client can look through the Bitcoin blockchain. This can be optimized to work with simplified payment verification.

Implementation

Burns will use a burn stream.

Each message will be a single commitment. It will include a version byte (0x00) and then a public key.

Metadata Streams

A file library requires storing a lot of data, but storing data on the Bitcoin blockchain is very expensive. Therefore, only identities are stored on the blockchain, and file metadata is stored separately, in metadata streams.

Each identity has exactly one metadata stream, which is a distributed append-only log. The stream is identified and discovered by the public key of the identity, and the contents of the log must be signed by the identity's private key. This paradigm is already used in secure-scuttlebutt and ppspp.

A metadata stream contains the add and remove operations allowing the user to make modifications to the global set of files.

The "weight" of an add or remove entry is measured in bitcoins. It is the weight of the identity divided by the total number of entries in the identity's metadata stream.

Searching

In order to search, the client will need to first download and index all metadata entries from all identities. Then, the client can perform the search locally. As new entries come in, the client will need to update its search index.

Retrieving Files

Each metadata entry will contain the a magnet link to a file, as well as the extension of the file. Given this, the file can be retrieved using BitTorrent and saved to the user's file system. This means that, in order to retrieve a file, there must be some node connected to the network that is storing that file.

Since a magnet link is basically just a hash of a file, it should be easy to substitute in other ways to retrieve files.

Issues

Objective vs. subjective weights

The goal of Exandria is to provide an objective weight for each element in the library. This may seem strange, since different files are relevant to different people. The reason for assigning objective weights is simple: new users should be able to start using the system instantly, without having to first decide whom they want to trust.

See "subjective trust filters" below for a way to layer subjective trust on top of the objective weights.

Index-writing incentives

People with a legitimate interest in the store must be willing to burn Bitcoins, so this scheme relies on the goodwill of people.

Here are some parties who might participate and affect the content of the library:

  • Individuals spending money on their hobbies
  • Non-profits donating money to fight censorship
  • Governments spending money to censor information
  • Corporations or lobby groups spending money on advertising

File storage incentives

Nodes that are connected to the BitTorrent network can altruistically provide data to new nodes joining the network. This means that the BitTorrent network is good for exchanging popular files but bad for exchanging unpopular files. This will make Exandria primarily useful for exchanging files that are of general interest.

Censorship is inevitable

If we don't allow items to be removed from the file library, will we have a library without censorship?

Not really. Even without removal, content can be effectively censored by creating spam entries with the correct metadata but incorrect magnet links.

Therefore, instead of trying to eliminate censorship, the goal of this design is to create a library with a good signal:noise ratio.

Future upgrades

Eigentrust

An Eigentrust-style trust graph would allow identities to testify for or against each other. The trust graph will make it so that control over content is given to a majority.

Without a trust graph, a single bad actor can cause a lot of disruption by deleting popular content. With a trust graph, the community can vote to silence bad actors. Of course, this is a type of censorship, but censorship is inevitable and community censorship is probably the best kind of censorship.

Each node in the trust graph is an identity. Each edge is directed and has a weight between -1 and 1. The starting weight for each node is the node's weight from burning bitcoins. The end weight of each node is computed by taking the eigenvalue of the graph.

Subjective trust filters

All weights in Exandria are global, which is suboptimal because different users are interested in different content. So, users could add custom client-side trust filters that upweight or downweight particular identities to provide better search results. For example, I could make a filter that downweights content from known scammers and upweights content from my friends.

Since these filters will be subjective, it's possible for many such filters to exist. Subjective trust filters could even be used to define communities in the network -- for example, a "gardening" community could share a filter that upweights only gardening-related content.

Trust filters would be shared out-of-band.

Search scalability

If each metadata JSON is limited to 1000 bytes, then we'll be able to store one million search records in 1 GB. This is reasonable for a personal computer, but at some point, the size of the metadata itself is going to become prohibitive.

Eventually, a client should be able to search the metadata without storing all metadata locally. This is challenging, because the client must quickly retrieve data from other computers and verify that that data is valid.

One potential solution would be to maintain a distributed and easy-to-verify trie of search terms, much like Bitcoin's ultimate blockchain compression proposal. This would let nodes download information from the trie when performing a search.

Another potential solution would be to have nodes only download identities (not file metadata) on startup, giving them information about which identities are trustworthy. Then, nodes could perform searches by querying trusted identities, using an RPC-style setup.

Additional metadata

Many types of metadata could be added to files, e.g. book author, film/song length, or image resolution. This could add a lot of complexity, since not all types of file have well-standardized metadata specifications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment