@geoah
Last active March 2, 2016 17:42
[RFC]

Preface

TL;DR

This is a rant about what a decentralized network for storing structured data might look like.
The goal is to allow third party developers to build applications and services that use it as a data-store, and to allow users to switch between any application or service without losing their data.

Any feedback is greatly appreciated. Flames, opinions, comments, rants, etc.

Notes

There are a lot of things that this doc doesn't touch. Communication protocols, APIs, Authentication, and a million more. I have been going through this for many years now and my biggest concerns are the ones that are described here.

I'd really like to hear what others think of these and how you'd go about solving any of these issues. Comments on the Github gist, HN, forks, emails, anything would be awesome.

Thank you for taking the time.
Let's get to it now, shall we?

Figuring out a better way to store and share structured data

We consume and produce huge amounts of data directly or indirectly. Creating documents, sending messages, checking-in, etc.
Some of these data you share with friends, some with family, and some you keep to yourself.

Let's get one thing straight: I don't have a problem giving away my data to Google, Facebook, or the next "evil-corp".

I use Gmail, Facebook, Google Drive, Dropbox; I use Slack, Wire, seventeen more messaging apps, and at the end of it all I don't really mind them having access to my data. If they get hacked or my data goes missing, maybe I'll shrug it off with the occasional "meh". Google Drive, Dropbox, and other services allow developers to use them to access your files or other data, but that's about it.

I understand why I can't send an instant message, or a picture I just took to a friend that is not using the same application I am.
Developers always want to provide the best experience, unique features, and anything else they can provide to their users to keep them using their service. It makes sense, monetization requires users. Happy users.

I like email; I really do. I get to choose where my emails are stored, I can set up my own server, and others who want to contact me can use any email provider, any client, and attach anything they wish. Email service providers and email clients all use the same protocols and standards but try to differentiate themselves from the rest of the market by focusing on features and user experience. The one thing I don't like about email is that many users are locked into a provider for fear of losing their email address, though having your own domain name solves this issue relatively easily. (P.S. I don't like XMPP because it's an overcomplicated and bloated protocol that no-one wants to support.)

Email works nicely even after so many years because it's a protocol: it's standard and easy to implement and understand. Why can't we have something similar for the rest of our data? A protocol to share more than messages: images, location data, documents. Everything.

Rules

Let's set some ground rules and then try to figure out what such a protocol would look like, step by step.

  • Users must be able to give access to third party services or applications to create or consume data.
  • Third party applications must be able to store and consume any kind of structured data.
  • Users must be able to share these data with others or receive data from others.
  • Stored data must not be vendor specific. Any application should be able to make use of data it has been given access to.
  • Each user must be universally and uniquely identified (somehow).
  • Each user should be able to move their data between providers without losing data, disconnecting third party applications, or changing their identifier.

Table of contents

    1. Preface
    2. User Identities
    3. Network
    4. Data Structures
    5. Data Instances
    6. Similar Protocols
    7. Personal Preferences

User identities

We need a way to identify ourselves and others on this network and communicate to one-another.

Rules.

  • At this point we'll assume that this is a decentralized or federated network and that there will be no central identification authority.

Our options.

  • Hostname (user.provider.tld, user.tld)
  • Full URL/URI (https://user.provider.tld, http://provider.tld/~users/user)
  • E-Mail like (user@provider.tld, me@user.tld)
  • Cryptographic fingerprint (5B6AE881FCB37CD388654D2C8D2668FEE6020367)

I would go with the last one. A public key fingerprint maybe?

Hostnames, URLs, and email-like identifiers are pretty and let users easily figure out who the other user is, but they quickly become problematic: they tie users to specific providers (username.provider.tld, provider.tld/username, username@provider.tld) and create issues when a user wants to move their data to a different provider.

Personally I think PGP got it right 25 years ago: cryptographic fingerprints. Each user has a private/public key pair and uses their public key fingerprint to identify themselves to others. Something simpler than PGP/GPG would be nice; ECDSA seems like a good fit, and sub-keys would be awesome.
Future-proofing identities would require some kind of prefix/wrapping of the fingerprint that defines the hash/key types, such as IPFS's multihash.
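A minimal sketch of such a self-describing fingerprint, assuming sha2-256 and the registered multihash codes (0x12 for sha2-256, followed by the digest length). The key bytes here are a placeholder, not a real ECDSA key:

```python
import hashlib

# Multihash-style self-describing fingerprint: <hash-fn-code><digest-len><digest>.
# 0x12 is the registered multihash code for sha2-256.
SHA2_256_CODE = 0x12

def fingerprint(public_key_bytes: bytes) -> bytes:
    digest = hashlib.sha256(public_key_bytes).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest

# Hypothetical raw public key bytes; in practice this would be the
# serialized point of an ECDSA public key.
fp = fingerprint(b"\x04" + b"\x01" * 64)
print(fp.hex())  # 2-byte prefix + 32-byte digest = 34 bytes
```

Because the prefix names the hash function, the network could later move to a stronger hash without breaking existing identifiers.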

Using keys as a primary way of identification will allow users to sign their data and verify data created by others. Having a way to link a public key with a user's identity might be a nice plus. It also gives us some room to play around with the idea of a more decentralized or peer to peer network.

On the other hand such identifiers are impossible to remember or even cross check. DNS TXT records, HTML Meta Tags, even email headers could help with that, but this will probably create new issues. Web of trust could help I guess, at least in theory.

Other options such as custom user-created IDs, vanity URNs, etc. were never really considered, as they would assume the existence of a centralized registry or other authority.

Signing and encrypting data is a whole different issue altogether as the user's provider would need access to an unlocked key in the case of simple cryptos (RSA, DSA, ECDSA) or at least a subkey in the case of the more involved PGP/GPG; but let's leave this for another chapter.

Network

Rules.

  • This is a decentralized peer to peer network. (Could it even be something else?)

Our options.

  • Decentralized, publicly accessible nodes
  • Decentralized, routed nodes (somehow)
  • Decentralized, asymmetric, routed nodes

There is a very important decision to be made here. When I think of peer to peer networks, the first thing that comes into my mind is bit torrent, blockchains, kademlia, etc.
In all of these networks peers are both client and server; in our network, peers seem to be just servers. Clients are created by third party developers and should be able to authenticate and communicate with the specific peer(s) that host the user's data.
To accomplish this the peers would need to be publicly accessible, or we need to find a way to route data to them via other nodes or via server (peer) initiated two-way connections, though I am not quite sure how viable an option that is.

The last option is something a bit more complex and still not well thought out so bear with me.
The asymmetric model assumes that a user can have more than one node online at any given time; some of the nodes might be publicly accessible, some might not be. The publicly accessible nodes are considered proxies, and they are the nodes that clients and services connect to. Each identity needs one or more proxy nodes in order to be discoverable by the rest of the network and by any clients. This would also allow users to use public proxies that do not store data but only act as intermediaries for their private nodes, e.g. NAS boxes, desktop apps, etc. that would actually hold the data.

Asymmetric nodes are kind of my favorite, as they would work very nicely on top of a DHT-like network: all proxy nodes would be announced on the network, and private nodes could connect to them, redistribute data, and handle backups. Users could, for example, specify that all videos should be cached on private nodes forever but on proxy nodes only for a couple of days, as space would be more expensive there.
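As a rough illustration, here is what a private node's proxy announcement on the DHT might carry. Every field name here is an assumption of mine, not part of any defined protocol:

```python
import json
import time
from dataclasses import dataclass, asdict, field

# Hypothetical record a private node publishes so that clients can discover
# the proxies for an identity. All field names are illustrative.
@dataclass
class ProxyAnnouncement:
    identity: str           # multihash fingerprint of the user's public key, hex
    proxies: list = field(default_factory=list)  # publicly reachable endpoints
    issued_at: float = 0.0  # unix timestamp, lets peers discard stale records
    ttl_seconds: int = 3600 # how long the record should be considered valid

    def is_expired(self, now: float) -> bool:
        return now > self.issued_at + self.ttl_seconds

ann = ProxyAnnouncement(
    identity="1220" + "ab" * 32,
    proxies=["proxy1.example.net:4001", "proxy2.example.net:4001"],
    issued_at=time.time(),
)
print(json.dumps(asdict(ann), indent=2))
```

The TTL forces private nodes to periodically re-announce, which is also how stale proxies drop out of the network.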

Data Structures

All data are defined by schemas. Schemas are strongly typed and extensible.
Applications can be granted access to specific schemas after which they will be able to read, create, modify, and delete all data instances of those schemas.
All data instances are cryptographically signed by their owner.

Rules.

  • Schemas must be versioned and their versions must be immutable and verifiable.
  • Schemas should be forward and backwards compatible. Using a new version of a schema should not break existing data. Some of the existing binary formats already support this out of the box.
  • Instances created via these schemas should be reversible. eg. Having an instance we should be able to figure out the schema. (Good to have.)

Our options.

  • XML, RDF, XML-Something-Else
  • JSON-LD, JSON-Schema, JSON-Something-Else
  • AVRO, Protobuf, Thrift, Binary-Something-Else

If the goal of this protocol is to allow data interoperability between applications, it makes sense to require applications to stick to specific data formats (schemas). It would, for example, allow users to exchange messages using different applications that can all read/write the same "message" schema.

Having schemas will also allow applications to request access only to specific parts of their data. An application can get access to just messages and images, or even a more limited subset of messages.

The "defined by the protocol" part is the most painful decision. My first thought was to allow the developers to define their own schemas, something like XML Namespaces, Linked Data, etc. This would give more options to the developers but it would also tempt them to ignore existing schemas and create their own, as well as the fact that requiring permission for schemas "urn:ietf:params:xml:ns:vcard-4.0" or "http://foo.com/contact-card" doesn't really make any sense to the user. To remedy this we allow applications to define additional attributes to schemas. eg. A photo app that uses the "photo" schema can add a "filter" attribute to the photos it stores or a "people" attribute containing the people that are in the picture. Using a different application the user might lose some of this additional information but the information that was in the original schema is still there.

Using a strongly typed format such as AVRO, Protobuf, Thrift, Flatbuffers, etc will make signing the data structure easier than using a JSON based format, the canonicalization of which is a problem.
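To see why JSON canonicalization matters for signing: the same data can serialize to different bytes, and signatures are over bytes. As a stand-in for the deterministic encoding a binary format gives you for free, this sketch forces sorted keys and compact separators before hashing:

```python
import hashlib
import json

# JSON has no single canonical byte encoding, which makes signing fiddly.
# Sorting keys and stripping whitespace approximates a deterministic
# encoding; a binary format like Avro or Protobuf would define this for us.
def canonical_bytes(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

a = {"to": "alice", "body": "hi"}
b = {"body": "hi", "to": "alice"}  # same data, different key order

assert canonical_bytes(a) == canonical_bytes(b)
digest = hashlib.sha256(canonical_bytes(a)).hexdigest()
print(digest)
```

Note that even this is incomplete (it says nothing about number formatting or Unicode normalization), which is exactly the argument for a format with a well-defined binary encoding.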

Issues.

Selecting a format to represent schemas. There are various options out there: JSON-LD, JSON-Schema, JSON Activity Streams. Most JSON schemas seem a bit too limiting at times, don't really play nice with binary, and are a bit too generic. XML is always an option but... I don't really want to go there. Why not use something that was designed for exactly this purpose? AVRO, Protobuf, Thrift, Flatbuffers. They are strongly typed, support binary, can generate the data structures for your favorite language, and at least Protobuf and AVRO have standardized JSON mappings. Schema evolution is more or less already there.
Unfortunately there is a lot less work being done on schema normalization, fingerprinting, and linking here than in JSON/XML schemas, so new issues arise.

Data Instances

aka. What happens when you share something to someone.

Rules.

  • Users must be able to access data that has been shared with them even if the server of the user who shared it is offline.
  • All transferred data must be signed and must be verifiable.
  • Users should be able to specify permissions for shared instances.

I think it makes sense that when an application asks for all your messages, it shouldn't have to go through all the different servers of the users who have shared messages with you. The only way to solve this is to actually keep a local copy of all data that has been shared with you.

An optional way to specify the importance of specific schemas, instances, or users might be required to limit the amount of shared data a provider keeps. Data rotting is a fun addition: fungus-like behavior that slowly destroys data depending on how frequently it is accessed.

A standardized way to sign things would be nice, something like JOSE but for one of the binary formats (AVRO, Thrift, Protobuf, etc.), as JSON kinda sucks for binary payloads and, in addition, JSON normalization is the stuff of legends.
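A rough sketch of such a wrapper, where payload, permissions, and signature travel together. HMAC-SHA256 stands in for a real public-key signature (ECDSA, Ed25519) purely so the example runs with the standard library; the field layout is my own assumption:

```python
import hashlib
import hmac
import json

# Hypothetical signing key; a real design would use the owner's private key
# and a public-key signature scheme, not a shared-secret HMAC.
OWNER_KEY = b"owner-secret"

def wrap(payload: bytes, permissions: list) -> dict:
    """Bundle payload + permissions and sign over both."""
    header = json.dumps({"permissions": permissions}, sort_keys=True).encode()
    sig = hmac.new(OWNER_KEY, header + payload, hashlib.sha256).hexdigest()
    return {"payload": payload.hex(), "permissions": permissions, "signature": sig}

def verify(envelope: dict) -> bool:
    """Recompute the signature; fails if payload OR permissions were tampered with."""
    payload = bytes.fromhex(envelope["payload"])
    header = json.dumps({"permissions": envelope["permissions"]}, sort_keys=True).encode()
    expected = hmac.new(OWNER_KEY, header + payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])

env = wrap(b"hello", ["read:alice", "read:bob"])
print(verify(env))
```

Signing over the permissions together with the payload means a proxy node can't silently widen access, which is one argument for keeping permissions inside the envelope rather than alongside it.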

Sharing is a big pain, as always. I am not sure whether the permissions should be part of the actual instance or something stored alongside it. There are cases where it makes sense for the permissions to be part of the instance (an instant message could include the people it is shared with), but what happens when you want to allow users to re-assign permissions? Does it get copied and re-signed by the user who re-shares it? Or does the instance stay intact and just get forwarded to the new users who now have access to it? And if so, how do we handle updates from the original author? Do modifications cascade from node to node?

Similar Protocols

  • Tent - A protocol for personal data and communications.
  • OStatus - Open standard for distributed status updates.
  • FOSP - An application-level protocol for exchanging structured and unstructured data.
  • Matrix.org - Decentralised Group Communication.
  • Telehash - An embeddable private network stack for mobile, web, and devices.
  • Diaspora - Distributed social network.

Personal Preferences

These are my personal favorites.

  • Identities: Cryptographic fingerprint, multi-hash.
  • Network: Decentralized, DHT based, asymmetric with required proxied nodes.
  • Data Schema: Binary, with normalization and fingerprint.
  • Data Instances: Binary, with a wrapper that contains the payload, signature, and permissions.