@geoah
Last active March 2, 2016 17:42
[RFC]

Preface

TL;DR

This is a rant about what a decentralized network for storing structured data might look like.
The goal is to allow third party developers to build applications and services that use it as a data-store, and to allow users to switch between any application or service without losing their data.

Any feedback is greatly appreciated. Flames, opinions, comments, rants, etc.

Notes

There are a lot of things that this doc doesn't touch. Communication protocols, APIs, Authentication, and a million more. I have been going through this for many years now and my biggest concerns are the ones that are described here.

I'd really like to hear what others think of these and how you'd go about solving any of these issues. Comments on the Github gist, HN, forks, emails, anything would be awesome.

Thank you for taking the time.
Let's get to it now, shall we?

Figuring out a better way to store and share structured data

We consume and produce huge amounts of data directly or indirectly. Creating documents, sending messages, checking-in, etc.
Some of these data you share with friends, some with family, and some you keep to yourself.

Let's get one thing straight: I don't have a problem giving away my data to Google, Facebook, or the next "evil-corp".

I use Gmail, Facebook, Google Drive, Dropbox; I use Slack, Wire, seventeen more messaging apps, and at the end of it all I don't really mind them having access to my data. If they get hacked or my data goes missing, maybe I'll shrug it off with the occasional "meh". Google Drive, Dropbox, and other services allow developers to use them to access your files or other data, but that's about it.

I understand why I can't send an instant message, or a picture I just took to a friend that is not using the same application I am.
Developers always want to provide the best experience, unique features, and anything else they can provide to their users to keep them using their service. It makes sense, monetization requires users. Happy users.

I like email; I really do. I get to choose where my emails are stored, I can set up my own server, and others who want to contact me can use any email provider, any client, and attach anything they wish. Email service providers and email clients all use the same protocols and standards but try to differentiate themselves from the rest of the market by focusing on features and user experience. The one thing I don't like about email is that many users are locked into a provider for fear of losing their email address, though having your own domain name solves this issue relatively easily. (P.S. I don't like XMPP because it's an overcomplicated and bloated protocol that no-one wants to support.)

Email works nicely even after so many years because it's a protocol: it's standard and easy to implement and understand. Why can't we have something similar for the rest of our data? A protocol to share more than messages: images, location data, documents. Everything.

Rules

Let's set some ground rules and then try to figure out what such a protocol would look like, step by step.

  • Users must be able to give access to third party services or applications to create or consume data.
  • Third party applications must be able to store and consume any kind of structured data.
  • Users must be able to share these data with others or receive data from others.
  • Stored data must not be vendor specific. Any application should be able to make use of data it has been given access to.
  • Each user must be universally and uniquely identified (somehow).
  • Each user should be able to move their data between providers without losing data, disconnecting third party applications, or changing their identifier.

Table of contents

    1. Preface
    2. User Identities
    3. Network
    4. Data Structures
    5. Data Instances
    6. Similar Protocols
    7. Personal Preferences

User identities

We need a way to identify ourselves and others on this network and communicate to one-another.

Rules.

  • At this point we'll assume that this is a decentralized or federated network and that there will be no central identification authority.

Our options.

  • Hostname (user.provider.tld, user.tld)
  • Full URL/URI (https://user.provider.tld, http://provider.tld/~users/user)
  • E-Mail like (user@provider.tld, me@user.tld)
  • Cryptographic fingerprint (5B6AE881FCB37CD388654D2C8D2668FEE6020367)

I would go with the last one. A public key fingerprint maybe?

Hostnames, URLs, and email-like identifiers are pretty and let users easily figure out who the other user is, but they quickly become problematic: they tie users to specific providers (username.provider.tld, provider.tld/username, username@provider.tld) and create issues when a user wants to move their data to a different provider.

Personally I think PGP got it right 25 years ago: cryptographic fingerprints. Each user has a private/public key pair and uses their public key fingerprint to identify themselves to others. Something simpler than PGP/GPG would be nice; ECDSA seems like a good fit, and sub-keys would be awesome.
Future-proofing identities would require some kind of prefix/wrapping of the fingerprint that defines the hash/key types, such as IPFS's multihash.
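A minimal sketch of such a self-describing fingerprint, assuming sha2-256 and the registered multihash codes (0x12 for sha2-256, followed by the digest length). The key bytes here are a placeholder, not a real ECDSA key:

```python
import hashlib

# Multihash-style self-describing fingerprint: <hash-fn-code><digest-len><digest>.
# 0x12 is the registered multihash code for sha2-256.
SHA2_256_CODE = 0x12

def fingerprint(public_key_bytes: bytes) -> bytes:
    digest = hashlib.sha256(public_key_bytes).digest()
    return bytes([SHA2_256_CODE, len(digest)]) + digest

# Hypothetical raw public key bytes; in practice this would be the
# serialized point of an ECDSA public key.
fp = fingerprint(b"\x04" + b"\x01" * 64)
print(fp.hex())  # 2-byte prefix + 32-byte digest = 34 bytes
```

Because the prefix names the hash function, the network could later move to a stronger hash without breaking existing identifiers.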

Using keys as a primary way of identification will allow users to sign their data and verify data created by others. Having a way to link a public key with a user's identity might be a nice plus. It also gives us some room to play around with the idea of a more decentralized or peer to peer network.

On the other hand such identifiers are impossible to remember or even cross check. DNS TXT records, HTML Meta Tags, even email headers could help with that, but this will probably create new issues. Web of trust could help I guess, at least in theory.

Other options such as custom user-created IDs, vanity URNs, etc. were never really considered, as they would assume the existence of a centralized registry or other authority.

Signing and encrypting data is a whole different issue altogether as the user's provider would need access to an unlocked key in the case of simple cryptos (RSA, DSA, ECDSA) or at least a subkey in the case of the more involved PGP/GPG; but let's leave this for another chapter.

Network

Rules.

  • This is a decentralized peer to peer network. (Could it even be something else?)

Our options.

  • Decentralized, publicly accessible nodes
  • Decentralized, routed nodes (somehow)
  • Decentralized, asymmetric, routed nodes

There is a very important decision to be made here. When I think of peer to peer networks, the first thing that comes into my mind is bit torrent, blockchains, kademlia, etc.
In all of these networks peers are both client and server; in our network, peers seem to be just servers. Clients are created by third party developers and should be able to authenticate and communicate with the specific peer(s) that host the user's data.
To accomplish this the peers would need to be publicly accessible, or we need to find a way to route data to them via other nodes or via server (peer) initiated two-way connections, though I am not quite sure how viable an option that is.

The last option is something a bit more complex and still not well thought out so bear with me.
The asymmetric model assumes that a user can have more than one node online at any given time; some of the nodes might be publicly accessible, some might not be. The publicly accessible nodes are considered proxies, and they are the nodes that clients and services connect to. Each identity needs one or more proxy nodes in order to be discoverable by the rest of the network and by any clients. This would also allow users to use public proxies that do not store data but only act as intermediaries for their private nodes, e.g. NAS boxes, desktop apps, etc. that would actually hold the data.

Asymmetric nodes are kind of my favorite, as they would work very nicely on top of a DHT-like network: all proxy nodes would be announced on the network, and private nodes could connect to them, redistribute data, and handle backups. Users could, for example, specify that all videos should be cached on private nodes forever but on proxy nodes only for a couple of days, as space would be more expensive there.
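As a rough illustration, here is what a private node's proxy announcement on the DHT might carry. Every field name here is an assumption of mine, not part of any defined protocol:

```python
import json
import time
from dataclasses import dataclass, asdict, field

# Hypothetical record a private node publishes so that clients can discover
# the proxies for an identity. All field names are illustrative.
@dataclass
class ProxyAnnouncement:
    identity: str           # multihash fingerprint of the user's public key, hex
    proxies: list = field(default_factory=list)  # publicly reachable endpoints
    issued_at: float = 0.0  # unix timestamp, lets peers discard stale records
    ttl_seconds: int = 3600 # how long the record should be considered valid

    def is_expired(self, now: float) -> bool:
        return now > self.issued_at + self.ttl_seconds

ann = ProxyAnnouncement(
    identity="1220" + "ab" * 32,
    proxies=["proxy1.example.net:4001", "proxy2.example.net:4001"],
    issued_at=time.time(),
)
print(json.dumps(asdict(ann), indent=2))
```

The TTL forces private nodes to periodically re-announce, which is also how stale proxies drop out of the network.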

Data Structures

All data are defined by schemas. Schemas are strongly typed and extensible.
Applications can be granted access to specific schemas after which they will be able to read, create, modify, and delete all data instances of those schemas.
All data instances are cryptographically signed by their owner.

Rules.

  • Schemas must be versioned and their versions must be immutable and verifiable.
  • Schemas should be forward and backwards compatible. Using a new version of a schema should not break existing data. Some of the existing binary formats already support this out of the box.
  • Instances created via these schemas should be reversible. eg. Having an instance we should be able to figure out the schema. (Good to have.)

Our options.

  • XML, RDF, XML-Something-Else
  • JSON-LD, JSON-Schema, JSON-Something-Else
  • AVRO, Protobuf, Thrift, Binary-Something-Else

If the goal of this protocol is to allow data interoperability between applications, it makes sense to require applications to stick to specific data formats (schemas). It would, for example, allow users to exchange messages using different applications that can all read/write the same "message" schema.

Having schemas will also allow applications to request access only to specific parts of their data. An application can get access to just messages and images, or even a more limited subset of messages.

The "defined by the protocol" part is the most painful decision. My first thought was to allow the developers to define their own schemas, something like XML Namespaces, Linked Data, etc. This would give more options to the developers but it would also tempt them to ignore existing schemas and create their own, as well as the fact that requiring permission for schemas "urn:ietf:params:xml:ns:vcard-4.0" or "http://foo.com/contact-card" doesn't really make any sense to the user. To remedy this we allow applications to define additional attributes to schemas. eg. A photo app that uses the "photo" schema can add a "filter" attribute to the photos it stores or a "people" attribute containing the people that are in the picture. Using a different application the user might lose some of this additional information but the information that was in the original schema is still there.

Using a strongly typed format such as AVRO, Protobuf, Thrift, Flatbuffers, etc will make signing the data structure easier than using a JSON based format, the canonicalization of which is a problem.
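To see why JSON canonicalization matters for signing: the same data can serialize to different bytes, and signatures are over bytes. As a stand-in for the deterministic encoding a binary format gives you for free, this sketch forces sorted keys and compact separators before hashing:

```python
import hashlib
import json

# JSON has no single canonical byte encoding, which makes signing fiddly.
# Sorting keys and stripping whitespace approximates a deterministic
# encoding; a binary format like Avro or Protobuf would define this for us.
def canonical_bytes(obj) -> bytes:
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")

a = {"to": "alice", "body": "hi"}
b = {"body": "hi", "to": "alice"}  # same data, different key order

assert canonical_bytes(a) == canonical_bytes(b)
digest = hashlib.sha256(canonical_bytes(a)).hexdigest()
print(digest)
```

Note that even this is incomplete (it says nothing about number formatting or Unicode normalization), which is exactly the argument for a format with a well-defined binary encoding.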

Issues.

Selecting a format to represent schemas. There are various options out there: JSON-LD, JSON-Schema, JSON Activity Streams. Most JSON schemas seem a bit too limiting at times, don't really play nice with binary, and are a bit too generic. XML is always an option but... I don't really want to go there. Why not use something that was designed for exactly this purpose? AVRO, Protobuf, Thrift, Flatbuffers. They are strongly typed, support binary, can generate the data structures for your favorite language, and at least Protobuf and AVRO have standardized JSON mappings. Schema evolution is more or less already there.
Unfortunately there is a lot less work being done on schema normalization, fingerprinting, and linking here than in JSON/XML schemas, so new issues arise.

Data Instances

aka. What happens when you share something to someone.

Rules.

  • Users must be able to access data that has been shared with them even if the server of the user who shared it is offline.
  • All transferred data must be signed and must be verifiable.
  • Users should be able to specify permissions for shared instances.

I think it makes sense that when an application asks for all your messages, it shouldn't have to go through all the different servers of the users who have shared messages with you. The only way to solve this is to actually keep a local copy of all data that has been shared with you.

An optional way to specify the importance of specific schemas, instances, or users might be required to limit the amount of shared data a provider keeps. Data rotting is a fun addition: fungus-like behavior that slowly destroys data depending on how frequently it is accessed.

A standardized way to sign things would be nice, something like JOSE but for one of the binary formats (AVRO, Thrift, Protobuf, etc.), as JSON kinda sucks for binary payloads and, in addition, JSON normalization is the stuff of legends.
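A rough sketch of such a wrapper, where payload, permissions, and signature travel together. HMAC-SHA256 stands in for a real public-key signature (ECDSA, Ed25519) purely so the example runs with the standard library; the field layout is my own assumption:

```python
import hashlib
import hmac
import json

# Hypothetical signing key; a real design would use the owner's private key
# and a public-key signature scheme, not a shared-secret HMAC.
OWNER_KEY = b"owner-secret"

def wrap(payload: bytes, permissions: list) -> dict:
    """Bundle payload + permissions and sign over both."""
    header = json.dumps({"permissions": permissions}, sort_keys=True).encode()
    sig = hmac.new(OWNER_KEY, header + payload, hashlib.sha256).hexdigest()
    return {"payload": payload.hex(), "permissions": permissions, "signature": sig}

def verify(envelope: dict) -> bool:
    """Recompute the signature; fails if payload OR permissions were tampered with."""
    payload = bytes.fromhex(envelope["payload"])
    header = json.dumps({"permissions": envelope["permissions"]}, sort_keys=True).encode()
    expected = hmac.new(OWNER_KEY, header + payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["signature"])

env = wrap(b"hello", ["read:alice", "read:bob"])
print(verify(env))
```

Signing over the permissions together with the payload means a proxy node can't silently widen access, which is one argument for keeping permissions inside the envelope rather than alongside it.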

Sharing is a big pain, as always. I am not sure whether the permissions should be part of the actual instance or something stored alongside it. There are cases where it makes sense for the permissions to be part of the instance (an instant message could include the people it is shared with), but what happens when you want to allow users to re-assign permissions? Does it get copied and re-signed by the user who re-shares it? Or does the instance stay intact and just get forwarded to the new users who now have access to it? And if so, how do we handle updates from the original author? Do modifications cascade from node to node?

Similar Protocols

  • Tent - A protocol for personal data and communications.
  • OStatus - Open standard for distributed status updates.
  • FOSP - An application-level protocol for exchanging structured and unstructured data.
  • Matrix.org - Decentralised Group Communication.
  • Telehash - An embeddable private network stack for mobile, web, and devices.
  • Diaspora - Distributed social network.

Personal Preferences

These are my personal favorites.

  • Identities: Cryptographic fingerprint, multi-hash.
  • Network: Decentralized, DHT based, asymmetric with required proxied nodes.
  • Data Schema: Binary, with normalization and fingerprint.
  • Data Instances: Binary, with a wrapper that contains the payload, signature, and permissions.