UPDATE working on this here!
so… I've had this weird idea recently...
In git (and in other secure + distributed systems) you have a tree of hashes where each object is identified by its hash and objects contain pointers to other objects. They just have the hashes of other objects stored inside them.
(this is known as a Content Addressable Store, but there does not seem to be any terminology for the design pattern of hash pointers)
I have been calling these hash pointers CYPHERLINKS (because it sounds much cooler).
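To make the idea concrete, here is a minimal sketch of a content-addressable store with one object linking to another. The names `store`, `put` and `get` are made up for illustration, and SHA-256 over sorted JSON is just an assumption for the example, not something this post prescribes.

```python
import hashlib
import json

store = {}  # hypothetical in-memory store: hash -> serialized bytes

def put(obj):
    """Serialize obj deterministically, store it under its SHA-256 hash, return the hash."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def get(key):
    """Load the object stored under key."""
    return json.loads(store[key])

# A cypherlink is just the hash of another object, stored as a value.
a = put({"text": "hello"})
b = put({"text": "world", "link": a})

print(get(get(b)["link"]))  # follows the cypherlink from b back to a
```

Storing the same content twice gives the same hash, so `put` is naturally idempotent.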
git is based on cypherlinks, and so are bittorrent and bitcoin! couchdb also uses cypherlinks to version documents.
The way that couchdb uses cypherlinks is much simpler than git's.
Git has 3 main types of objects - commits, trees, and blobs. Blobs are just snapshots of files, trees are snapshots of a directory and link to blobs and subtrees, and commits link to a tree and a previous commit (unless it's the first version).
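For blobs, the address is easy to reproduce by hand. With git's default SHA-1 object format, a blob's id is the SHA-1 of a small header (`blob`, a space, the size in bytes, a NUL byte) followed by the file content. The function name `git_blob_hash` is just for this sketch.

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Hash content the way git hashes a blob: SHA-1 over b'blob <size>' + NUL + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Should agree with `git hash-object` run on a file with the same bytes.
print(git_blob_hash(b"hello world\n"))
```

Trees and commits are hashed the same way, with their own header and a body that contains the hashes of the objects they point to.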
In couchdb - each document links to the previous version of that document.
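A simplified sketch of that version chain follows. This is not couchdb's actual revision format; the `prev` field and the helper names are assumptions made up for the example.

```python
import hashlib
import json

store = {}  # hypothetical store: hash -> serialized bytes

def put(obj):
    """Store obj under the SHA-256 of its sorted-JSON form and return the hash."""
    data = json.dumps(obj, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def history(key):
    """Yield each version, newest first, by following 'prev' links."""
    while key is not None:
        doc = json.loads(store[key])
        yield doc
        key = doc.get("prev")

v1 = put({"title": "draft", "prev": None})
v2 = put({"title": "final", "prev": v1})

print([doc["title"] for doc in history(v2)])
```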
camlistore implements much of this idea!
the cypherlink has some interesting properties:
- cypherlinks are immutable - if you "update" a document, it's now a different document. Instead, associate it with the previous version: cypherlink to the previous version.
- cypherlinks always point back in time. You have to know the value of an object before you can link to it.
- Thus, it's impossible to create cycles - because finding values where this holds is (computationally) infeasible:
  A === Hash(A + B)
  for any non-empty A, B
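These properties can be sketched with a store that refuses links to objects it has not seen yet. Everything here (`put`, the `links` field, SHA-256) is an assumption for illustration only.

```python
import hashlib
import json

store = {}  # hypothetical store: hash -> serialized bytes

def put(value, links=()):
    """Store value with its links; refuse links to hashes we have not seen yet."""
    for link in links:
        if link not in store:
            raise KeyError("cannot link to unknown object: " + link)
    data = json.dumps({"value": value, "links": sorted(links)}, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

v1 = put("first draft")
v2 = put("second draft", links=[v1])  # an "update" is a new object linking back
print(v1 != v2)                       # the original object is untouched

# For v1 to link forward to v2 it would need to contain v2's hash,
# but v2's hash already depends on v1's content.
try:
    put("loop", links=["0" * 64])     # a guessed hash that is not in the store
except KeyError as err:
    print("rejected:", err)
```

Since every link must already exist when an object is written, the links in such a store can only point at older objects.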
Previously, people (including myself) have been working on "git for data", replicating the architecture of git (commits, merges, etc)
But now, I'm thinking that the best idea is just to build a system for working with cypherlinks, and then use that to build git, if you want, or some arbitrary system.
The other thing that you get with cypherlinks is easy replication. You just have to figure out what you have that I don't have, and then send me that. You can use a merkle tree for that; I wrote a module for it last weekend.
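The simplest version of "figure out what you have that I don't" is a set difference over the hashes. This sketch compares the full key sets directly, which a merkle tree lets you avoid; it is not the module mentioned above, and `replicate` and `key_of` are made-up names.

```python
import hashlib

def key_of(data: bytes) -> str:
    """Address of an object: the SHA-256 of its bytes (an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()

def replicate(source: dict, target: dict) -> int:
    """Copy every object the target lacks; return how many were sent."""
    missing = source.keys() - target.keys()
    for key in missing:
        data = source[key]
        # The receiver can check each object against its address before storing it.
        if key_of(data) != key:
            raise ValueError("corrupt object: " + key)
        target[key] = data
    return len(missing)

mine = {key_of(d): d for d in [b"a", b"b", b"c"]}
yours = {key_of(d): d for d in [b"a"]}

sent = replicate(mine, yours)
print(sent, yours.keys() == mine.keys())
```

Because objects never change, there are no conflicts to resolve: an object is either present or missing.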
All data is stored in the same "bucket", immutable, and addressed by its hash.
Store objects in JSON, by their hashes, indexing everything. Sure it might take 3 times the space to store your data, but you don't have to write queries.
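Here is a rough sketch of what "indexing everything" could look like: every top-level field of every object goes into an inverted index as it is stored. The index layout and the names are assumptions, and nested values are skipped to keep it short.

```python
import hashlib
import json
from collections import defaultdict

store = {}                # hypothetical store: hash -> serialized bytes
index = defaultdict(set)  # (field, value) -> hashes of objects with that field value

def put(obj):
    """Store obj by hash and index each scalar top-level field."""
    data = json.dumps(obj, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    for field, value in obj.items():
        # Index scalars only; nested values are skipped in this sketch.
        if isinstance(value, (str, int, float, bool)):
            index[(field, value)].add(key)
    return key

p1 = put({"type": "post", "author": "alice"})
p2 = put({"type": "post", "author": "bob"})
c1 = put({"type": "comment", "author": "alice"})

# A "query" is just a set intersection over index entries.
posts_by_alice = index[("type", "post")] & index[("author", "alice")]
print(posts_by_alice == {p1})
```

Since objects are immutable, index entries never need updating, only adding.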
Build well defined sets of objects by traversing cypherlinks.
If X cypherlinks to Y, Y is a part of X.
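"Everything that is a part of X" is then the set of objects reachable from X. A small sketch, with a made-up `links` field and helper names:

```python
import hashlib
import json

store = {}  # hypothetical store: hash -> serialized bytes

def put(value, links=()):
    """Store value together with the hashes it links to; return its own hash."""
    data = json.dumps({"value": value, "links": list(links)}, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def parts(root):
    """Return the set of hashes reachable from root by following links."""
    seen, stack = set(), [root]
    while stack:
        key = stack.pop()
        if key in seen:
            continue
        seen.add(key)
        stack.extend(json.loads(store[key])["links"])
    return seen

leaf = put("leaf")
mid = put("mid", [leaf])
top = put("top", [mid, leaf])
other = put("unrelated")

print(len(parts(top)))  # top, mid and leaf, but not the unrelated object
```

The set is well defined because it depends only on the root hash: anyone who starts from the same hash ends up with the same set.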
You can index and traverse backlinks too,
but you can't validate those documents,
or verify that you have all of them.
If you want to certify a collection of backlinked objects, create a new document that cypherlinks to them.
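A sketch of a backlink index, and of certifying what it currently holds. The index is only as complete as the set of objects this particular store has seen, which is exactly the limitation described above. All names are assumptions.

```python
import hashlib
import json
from collections import defaultdict

store = {}                    # hypothetical store: hash -> serialized bytes
backlinks = defaultdict(set)  # target hash -> hashes of objects that link to it

def put(value, links=()):
    """Store value with its links and record a backlink for each link target."""
    data = json.dumps({"value": value, "links": sorted(links)}, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    for target in links:
        backlinks[target].add(key)
    return key

article = put("article")
c1 = put("first comment", [article])
c2 = put("second comment", [article])

# Copy the set: the index only knows about comments this store has seen so far.
known = set(backlinks[article])

# Certify this particular set by linking to it from a new object.
snapshot = put("comments as of today", [article, *known])
print(len(known))
```

The snapshot's hash now pins down an exact collection, even though more backlinks may exist elsewhere.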
With a merkle tree, it's easy to replicate any (sub)set of objects. Hashing is pretty fast (0.4 seconds to hash 30 MB on my laptop - that could be 1.5 million records!), so it would be pretty easy to build a hash tree just for each replication.
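Here is a small sketch of a throwaway hash tree built per replication: keys are grouped by their first hex digit, each group is folded into one root, and two peers only need to look closer at the groups whose roots differ. The bucketing scheme and function names are assumptions for illustration, not the design of any particular module.

```python
import hashlib

def h(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(keys):
    """Fold a collection of hex hashes into one root, pairing neighbours level by level."""
    level = sorted(keys)
    if not level:
        return h(b"")
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [h("".join(pair).encode("ascii")) for pair in pairs]
    return level[0]

def buckets(keys):
    """Group keys by first hex digit and give each group its own root."""
    groups = {}
    for key in keys:
        groups.setdefault(key[0], []).append(key)
    return {prefix: merkle_root(group) for prefix, group in groups.items()}

def differing_prefixes(mine, yours):
    """Prefixes whose roots differ; only those groups need a closer look."""
    a, b = buckets(mine), buckets(yours)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))

mine = {h(str(i).encode("ascii")) for i in range(1000)}
yours = set(mine)
extra = h(b"new record")
mine.add(extra)

print(differing_prefixes(mine, yours))  # only the bucket holding the new record
```

A deeper tree (bucketing on more digits) would narrow the difference down further before any object lists are exchanged.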
You could select a subset by search, or by traversing a cyphertree, then you could replicate that like git.
Except it would be pretty straightforward to build, say, a blog, where there is a markdown object and some JSON metadata that points to it. Then you sign it with crypto, and later people check this out and add comments that link back to the article.
They could send these comments back to your article, and by adding a new tree, you'd collect everything into that.
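A rough sketch of that blog as objects. Python's standard library has no public-key signatures, so an HMAC with a shared secret stands in for "sign it with crypto" here; a real system would use a private key so that anyone can verify. All field names and helpers are made up for the example.

```python
import hashlib
import hmac
import json

store = {}  # hypothetical store: hash -> serialized bytes

def put(obj):
    """Store obj under the SHA-256 of its sorted-JSON form and return the hash."""
    data = json.dumps(obj, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

SECRET = b"demo-key"  # stand-in only; a real system would use a private key

def sign(target):
    """Store a signature object that links to target."""
    mac = hmac.new(SECRET, target.encode("ascii"), hashlib.sha256).hexdigest()
    return put({"type": "signature", "signs": target, "mac": mac})

def verify(sig_key):
    """Recompute the MAC over the linked hash and compare."""
    sig = json.loads(store[sig_key])
    expected = hmac.new(SECRET, sig["signs"].encode("ascii"), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig["mac"], expected)

body = put({"type": "markdown", "text": "# Hello\nFirst post."})
meta = put({"type": "post", "title": "Hello", "body": body})
signed = sign(meta)

# Someone else's comment links back to the article.
comment = put({"type": "comment", "text": "Nice post!", "about": meta})

# A new tree collects the signed post and the comments the author received.
thread = put({"type": "thread", "post": signed, "comments": [comment]})
print(verify(signed))
```

Signing the hash of the metadata covers the markdown too, because the metadata contains the markdown's hash.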
But it doesn't need to stop there. hashed objects do not collide, so you could just put all the objects in the same namespace! I mean, all the objects on the internet!
There needs be no bounded database, just trees of cryptographically signed data floating in the ether...
@navaru as long as you take the direction of the edges into account, it can only be a tree.
to make this work as a graph (where cycles, etc, are possible) you'd have to index the edges,
and then ignore where they came from.
For example: As a tree
That would be the view of reality you'd get from reading D,
and then reading the documents that its hashes link to.
If you ignored the directions, (you'd have to index the links separately), you'd get this:
However, it's rather difficult to ignore the direction.
If you have a document, you know about the outgoing links,
but unless you actually have the documents that are the sources of those links,
(which would enable you to create a document linking to those sources, making it a rooted tree)
then you never know what is the total set of incoming links.
@mbrevoort so if you are worried about an important link disappearing, you just copy it to your machine.
Since it's immutable, it doesn't matter who has it. You can even redistribute it, and anyone looking for it will
be able to verify it's correct.
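A sketch of why it doesn't matter who serves the copy: the hash you ask for is also the check. Peers are modelled as plain dicts here, and `fetch_verified` is a made-up name.

```python
import hashlib

def fetch_verified(key, peers):
    """Return the first copy whose SHA-256 matches key, whoever served it."""
    for peer in peers:
        data = peer.get(key)
        if data is not None and hashlib.sha256(data).hexdigest() == key:
            return data
    raise LookupError("no peer had a valid copy of " + key)

original = b"an important document"
key = hashlib.sha256(original).hexdigest()

liar = {key: b"tampered"}  # serves the wrong bytes under the right name
mirror = {key: original}   # an honest copy on someone else's machine

print(fetch_verified(key, [liar, mirror]) == original)
```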
Also, the links are generally meaningful, although that meaning is dependent on what the object represents,
so consider a blog post that links to images, raw text, the public key of the author, etc - that is all data that is essential to displaying the post.
Incoming links, on the other hand, can be anything... a document could be linked to by comments, other blog posts, spam... anything. In the case of comments and other posts, the author of the original document is probably eager to know of documents that link to him. In this case, they could be replicated to his machine via something like a git push!
Naturally, we will need to start using an expanded hash at some point...
Any suggestions on the best way to do so are highly welcome!
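One possible approach, offered only as a suggestion and assuming "expanded hash" means moving to a longer digest later: prefix each link with the name of the algorithm, so old and new links can live side by side. The `algo:digest` format and the function names are made up for this sketch.

```python
import hashlib

def make_link(data: bytes, algo: str = "sha256") -> str:
    """Return a link of the form '<algo>:<hex digest>'."""
    return algo + ":" + hashlib.new(algo, data).hexdigest()

def check_link(link: str, data: bytes) -> bool:
    """Verify data against a link, using whichever algorithm the link names."""
    algo, _, digest = link.partition(":")
    return hashlib.new(algo, data).hexdigest() == digest

old = make_link(b"hello", "sha1")
new = make_link(b"hello", "sha256")

print(check_link(old, b"hello"), check_link(new, b"hello"))
```

The catch is that the same content then has more than one address, so a store would need to map between them.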