  • Distributed database with content-addressable values (a minimal store sketch follows this list).
  • Node types are blobs and trees.
  • Hashes use a secure hash (not sha1), since distributed p2p systems contain many untrusted nodes. (Or we could leave security to a higher layer and use this only for the hard network-scale problem.)
  • The actual named refs can still live on a central server with user authorization. Making refs secure in a distributed system is considerably harder.
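
A minimal sketch of the content-addressable store idea, assuming sha256 as the secure hash; the `save`/`load` helpers are hypothetical names for illustration, not part of any existing API:

```ts
import { createHash } from "node:crypto";

type Hash = string; // hex digest doubles as the storage key

// The key is derived entirely from the content, so identical blobs or
// trees hash to the same key and are stored only once.
function hashValue(body: Buffer): Hash {
  return createHash("sha256").update(body).digest("hex");
}

const store = new Map<Hash, Buffer>();

function save(body: Buffer): Hash {
  const hash = hashValue(body);
  store.set(hash, body); // re-saving identical bytes is a no-op
  return hash;
}

function load(hash: Hash): Buffer | undefined {
  return store.get(hash);
}
```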

For example, suppose I wanted to download the package foo at version ~0.2.3. My package client would go to the central ref server and query the tags of foo. It would see that foo has versions 0.2.3, 0.2.4, and 0.3.0, and decide that 0.2.4 is the desired version based on the original range query. Along with the refs, the central server also returns the hashes of the root tree nodes for these package versions, so the client now knows which hash contains the desired package.
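
The resolution step might look like the sketch below. The ref server and its `queryTags` method are hypothetical; `maxSatisfying` is the real helper from the npm `semver` package that picks the highest version matching a range:

```ts
import { maxSatisfying } from "semver";

interface TagInfo {
  version: string;  // e.g. "0.2.4"
  treeHash: string; // hash of the package's root tree node
}

async function resolve(
  refServer: { queryTags(name: string): Promise<TagInfo[]> },
  name: string,
  range: string // e.g. "~0.2.3"
): Promise<TagInfo> {
  const tags = await refServer.queryTags(name); // e.g. 0.2.3, 0.2.4, 0.3.0
  const best = maxSatisfying(tags.map((t) => t.version), range);
  const tag = tags.find((t) => t.version === best);
  if (!tag) throw new Error(`no version of ${name} satisfies ${range}`);
  return tag; // treeHash identifies the exact package content
}
```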

Now to get the package, we need to query the distributed network for that hash. We could ask among peers on the local network over UDP. If none of the local peers have it, we could go down a list of public hash servers and ask them. How exactly we get the data doesn't really matter; we just need to find the first computer that has the desired hash. We then download the tree object and decode it, reading the tree contents. This will contain hashes for blobs and other nested trees. We continue grabbing hashes until we have all the files from that package. Some of these files we will have already seen before, either from other packages or from past versions of this package, and they will be reused from the local cache. Duplicate data never needs to be stored twice. This is one of the main benefits of content-addressable databases.
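
A rough sketch of the recursive fetch, with the network and decoding steps passed in as hypothetical callbacks; the cache check is what lets files shared across packages or versions be downloaded only once:

```ts
type Hash = string;
interface TreeEntry { name: string; hash: Hash; kind: "blob" | "tree" }

async function fetchTree(
  hash: Hash,
  fetchObject: (h: Hash) => Promise<Buffer>, // local peers first, then public hash servers
  decodeTree: (body: Buffer) => TreeEntry[],
  cache: Map<Hash, Buffer>
): Promise<void> {
  if (cache.has(hash)) return; // already seen via another package or version
  const body = await fetchObject(hash);
  cache.set(hash, body);
  for (const entry of decodeTree(body)) {
    if (entry.kind === "tree") {
      await fetchTree(entry.hash, fetchObject, decodeTree, cache);
    } else if (!cache.has(entry.hash)) {
      cache.set(entry.hash, await fetchObject(entry.hash));
    }
  }
}
```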

Now publishing a new package version to the cloud is a little harder. We first need to push copies of the relevant hashes to a few publicly known servers so that others can find it. We can also host them locally for people nearby. Once we've uploaded the data to enough mirrors that we're satisfied others can find it, we authenticate with the central refs database and create the ref for that package version pointing to the hash.
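
The publish flow could be as simple as the sketch below; `pushObjects` and `createRef` are hypothetical names, but the ordering matters: content is uploaded before the ref is written, so a published ref never points at data nobody can find.

```ts
interface Mirror {
  pushObjects(objects: Map<string, Buffer>): Promise<void>;
}
interface RefServer {
  createRef(pkg: string, version: string, hash: string): Promise<void>;
}

async function publish(
  objects: Map<string, Buffer>, // all blobs and trees for the release
  rootHash: string,
  pkg: string,
  version: string,
  mirrors: Mirror[],
  refServer: RefServer // requires authentication in practice
): Promise<void> {
  // Upload the content everywhere first...
  await Promise.all(mirrors.map((m) => m.pushObjects(objects)));
  // ...then make it discoverable under a name.
  await refServer.createRef(pkg, version, rootHash);
}
```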

Notice that the central refs database is actually optional. If you know the ref of the package you want to depend on, you can store that hash directly in your dependency metadata. You won't get semver ranges, but you are sure that the content you depend on today will be the content you depend on tomorrow. Even if the author force-pushes over the existing release ref with a new hash, you're still pointing to the old content directly. You could publish refs for your own packages outside of the central ref server, and client add-ons could learn to look there instead of the central server.
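
For illustration, dependency metadata that pins a hash directly instead of a range (field names and the hash value are made up for this example):

```ts
const dependencies = {
  // resolved through the central ref server, subject to force-pushes:
  foo: { range: "~0.2.3" },
  // pinned to exact content; no ref server needed, immune to force-pushes:
  bar: { hash: "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08" },
};
```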

There are many possible variations of this idea. The main concept is that all data is stored and retrieved one hash at a time from a distributed cloud. Making this fast and efficient is the hardest part of the engineering effort. Once done, it will offload most of the work from the central server that handles the main namespace of public packages.

Also, publishing packages to other places can still reuse the cloud for data transfer.

This system is not efficient for huge files that change frequently, but source code doesn't tend to fit that pattern. It is really nice for tree-based structures containing many small files where most files don't change between revisions.

@creationix (author):

I should mention that, while there are theoretical attacks against sha1 as used by git, no actual collisions have been found. Also, the kind of attack that would allow someone to poison the distributed database with evil code is extremely difficult, since git objects also contain the length in bytes as part of the header. This secondary check makes it considerably more expensive to find a collision and breaks most known attack techniques.

But the more likely issue with using sha1 for a distributed database is the psychological effect it has on people upon hearing that it's an unsafe hash. We could skip all those problems and just switch to one of the sha2 variants like sha512 or sha256.
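
To make the length-header point concrete: git hashes the string "blob <size>\0" plus the content, so a forged object has to collide on both the bytes and the declared length. Node's real `crypto` module can reproduce it, and swapping in sha256 as suggested above is a one-argument change:

```ts
import { createHash } from "node:crypto";

// Compute a git-style blob hash: the digest covers the length-prefixed
// header as well as the content itself.
function gitBlobHash(content: Buffer, algorithm = "sha1"): string {
  const header = Buffer.from(`blob ${content.length}\0`);
  return createHash(algorithm)
    .update(Buffer.concat([header, content]))
    .digest("hex");
}

// Matches `git hash-object` for the same bytes:
console.log(gitBlobHash(Buffer.from("hello world\n")));
// 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

// The sha2 variant is just a different algorithm name:
console.log(gitBlobHash(Buffer.from("hello world\n"), "sha256"));
```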

@creationix (author):

A quick optimization we could add is the ability to transfer a tree by hash, complete with all its children, in a packed format using deflate and inter-object deltas. This is the most common use case anyway. When unpacking the tree there would be some redundant objects, but overall the network overhead would probably be a lot lower.
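
A rough sketch of the packed transfer using Node's real `zlib` module; the inter-object delta step is omitted, so this shows only length-prefixed concatenation plus deflate over the whole stream (compressing across objects is what exploits the redundancy):

```ts
import { deflateSync, inflateSync } from "node:zlib";

// Concatenate length-prefixed objects, then compress the whole stream.
function pack(objects: Buffer[]): Buffer {
  const parts: Buffer[] = [];
  for (const obj of objects) {
    const len = Buffer.alloc(4);
    len.writeUInt32BE(obj.length, 0);
    parts.push(len, obj);
  }
  return deflateSync(Buffer.concat(parts));
}

// Reverse the framing: inflate, then split on the length prefixes.
function unpack(packed: Buffer): Buffer[] {
  const raw = inflateSync(packed);
  const objects: Buffer[] = [];
  let offset = 0;
  while (offset < raw.length) {
    const len = raw.readUInt32BE(offset);
    offset += 4;
    objects.push(raw.subarray(offset, offset + len));
    offset += len;
  }
  return objects;
}
```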

@creationix (author):

We could also add "release" objects that act a lot like commit objects. They would point to the root tree and to the previous release, and include metadata about the release like a changelog and who cut the release.
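
One possible shape for such a release object (field names are hypothetical):

```ts
interface Release {
  tree: string;            // hash of the root tree for this version
  previous: string | null; // hash of the prior release, null for the first
  version: string;         // e.g. "0.2.4"
  changelog: string;       // human-readable notes for this release
  author: string;          // who cut the release
  date: string;            // ISO timestamp
}
```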
