  • Distributed database with content-addressable values (a minimal store sketch follows this list).
  • Node types are blobs and trees.
  • Hashes use a secure hash (not sha1), since distributed p2p systems contain many untrusted nodes. (Or we could leave security to a higher layer and use this only for the hard network-scale problem.)
  • The actual named refs can still live on a central server with user authorization. Making refs secure in a distributed system is considerably harder.
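
A minimal sketch of the content-addressable store idea, assuming sha256 as the secure hash; the `save`/`load` helpers are hypothetical names for illustration, not part of any existing API:

```ts
import { createHash } from "node:crypto";

type Hash = string; // hex digest doubles as the storage key

// The key is derived entirely from the content, so identical blobs or
// trees hash to the same key and are stored only once.
function hashValue(body: Buffer): Hash {
  return createHash("sha256").update(body).digest("hex");
}

const store = new Map<Hash, Buffer>();

function save(body: Buffer): Hash {
  const hash = hashValue(body);
  store.set(hash, body); // re-saving identical bytes is a no-op
  return hash;
}

function load(hash: Hash): Buffer | undefined {
  return store.get(hash);
}
```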

For example, suppose I wanted to download the package foo at version ~0.2.3. My package client would go to the central ref server and query the tags of foo. It would see that foo has versions 0.2.3, 0.2.4, and 0.3.0, and decide that 0.2.4 is the desired version based on the original range query. Along with the refs, the central server also returns the hashes of the root tree nodes for these package versions, so the client now knows which hash contains the desired package.
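
The resolution step might look like the sketch below. The ref server and its `queryTags` method are hypothetical; `maxSatisfying` is the real helper from the npm `semver` package that picks the highest version matching a range:

```ts
import { maxSatisfying } from "semver";

interface TagInfo {
  version: string;  // e.g. "0.2.4"
  treeHash: string; // hash of the package's root tree node
}

async function resolve(
  refServer: { queryTags(name: string): Promise<TagInfo[]> },
  name: string,
  range: string // e.g. "~0.2.3"
): Promise<TagInfo> {
  const tags = await refServer.queryTags(name); // e.g. 0.2.3, 0.2.4, 0.3.0
  const best = maxSatisfying(tags.map((t) => t.version), range);
  const tag = tags.find((t) => t.version === best);
  if (!tag) throw new Error(`no version of ${name} satisfies ${range}`);
  return tag; // treeHash identifies the exact package content
}
```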

Now to get the package, we need to query the distributed network for that hash. We could ask among peers on the local network over UDP. If none of the local peers have it, we could go down a list of public hash servers and ask them. How exactly we get the data doesn't really matter; we just need to find the first computer that has the desired hash. We then download the tree object and decode it, reading the tree contents. This will contain hashes for blobs and other nested trees. We continue grabbing hashes until we have all the files from that package. Some of these files we will have already seen before, either from other packages or from past versions of this package, and they will be reused from the local cache. Duplicate data never needs to be stored twice. This is one of the main benefits of content-addressable databases.
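
A rough sketch of the recursive fetch, with the network and decoding steps passed in as hypothetical callbacks; the cache check is what lets files shared across packages or versions be downloaded only once:

```ts
type Hash = string;
interface TreeEntry { name: string; hash: Hash; kind: "blob" | "tree" }

async function fetchTree(
  hash: Hash,
  fetchObject: (h: Hash) => Promise<Buffer>, // local peers first, then public hash servers
  decodeTree: (body: Buffer) => TreeEntry[],
  cache: Map<Hash, Buffer>
): Promise<void> {
  if (cache.has(hash)) return; // already seen via another package or version
  const body = await fetchObject(hash);
  cache.set(hash, body);
  for (const entry of decodeTree(body)) {
    if (entry.kind === "tree") {
      await fetchTree(entry.hash, fetchObject, decodeTree, cache);
    } else if (!cache.has(entry.hash)) {
      cache.set(entry.hash, await fetchObject(entry.hash));
    }
  }
}
```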

Now publishing a new package version to the cloud is a little harder. We first need to push copies of the relevant hashes to a few publicly known servers so that others can find it. We can also host them locally for people nearby. Once we've uploaded the data to enough mirrors that we're satisfied others can find it, we authenticate with the central refs database and create the ref for that package version pointing to the hash.
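
The publish flow could be as simple as the sketch below; `pushObjects` and `createRef` are hypothetical names, but the ordering matters: content is uploaded before the ref is written, so a published ref never points at data nobody can find.

```ts
interface Mirror {
  pushObjects(objects: Map<string, Buffer>): Promise<void>;
}
interface RefServer {
  createRef(pkg: string, version: string, hash: string): Promise<void>;
}

async function publish(
  objects: Map<string, Buffer>, // all blobs and trees for the release
  rootHash: string,
  pkg: string,
  version: string,
  mirrors: Mirror[],
  refServer: RefServer // requires authentication in practice
): Promise<void> {
  // Upload the content everywhere first...
  await Promise.all(mirrors.map((m) => m.pushObjects(objects)));
  // ...then make it discoverable under a name.
  await refServer.createRef(pkg, version, rootHash);
}
```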

Notice that the central refs database is actually optional. If you know the ref of the package you want to depend on, you can store that hash directly in your dependency metadata. You won't get semver ranges, but you are sure that the content you depend on today will be the content you depend on tomorrow. Even if the author force-pushes over the existing release ref with a new hash, you're still pointing to the old content directly. You could publish refs for your own packages outside of the central ref server, and client add-ons could learn to look there instead of the central server.
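
For illustration, dependency metadata that pins a hash directly instead of a range (field names and the hash value are made up for this example):

```ts
const dependencies = {
  // resolved through the central ref server, subject to force-pushes:
  foo: { range: "~0.2.3" },
  // pinned to exact content; no ref server needed, immune to force-pushes:
  bar: { hash: "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08" },
};
```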

There are many possible variations of this idea. The main concept is that all data is stored and retrieved one hash at a time from a distributed cloud. Making this fast and efficient is the hardest part of the engineering effort. Once done, it will offload most of the work from the central server that handles the main namespace of public packages.

Also, publishing packages to other places can still reuse the cloud for data transfer.

This system is not efficient for huge files that change frequently, but source code doesn't tend to fit that pattern. It is really nice for tree-based structures containing many small files where most files don't change between revisions.

@creationix (author):

I should mention that, while there are theoretical attacks against sha1 as used by git, no actual collisions have been found. Also, the kind of attack that would allow someone to poison the distributed database with evil code is extremely difficult, since git objects also contain the length in bytes as part of the header. This secondary check makes it considerably more expensive to find a collision and breaks most known attack techniques.

But the more likely issue with using sha1 for a distributed database is the psychological effect it has on people upon hearing that it's an unsafe hash. We could skip all those problems and just switch to one of the sha2 variants like sha512 or sha256.
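
To make the length-header point concrete: git hashes the string "blob <size>\0" plus the content, so a forged object has to collide on both the bytes and the declared length. Node's real `crypto` module can reproduce it, and swapping in sha256 as suggested above is a one-argument change:

```ts
import { createHash } from "node:crypto";

// Compute a git-style blob hash: the digest covers the length-prefixed
// header as well as the content itself.
function gitBlobHash(content: Buffer, algorithm = "sha1"): string {
  const header = Buffer.from(`blob ${content.length}\0`);
  return createHash(algorithm)
    .update(Buffer.concat([header, content]))
    .digest("hex");
}

// Matches `git hash-object` for the same bytes:
console.log(gitBlobHash(Buffer.from("hello world\n")));
// 3b18e512dba79e4c8300dd08aeb37f8e728b8dad

// The sha2 variant is just a different algorithm name:
console.log(gitBlobHash(Buffer.from("hello world\n"), "sha256"));
```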

@creationix (author):

A quick optimization we could add is the ability to transfer a tree by hash, complete with all its children, in a packed format using deflate and inter-object deltas. This is the most common use case anyway. When unpacking the tree there would be some redundant objects, but overall the network overhead would probably be a lot lower.
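
A rough sketch of the packed transfer using Node's real `zlib` module; the inter-object delta step is omitted, so this shows only length-prefixed concatenation plus deflate over the whole stream (compressing across objects is what exploits the redundancy):

```ts
import { deflateSync, inflateSync } from "node:zlib";

// Concatenate length-prefixed objects, then compress the whole stream.
function pack(objects: Buffer[]): Buffer {
  const parts: Buffer[] = [];
  for (const obj of objects) {
    const len = Buffer.alloc(4);
    len.writeUInt32BE(obj.length, 0);
    parts.push(len, obj);
  }
  return deflateSync(Buffer.concat(parts));
}

// Reverse the framing: inflate, then split on the length prefixes.
function unpack(packed: Buffer): Buffer[] {
  const raw = inflateSync(packed);
  const objects: Buffer[] = [];
  let offset = 0;
  while (offset < raw.length) {
    const len = raw.readUInt32BE(offset);
    offset += 4;
    objects.push(raw.subarray(offset, offset + len));
    offset += len;
  }
  return objects;
}
```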

@creationix (author):

We could also add "release" objects that act a lot like commit objects. They would point to the root tree and to the previous release, and include metadata about the release like a changelog and who cut the release.
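
One possible shape for such a release object (field names are hypothetical):

```ts
interface Release {
  tree: string;            // hash of the root tree for this version
  previous: string | null; // hash of the prior release, null for the first
  version: string;         // e.g. "0.2.4"
  changelog: string;       // human-readable notes for this release
  author: string;          // who cut the release
  date: string;            // ISO timestamp
}
```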
