Package distribution

In June 2019, the Entropic project was announced. It aims to provide a federated package distribution system for arbitrary packages, focussing initially on JavaScript. Its storage model uses some ideas from Git, and since working heavily on Git throughout 2018 I have been thinking about how its ideas could transfer to package management.

This document outlines the ideas I've been thinking about in the hopes that they're useful to the development and direction of Entropic. I expect that the project founders will already be aware of some of this, and I'm also aware that some of it may not be applicable or may be out of scope.

Motivation

Since learning about how Git transfers content over the network, it became clear to me that most existing package managers could make big improvements to their current approach to two problems: resolving dependencies, and transferring package content. Using storage models from Git, we could significantly reduce the size of packages on the wire and on disk. Copying Git's protocol for fetching commits could also be used to reduce the latency of dependency resolvers.

Adopting these ideas could improve the install time UX for developers with large dependency trees, and users on slow or unreliable network connections.

I'm also interested in the mirroring story for package repositories. I'm an attendee of /dev/fort, which involves a group of devs and designers going offline for a week to build something together. We try to mirror various software repos -- apt, cpan, pypi, rubygems, npm -- and some are a lot more amenable to mirroring than others. Barriers to mirroring include a lack of stable tools, ad-hoc protocols, and the sheer size of some repositories. I am interested in separating the problem of package distribution from package installation to reduce the burden for this kind of work.

Background

The following proposals make use of several ideas from Git. Here we'll describe those elements of Git that are necessary to understand them.

Data model

Git's data model includes four principal types of object:

  • blob: a buffer of arbitrary data; essentially, the contents of a file
  • tree: a non-empty set of uniquely named pointers to blobs and other trees
  • commit: a pointer to a tree, zero or more pointers to commits, plus metadata
  • tag: a pointer to a commit, plus metadata

Every object has a unique ID derived from a cryptographic hash of its content, and is therefore immutable (changing the content would change the ID). Git uses SHA-1 but is working on migrating to a newer hash function.
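The content-addressed ID scheme can be sketched as follows. This is illustrative only, though the `"<type> <size>\0"` framing does mirror how Git hashes objects:

```python
import hashlib

def object_id(kind: str, data: bytes) -> str:
    """Derive an object's ID from a cryptographic hash of its content.

    Mimics Git's framing: a "<type> <size>\\0" header followed by the
    raw bytes, hashed with SHA-1.
    """
    header = f"{kind} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()

# The same content always yields the same ID...
a = object_id("blob", b"hello\n")
b = object_id("blob", b"hello\n")
assert a == b

# ...and any change to the content changes the ID, which is why
# objects are effectively immutable.
c = object_id("blob", b"hello!\n")
assert a != c
```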

Commits form a directed acyclic graph (DAG) that describes the causal history of the project. We'll render such graphs as dots connected by lines, where the graph's edges always point from right to left, for example:

                  o---o
                 /
o---o---o---o---o-------o---o
     \       \             /
      o---o---o---o---o---o---o
               \     /
                o---o

Any lines leading to the left from a commit identify that commit's parents.

Every commit points to a tree that contains the complete project state at that commit. These trees are Merkle trees, a persistent data structure such that content that is unchanged from one commit to the next is not duplicated on disk; multiple commits can share the same version of a file by pointing at the same on-disk blob object, rather than by copying it.

For example, the tree of one commit may contain pointers to two files:

f8d58297 hello.txt
5367fa75 world.txt

If the subsequent commit only modifies hello.txt, then its tree will have a different pointer for hello.txt but use the same one for world.txt:

a2e3af82 hello.txt
5367fa75 world.txt

Thus both trees share the world.txt blob and it is not duplicated on disk. This idea extends to trees, so that entire unchanged directories are shared between commits.
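The sharing described above can be sketched with a toy content-addressed store (the helper names are invented for illustration):

```python
import hashlib

store = {}  # object ID -> content; each distinct object stored once

def obj_id(data: bytes) -> str:
    # Content-addressed ID; a stand-in for Git's object hashing.
    return hashlib.sha1(data).hexdigest()[:8]

def put_blob(data: bytes) -> str:
    oid = obj_id(data)
    store[oid] = data  # writing identical content twice is a no-op
    return oid

# First commit's tree: pointers to two files.
t1 = {"hello.txt": put_blob(b"hello v1"), "world.txt": put_blob(b"world")}

# Second commit modifies only hello.txt; world.txt resolves to the
# same blob ID, so the blob is shared rather than duplicated.
t2 = {"hello.txt": put_blob(b"hello v2"), "world.txt": put_blob(b"world")}

assert t1["world.txt"] == t2["world.txt"]  # shared pointer
assert len(store) == 3                     # v1, v2, world -- not four blobs
```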

Object storage

Git stores these objects in one of two ways. Its "loose" format stores each object in its own file, with its data compressed using zlib. This is highly space-inefficient for small text files as each file ends up well under the minimum filesystem block allocation size (e.g. 4 KiB on macOS).

The other storage format is the "packed" format in which a set of objects are all stored in the same file; these files use the .pack extension and essentially contain a set of objects concatenated end-to-end. This format is used both to store objects on disk and to send them over the network. When on disk, an accompanying .idx file contains an index recording the byte offset in the .pack at which each object begins; this allows fast random access into a file that would otherwise need to be read from the beginning to locate any of its contents.

Each object in a pack is still compressed using zlib, but an additional compression method is employed. If two objects are very similar, as is the case when commits make small edits to existing files, then Git will designate one of the pair as the "base" and represent the other as a binary diff against the base object. Objects so-compressed are called "deltas" and are typically over 90% smaller than the original object.

For example, given this pair of strings:

source: the quick brown fox jumps over the slow lazy dog
target: a swift auburn fox jumps over three dormant hounds

It is possible to represent the target string as the following series of instructions for rebuilding it from the source string:

insert "a swift aubur"
copy   14 19
insert "ree dormant hounds"

For files that have long substrings in common, the delta replaces those common substrings with "copy" instructions, which identify the offset and length of a common string and are only a few bytes in size.
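Applying such a delta can be sketched as below, using the example strings above. This is a toy rendering of the idea, not Git's actual binary delta format:

```python
def apply_delta(source: str, ops) -> str:
    """Rebuild the target from the source using insert/copy instructions.

    "insert" carries literal data; "copy" references an (offset, length)
    range of the source string.
    """
    out = []
    for op in ops:
        if op[0] == "insert":
            out.append(op[1])
        elif op[0] == "copy":
            _, offset, length = op
            out.append(source[offset:offset + length])
    return "".join(out)

source = "the quick brown fox jumps over the slow lazy dog"
delta = [
    ("insert", "a swift aubur"),
    ("copy", 14, 19),              # copies "n fox jumps over th"
    ("insert", "ree dormant hounds"),
]

assert apply_delta(source, delta) == \
    "a swift auburn fox jumps over three dormant hounds"
```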

This compression method greatly reduces the total size of a set of objects both on disk and over the network, at the cost of making reading these objects more expensive. To read a delta object, we must first read its base object, and then apply the delta to that base to reconstruct the original. That base object may itself be stored as a delta.

To limit this increased read expense, Git imposes a limit on the length of such delta chains when they are created, and only stores an object as a delta if it reduces its size by at least 50%. Delta compression typically compresses an object a lot (> 90%) or very little (< 10%), so in practice this threshold means Git retains deltas that produce very large reductions in size.

Network transfers

The fetch and push commands transfer objects over the network; they essentially copy portions of the commit graph between repositories. For example, say that Alice's repository contains the following commit graph:

      F   G
      o---o
     /     \
o---o---o---o---o
A   B  C \  H   J
          \
           o---o
           D   E

Bob's repository contains a subset of this graph:

o---o---o
A   B  C \
          \
           o---o
           D   E

When Bob fetches from Alice, the Git protocol first tells Bob that Alice has commits E and J -- the tips of the branches in that graph. Bob has E so he ignores that, but he does not have J. Therefore he sends a message to Alice saying he wants J but already has E; Bob saying he has commit E implies that he has all commits that are ancestors of E. From this information, Alice performs a local graph search to determine Bob must be missing commits F, G, H and J, and sends only those commits to Bob. The commits and any new trees and blobs they introduce are rolled up into a pack, which Bob streams directly to disk. Delta compression is performed on this pack as Alice generates it, but as this is a slow process it may reuse existing delta objects from her local pack files to save time.

This protocol saves time by making the "server" an active decision-maker rather than a static content repository. If Bob had to fetch just J, and then see that he did not have its parent H, and so fetch that, and so on, the protocol would require a lot more round-trips. When you consider that these commits point to graphs of tree and blob objects, and the server can quickly determine which objects the client is missing and roll all of them into a single pack, this greatly reduces the time that the protocol spends on network round-trips. It also requires the client to send very little data to the server to complete its request.
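Alice's side of the negotiation is a reachability search over the commit graph. A minimal sketch, using the example graph above (the commit labels follow the diagram; "edges" anticipates the thin-pack discussion below):

```python
# Each commit maps to its parents, per the Alice/Bob example graph.
parents = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["B"], "G": ["F"], "H": ["C", "G"], "J": ["H"],
}

def ancestors(commits):
    """All commits reachable from the given tips, including the tips."""
    seen, stack = set(), list(commits)
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def missing(want, have):
    # The server-side search: everything reachable from what the client
    # wants, minus everything reachable from what it already has.
    return ancestors(want) - ancestors(have)

def edges(miss):
    # "Edge" commits: parents of the missing range that the client
    # already has; their objects can serve as thin-pack delta bases.
    return {p for c in miss for p in parents[c]} - miss

m = missing({"J"}, {"E"})
assert m == {"F", "G", "H", "J"}
assert edges(m) == {"B", "C"}
```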

Thin packs

To save even more network bandwidth, Git can use what it calls "thin" packs. Packs stored on disk must be self-contained; any object that's used as a base for a delta must appear in the same pack as the delta. But when sending data over the network, the sender can use knowledge of what the receiver already has to avoid sending any complete base objects.

Above, Alice knows Bob has commits A, B, C, D and E, and so can use any content from them as the base for a delta in the pack she sends, without including those base objects in the pack. (In practice only B and C will be considered as they are direct parents of the commits being sent; Git calls these the "edge" of the selected range.) When Bob receives this pack, he must append any missing base objects from his own database so that the pack becomes self-contained.

Pack layout

In pack files, objects are either stored as self-contained full blob/tree/commit/tag objects, or as deltas -- a pointer to a base object, plus a binary diff that regenerates the original object from the base. The method Git uses to select objects as delta pairs, and the order in which it stores them, are informed by the likely access patterns when data is accessed on disk and sent over the network.

Storing an object as a delta saves a lot of space but makes it more expensive to read. Rather than reading an object's data straight off disk, Git must first load its base object (which may itself be a delta), and then apply the binary diff to regenerate the original object. Git limits the length of delta chains to prevent this becoming too expensive, but it also sorts objects in a particular way to reduce the read-time cost.

When working locally, Git commands more frequently access "recent" objects than older ones -- e.g. the status and diff commands compare the workspace to the current HEAD, the log command lists recent commits first. "Recent" here strictly means closely connected to the tips of branches, and in some situations Git will sort objects according to the history graph.

However, the pack compressor, in order to find pairs of objects that will form good delta pairs, sorts them by size. This is primarily because objects that are most similar to each other will tend to have similar sizes, but it's also because size is a good proxy for age: files tend to get larger over time.

Git sorts the objects largest-first, so that deltas usually have a larger object as their base. This means that recent objects tend to be stored literally, not as deltas, making them cheap to access. It also means that deltas tend to have very little inline data of their own -- they mostly consist of pointers to chunks of content from a larger object, and are extremely small as a result.

Finally, Git also orders objects in packs such that recent objects (in the graph connection sense, not by size or clock time) appear near the start of the file, making exceptions for objects that are bases that need to appear before deltas based on them. This further improves access times for recent objects on hardware where seeks are expensive.

Proposals

What follows is a list of ways in which we could use the above ideas to improve package distribution. I'm particularly interested in separating package distribution from package installation; transferring files between machines should be language-agnostic and separable from whatever it is an installer, compiler or other build tool wishes to do with those files.

That includes concerns like where files are stored and how they are laid out on disk; the package storage and distribution infrastructure should be free to choose how it stores packages, and an installer should be able to request file content without being concerned with the details of the storage design.

Within reason, package installers should be able to work offline. If an application's dependencies can be satisfied by packages that are already cached locally, and the user cannot connect to a registry, then the installer should use the cached content. See cargo --offline for an example. This means the distribution infrastructure should store enough information about packages locally to determine what they depend on, and to fully reconstruct their content.

Storage of content must be concurrency-safe. It is normal that multiple instances of a package installer run concurrently on a single machine, especially for build and CI tasks. If we reuse distribution infra for multiple installers, this will become more common. Concurrent requests to the distribution infrastructure that result in additions to locally cached content should be serialisable -- they should have the same result as if all requests were made sequentially. It may be acceptable for the physical representation of the content to differ, but not its logical interpretation. For example, the order requests are made in may change how the registry batches up content into different packs, but those packs viewed in total must represent the same logical set of objects once all requests are complete.
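One way to get this serialisability property is content-addressed storage with atomic renames, a sketch of which follows. The store layout and function name are invented for illustration; the key point is that two concurrent writers of the same object derive the same filename, so the logical result is identical whichever wins:

```python
import hashlib
import os
import tempfile

def write_object(store_dir: str, data: bytes) -> str:
    """Add an object to a local store safely under concurrent writers.

    Write to a temporary file, then atomically rename it into place at
    a content-addressed path. os.replace is atomic on POSIX
    filesystems, so readers never observe a partially written object.
    """
    oid = hashlib.sha1(data).hexdigest()
    path = os.path.join(store_dir, oid)
    if os.path.exists(path):
        return oid                       # already stored; nothing to do
    fd, tmp = tempfile.mkstemp(dir=store_dir)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.replace(tmp, path)                # atomic publish
    return oid

store = tempfile.mkdtemp()
a = write_object(store, b"some object")
b = write_object(store, b"some object")  # duplicate write is a no-op
assert a == b
assert len(os.listdir(store)) == 1
```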

Terminology

I will be using the following definitions:

  • package: a conceptual namespace that refers to a library and its codebase across all its releases, for example, react is a package
  • release: the content comprising a specific version of a package, that is, some particular tree of files, plus some metadata e.g. a package.json file; this tree and metadata is a logical structure, not a physical one, and is independent of the manner in which the release is stored on disk or sent over a network
  • version: a name that refers to a specific release, usually made up of a package name and a version number, e.g. react@2.1.3
  • dependency: an expression denoting a set of versions, usually made up of a package name and a semver expression that identifies a set of versions of that package; in package.json this might be expressed as "react": ">=2.1.3 <3.0.0"

I have deliberately kept the term dependency high-level here. Most package managers overload dependencies with other kinds of properties, for example:

  • optional: a dependency is not strictly required for the application to build
  • situational: a dependency is required only in some environment, for example to run the application's tests
  • relational: this describes a dependency's position relative to another dependency in a materialised tree of releases in systems where install layout matters, for example npm's peerDependencies

The last of these is a concern of the package installer, not necessarily the distributor. The others are less clear to me at this time. From the point of view of trying to separate distribution from installation, it will be clear below that some dependency information is necessary for distribution.

The simplest solution I can imagine is that a release has a "default" dependency set, and possibly other named sets where those names have meaning to installers. For example an installer requests the default and test dependency sets, and that information is given to the dependency resolver to produce the full set of required versions.
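A sketch of what such metadata and its use might look like follows. The schema and the set name "test" are assumptions for illustration, not an existing Entropic format:

```python
# Hypothetical release metadata with a "default" dependency set plus
# other named sets that installers can opt into.
release = {
    "name": "foo",
    "version": "2.2.0",
    "dependencies": {
        "default": {"bar": ">=1.0.0 <2.0.0"},
        "test":    {"spec-runner": ">=3.1.0"},
    },
}

def requested_dependencies(release, sets=("default",)):
    """Merge the dependency sets an installer asked for, ready to be
    handed to a dependency resolver."""
    merged = {}
    for name in sets:
        merged.update(release["dependencies"].get(name, {}))
    return merged

assert requested_dependencies(release) == {"bar": ">=1.0.0 <2.0.0"}
assert "spec-runner" in requested_dependencies(release, ("default", "test"))
```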

Storage and bandwidth

We can improve the spatial efficiency of distributing packages, both on disk and over the network. Most package managers store each release of a package as a physically distinct artefact, for example each release is stored as a self-contained tarball or other compressed archive format.

Releases of the same package will tend to have a lot of content in common -- they are snapshots of the state of a single codebase, usually. Especially for patch releases, many of the files in a package may be identical from release to release, while others may change only slightly. This mirrors the storage characteristics of version control, where most content does not change from commit to commit. This means it's amenable to the use of Merkle trees and delta compression to save space on disk and on the wire.

Within a single package, we can in principle realise substantial space savings by moving to a Merkle-tree form of storage. This would de-duplicate files that do not change their content across multiple releases. I believe Entropic is already using such a file-centric storage model, rather than a release-centric one. In this model, a release becomes a pointer to a particular tree, plus some metadata, analogous to a Git commit.

However, releases are not connected in a graph like commits are; a package's releases are an unordered set, even though the versions that refer to the releases may form a partial order. No particular causal relationship is assumed between releases.

Merkle trees and delta compression provide different benefits, and do so in different circumstances. Making best use of them requires consideration of how releases are stored on disk and how they are sent over the network.

Merkle trees

Merkle trees let us de-duplicate files and directories that are unchanged between releases. Whenever two or more releases are stored together, a Merkle tree lets them share storage space. This is true regardless of how the releases are physically serialised to disk.

It is also true if the set of stored releases is updated incrementally. If we are sent a release that contains a file that exists in the releases we already have, then we don't need to store a copy of it. This means that even if each release is sent in its entirety over the network, any files or directories they have in common will be stored only once on disk, rather than multiple times as they would be under a tarball storage model.

On the network, we consider two distinct situations: two releases of the same package being sent over the network together, and a single release being sent to a receiver that already has another release of the same package.

When two or more releases are sent together, the sender can analyse their content and generate a set of the unique files/trees across all the releases, and thus send each object only once. As the receiver should be storing each distinct object only once, this serialisation should be easy to map to disk storage.

When sending releases to a receiver that already has some releases of the same package, it may be possible to use knowledge of what the sender already has to avoid sending redundant objects. In the Git example above, say a file exists in commit B and remains unchanged in F, G, H and J. Since Alice knows Bob has commit B, she does not need to send this file, even though it appears in the commits she's sending, because Bob already has it. Git uses this technique to reduce the number of objects sent; any objects that exist in the "edge" commits are not sent over the network.

In Git, commits are connected in a graph. The receiver requests the commits they want and states some commits they already have, and the sender works out which commits in the graph the receiver is missing. The sender generates a unique set of files and trees for these commits, and any objects present in "edge" commits are omitted. In this workflow, the inputs are a set of commits, and the output is a range of commits between the inputs, plus their associated content.

Releases are not connected in this way, and the data a receiver is requesting is often different. In fact there are two kinds of client to consider.

Say that a registry exists that contains a package foo, with versions 1 through 6. One kind of client is a replicator: another registry that wants to mirror the complete contents of the source registry. If a replicator already has foo@1, foo@2 and foo@4, it likely wants all the releases it's missing, and so its request would take the form:

fetch foo
want: *
have: 1, 2, 4

Because releases are not connected in a graph, the client must enumerate all the releases it has if it wants to minimise the data the source sends it; possession of foo@4 does not imply possession of any other release so the sender should not assume the receiver has any release it does not explicitly mention.

The sender responds to this request by enumerating any releases the receiver does not have: 3, 5 and 6. It generates their unique content objects, omitting any objects present in releases the receiver does have, and sends the resulting set.
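The sender's computation can be sketched like this; the registry contents and object IDs are invented for illustration:

```python
# Toy registry: each release of "foo" maps to its set of content
# object IDs (in reality, the IDs reachable from its root tree).
releases = {
    1: {"obj-a", "obj-b"},
    2: {"obj-a", "obj-c"},
    3: {"obj-a", "obj-d"},
    4: {"obj-e"},
    5: {"obj-d", "obj-f"},
    6: {"obj-f", "obj-g"},
}

def serve_fetch(want, have):
    """Compute what to send for a fetch request.

    want="*" (a replicator) selects every release the receiver lacks;
    otherwise only the explicitly wanted versions are selected. Objects
    already present in any "have" release are omitted.
    """
    missing = set(releases) - set(have) if want == "*" else set(want) - set(have)
    held = set().union(*(releases[v] for v in have)) if have else set()
    objects = set().union(*(releases[v] for v in missing)) - held
    return missing, objects

missing, objects = serve_fetch("*", {1, 2, 4})
assert missing == {3, 5, 6}
assert objects == {"obj-d", "obj-f", "obj-g"}  # obj-a is held, so omitted
```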

The other kind of client is an installer: it wants a specific release of a package in order to meet a dependency (direct or transitive) of an application. So its request can still include any versions it already has, but will mention a specific version it wants:

fetch foo
want: 5
have: 1, 2, 4

In this case the registry should send only version 5, but it can use the fact that the client has 1, 2 and 4 to avoid sending redundant objects.

However, both these cases introduce a performance problem: the registry will use the list of have versions to avoid sending objects the receiver already has. The cost of doing this scales with the number of have versions the client sends, which will likely monotonically increase with time. It may be possible to avoid scanning each release completely by stopping at tree IDs that have already been visited, but to a first approximation the cost scales with the number of have versions.

In Git this is not the case: only "edge" commits are considered when removing redundant objects. In our example above:

      F   G
      o---o
     /     \
o---o---o---o---o
A   B  C \  H   J
          \
           o---o
           D   E

Bob requests commit J saying he already has E, so the missing range is the commits F, G, H, J. The "edge" commits are any immediate parents of those commits that are not selected, that is, B and C.

When Bob sends his request, he does not need to mention every commit he has; just mentioning he has E implies he has A, B, C and D. The set of edge commits does not scale either with the number of commits Bob says he has, or with the size of the range he's missing. It scales with how much the graph forks and merges, not how long the selected chain is. That means the cost of removing redundant objects does not necessarily increase with time.

What does this mean for our package manager? One answer could be that the contents of releases, and the mapping of versions to releases, are immutable, and so all computations over sets of releases are eminently cacheable. However, caching may be of limited use here, because the space of possible requests is combinatorial: any combination of versions a client may want and may already have is a distinct input.

Another possible solution is to observe that removal of redundant objects from network transfers is a performance optimisation, not a correctness requirement, and so it may be acceptable to receive some objects we already have. So, clients may be required to limit the number of have versions they send. Some strategy would be needed to select a subset of versions the client has that are most likely to minimise the size of the download. For example, two patch releases are likely to have a lot of objects in common and so their marginal information content is low; whereas, major versions will differ more and so cover off more objects.

Or, in the case of installer clients, we could prioritise versions that are close to the requested version, as they will have the most content in common and therefore have the greatest chance of shrinking the download size.
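That heuristic could be as simple as the following sketch, which assumes plain integer versions and treats numeric closeness as a proxy for shared content:

```python
def choose_haves(have_versions, want, limit=3):
    """Pick a bounded subset of "have" versions to mention in a request.

    Heuristic: versions closest to the wanted version are most likely
    to share content with it, so mention those first and drop the rest.
    """
    return sorted(have_versions, key=lambda v: abs(v - want))[:limit]

# Wanting version 5, the client mentions only its two closest versions.
assert choose_haves([1, 2, 4, 9, 12], want=5, limit=2) == [4, 2]
```

A real implementation would need an ordering over semver versions, and might weight major-version boundaries differently, as suggested above.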

Delta compression

Delta compression is somewhat harder to exploit. Merkle trees, with the implied identification of objects by their contents, means objects are trivially deduplicated across releases by comparing their IDs. Exploiting delta compression requires some analysis of objects' content to identify similar objects and then replace the originals with deltas.

Registries and replicators may be interested in delta compression of data at rest depending on how much data they need to store. Something the size of npm would certainly benefit from this, while other registries may be just fine storing all objects literally. Installer clients will typically only ever download a tiny fraction of the total registry contents and so delta compression of data at rest is less of a concern here.

Delta compression is most attractive if it can reduce bandwidth and latency on the network. This implies that either multiple releases are being sent together and so their similar objects can be compressed, or that a registry is sending objects to a client when it knows which objects the client already has. In the latter case it can use an approach similar to Git's "thin packs" to delta-compress transferred objects against those that the receiver already has, and that are not being sent in this transfer. Whether the receiver expands these deltas or retains them as-is is a separate question that we'd explore as part of the storage design for registries and installers.

One major difference here is that I don't think a package distributor needs to prioritise cheap reads for recent objects as Git does. Git's delta compressor tends to store recent objects literally, and older objects as deltas. That's because it wants commands that act on recent data to feel instantaneous, so adding read latency would make a big difference. In package management, we are usually downloading a newer release of a package when we may already have some older releases. We'd like to send this newer release over the network as a delta against something the receiver already has. It would be fine to store this new release as a delta, as the time spent reading files will usually be a small fraction of the time an installer spends doing its work. The added latency of resolving a delta will be barely noticeable in many installers. However, the distributor should also be free to renormalise the content it has stored, and this operation should be opaque to the installer.

The main problem with delta compression is that it's computationally expensive. Identifying exact duplicate objects is very cheap as it just requires spotting identical IDs, whereas delta compression requires processing of the object contents, and possibly trying out multiple combinations of objects into delta pairs to find the best results. (Git uses a fixed-size sliding window over a sorted list of objects and looks for good delta results between nearby objects, so that this process runs in linear rather than quadratic time.)

Again, caching may provide a solution here, especially if we want to be CDN-friendly. Even with edge computing, we only have very limited time to perform on-the-fly work, and this likely rules out on-the-fly delta compression. However, it may be possible at publish or download time to identify common upgrade paths and pre-generate deltas for those releases. For example, if many people are running foo@3 and foo@4 is released, then pre-generating a delta from 3 to 4 would shrink the download for a lot of users. Such deltas could also be used when sending bundles of releases of the same package between registries -- a replicator could store the deltas it was sent and use them to service installer requests.

For example, if an installer sends the request:

fetch foo
want: 5
have: 1, 2, 4

Then the registry could search its cache for a delta from 1 to 5, from 2 to 5, or from 4 to 5, send that delta if it exists, and send complete objects otherwise. This path may be indirect, for example the registry may have deltas from 2 to 3, and from 3 to 5, but sending both of those may still be preferable to sending a complete copy of 5.
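The search for an indirect delta path is a shortest-path problem over the cache. A breadth-first sketch, with an invented cache layout keyed by (base, target) version pairs:

```python
from collections import deque

def delta_path(deltas, have, want):
    """Find the shortest chain of cached deltas from any version the
    client has to the wanted version, or None if no chain exists (in
    which case complete objects must be sent)."""
    graph = {}
    for base, target in deltas:
        graph.setdefault(base, []).append(target)
    queue = deque((v, []) for v in have)
    seen = set(have)
    while queue:
        version, path = queue.popleft()
        if version == want:
            return path
        for nxt in graph.get(version, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(version, nxt)]))
    return None

cached = {(2, 3), (3, 5), (4, 6)}
# No single cached delta reaches 5 from {1, 2, 4}, but 2 -> 3 -> 5 does.
assert delta_path(cached, have={1, 2, 4}, want=5) == [(2, 3), (3, 5)]
assert delta_path(cached, have={1}, want=5) is None
```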

Dependencies and latency

All the above deals with trying to reduce the volume of data stored and transferred, which is a major contribution to the cost of operating a registry and a major source of latency for end users. Another source of end-user latency is round trips performed while expanding dependencies.

Before it can get to requesting releases of packages, an installer needs to determine which releases it even needs. To do this, it needs to take a set of dependencies and convert it into a set of exact versions for releases it wants to download. That is, given a set of the immediate dependencies of an application and their acceptable version ranges, it must produce a set of exact versions that meets those dependencies, including any transitive dependencies -- we'll call this set of versions the install set.

This set of versions may include multiple versions of the same package (cf. npm). Some installers will not support running multiple releases of the same package in the same project, and so also need to constraint-solve across the complete install set to find versions that are mutually compatible with everything else in the install set. Having done this, an installer will often write the install set to a lockfile, such as package-lock.json, Gemfile.lock or Cargo.lock.

Most package managers do this by fetching the metadata for each direct dependency. For example if our app depends on react, the package manager will download the metadata for react (either the package or a specific version thereof), and find out what react depends on. It does this recursively until it's discovered the full set of transitive dependencies. This process may cache package metadata locally, but it remains the responsibility of the client to walk the dependency graph and generate the install set.

As we saw above, Git's fetch command does not do this. Bob does not download commit J, discover its parent is H, download H, discover its parents as G and C, download G, etc. Instead he sends a message to Alice saying "I want J and I already have E", and Alice performs the graph search locally to determine what to send.

For end users with slow or unreliable connections, these round trips are extremely expensive, especially for large dependency trees as we tend to see in npm. Even though the amount of data transferred is small compared to the package contents, the request latency compounds with the depth of the dependency tree, and constraint-solving may require fetching metadata for many versions of the same package.

Existing approaches focus on making these metadata requests cheaper -- separating metadata from package content, bundling this metadata into indexes by package or across the registry, aggressive client-side caching. Another approach could be to move the problem away from the client altogether, and have the registry compute the install set.

In the simplest design, a client would send its list of dependencies to a registry, and the registry would send back the install set, and possibly the content for the releases identified in the install set.

A possible optimisation to this would be that the client also sends a list of versions it already has access to -- if it already has some releases that would satisfy the dependencies, it could use those even if a more recent release exists, and thereby skip downloading that package. However, as the client does not know which packages will end up being identified in the install set, it may have to send its complete list of installed versions across all packages, which would make the request very large. This could be fixed by having the registry send a set of acceptable versions for each package, rather than exact versions, when it computes the install set. The client could then decide whether to use a version it already has, or request a new one from the registry.

Proposed protocol flow

I suspect that the simplest design in terms of separation of concerns is that the client works as follows:

  • send its dependencies to a registry, which returns an install set -- a list of versions the client should fetch; these versions could point to content in multiple registries
  • group this install set by the registry each version is stored in
  • send a request to each registry stating which versions it wants, and which versions of the relevant packages it already has; the registry then streams the requested content, omitting redundant objects, and applying delta compression where possible within each package
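The client's side of this flow can be sketched as follows; the request shape is a hypothetical wire format, not a defined protocol:

```python
def plan_fetches(install_set, local_versions):
    """Group a resolved install set by registry and attach local haves.

    install_set: list of (registry, package, version) from the resolver.
    local_versions: {(registry, package): [versions already on disk]}.
    Returns one request per registry.
    """
    requests = {}
    for registry, package, version in install_set:
        req = requests.setdefault(registry, {"want": [], "have": []})
        req["want"].append(f"{package}@{version}")
        for v in local_versions.get((registry, package), []):
            req["have"].append(f"{package}@{v}")
    return requests

# The resolver returned versions spread across two registries, and we
# already have an older release of bar locally.
install_set = [
    ("npm.example.com", "foo", "2.2.0"),
    ("npm.example.com", "bar", "1.0.0"),
    ("entropic.example.com", "qux", "0.0.4"),
]
local = {("npm.example.com", "bar"): ["0.9.0"]}

reqs = plan_fetches(install_set, local)
assert reqs["npm.example.com"] == {"want": ["foo@2.2.0", "bar@1.0.0"],
                                   "have": ["bar@0.9.0"]}
assert reqs["entropic.example.com"] == {"want": ["qux@0.0.4"], "have": []}
```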

A nice property of this is that it separates the roles of the content store and the metadata store -- the package content and dependency info could be stored in different systems.

For example, say we have an app that depends on foo >= 2.1.3. To get all its dependencies, we make a request to resolver.example.com with this list of dependencies. The resolver does its work and sends us back this list of versions:

npm.example.com/foo@2.2.0
npm.example.com/bar@1.0.0
entropic.example.com/qux@0.0.4

We'd then make two requests, one to npm.example.com and one to entropic.example.com. To each we'd send the versions we want, and a list of any versions we already have. For example if we already have bar@0.9.0:

To: npm.example.com
want: foo@2.2.0, bar@1.0.0
have: bar@0.9.0

npm.example.com uses this to construct the smallest pack it can -- omitting objects that exist in bar@0.9.0, and applying delta compression of bar@1.0.0 against the objects in bar@0.9.0. (If multiple releases of bar were being sent it could also apply delta compression among them.)

To entropic.example.com we send a request for qux@0.0.4, which is sent to us in its entirety as we don't have any releases of qux stored locally.

This shows how the dependency resolver, and content stores for each package, can be nicely separated while minimising the number of connections and round-trips to each server.

In Git, we saw that Alice performing the graph search locally sped up Bob's request for the missing commits. But this does not necessarily have to happen locally, it just needs to happen with lower latency than an end user will experience. This means the resolver could be an entirely separate service from the source registries that contain package metadata, as long as it has a low-latency connection to them. In fact this may be an entirely separate service from registries per se -- just moving dependency resolution into some cloud service rather than running on end user machines would improve their experience significantly.
