Package distribution

In June 2019, the Entropic project was announced. It aims to provide a federated package distribution system for arbitrary packages, focussing initially on JavaScript. Its storage model uses some ideas from Git, and since working heavily on Git throughout 2018 I have been thinking about how its ideas could transfer to package management.

This document outlines the ideas I've been thinking about in the hopes that they're useful to the development and direction of Entropic. I expect that the project founders will already be aware of some of this, and I'm also aware that some of it may not be applicable or may be out of scope.

Motivation

Since learning about how Git transfers content over the network, it became clear to me that most existing package managers could make big improvements to their current approach to two problems: resolving dependencies, and transferring package content. Using storage models from Git, we could significantly reduce the size of packages on the wire and on disk. Copying Git's protocol for fetching commits could also be used to reduce the latency of dependency resolvers.

Adopting these ideas could improve the install time UX for developers with large dependency trees, and users on slow or unreliable network connections.

I'm also interested in the mirroring story for package repositories. I'm an attendee of /dev/fort, which involves a group of devs and designers going offline for a week to build something together. We try to mirror various software repos -- apt, cpan, pypi, rubygems, npm -- and some are a lot more amenable to mirroring than others. Barriers to mirroring include a lack of stable tools, ad-hoc protocols, and the sheer size of some repositories. I am interested in separating the problem of package distribution from package installation to reduce the burden for this kind of work.

Background

The following proposals make use of several ideas from Git. Here we'll describe those elements of Git that are necessary to understand them.

Data model

Git's data model includes four principal types of object:

  • blob: a buffer of arbitrary data; essentially, the contents of a file
  • tree: a non-empty set of uniquely named pointers to blobs and other trees
  • commit: a pointer to a tree, zero or more pointers to commits, plus metadata
  • tag: a pointer to a commit, plus metadata

Every object has a unique ID derived from a cryptographic hash of its content, and is therefore immutable (changing the content would change the ID). Git uses SHA-1 but is working on migrating to a newer hash function.
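The content-addressed ID scheme can be sketched as follows. This is illustrative only, though the `"<type> <size>\0"` framing does mirror how Git hashes objects:

```python
import hashlib

def object_id(kind: str, data: bytes) -> str:
    """Derive an object's ID from a cryptographic hash of its content.

    Mimics Git's framing: a "<type> <size>\\0" header followed by the
    raw bytes, hashed with SHA-1.
    """
    header = f"{kind} {len(data)}".encode() + b"\x00"
    return hashlib.sha1(header + data).hexdigest()

# The same content always yields the same ID...
a = object_id("blob", b"hello\n")
b = object_id("blob", b"hello\n")
assert a == b

# ...and any change to the content changes the ID, which is why
# objects are effectively immutable.
c = object_id("blob", b"hello!\n")
assert a != c
```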

Commits form a directed acyclic graph (DAG) that describes the causal history of the project. We'll render such graphs as dots connected by lines, where the graph's edges always point from right to left, for example:

                  o---o
                 /
o---o---o---o---o-------o---o
     \       \             /
      o---o---o---o---o---o---o
               \     /
                o---o

Any lines leading to the left from a commit identify that commit's parents.

Every commit points to a tree that contains the complete project state at that commit. These trees are Merkle trees, a persistent data structure such that content that is unchanged from one commit to the next is not duplicated on disk; multiple commits can share the same version of a file by pointing at the same on-disk blob object, rather than by copying it.

For example, the tree of one commit may contain pointers to two files:

f8d58297 hello.txt
5367fa75 world.txt

If the subsequent commit only modifies hello.txt, then its tree will have a different pointer for hello.txt but use the same one for world.txt:

a2e3af82 hello.txt
5367fa75 world.txt

Thus both trees share the world.txt blob and it is not duplicated on disk. This idea extends to trees, so that entire unchanged directories are shared between commits.
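The sharing described above can be sketched with a toy content-addressed store (the helper names are invented for illustration):

```python
import hashlib

store = {}  # object ID -> content; each distinct object stored once

def obj_id(data: bytes) -> str:
    # Content-addressed ID; a stand-in for Git's object hashing.
    return hashlib.sha1(data).hexdigest()[:8]

def put_blob(data: bytes) -> str:
    oid = obj_id(data)
    store[oid] = data  # writing identical content twice is a no-op
    return oid

# First commit's tree: pointers to two files.
t1 = {"hello.txt": put_blob(b"hello v1"), "world.txt": put_blob(b"world")}

# Second commit modifies only hello.txt; world.txt resolves to the
# same blob ID, so the blob is shared rather than duplicated.
t2 = {"hello.txt": put_blob(b"hello v2"), "world.txt": put_blob(b"world")}

assert t1["world.txt"] == t2["world.txt"]  # shared pointer
assert len(store) == 3                     # v1, v2, world -- not four blobs
```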

Object storage

Git stores these objects in one of two ways. Its "loose" format stores each object in its own file, with its data compressed using zlib. This is highly space-inefficient for small text files as each file ends up well under the minimum filesystem block allocation size (e.g. 4 KiB on macOS).

The other storage format is the "packed" format in which a set of objects are all stored in the same file; these files use the .pack extension and essentially contain a set of objects concatenated end-to-end. This format is used both to store objects on disk and to send them over the network. When on disk, an accompanying .idx file contains an index recording the byte offset in the .pack at which each object begins; this allows fast random access into a file that would otherwise need to be read from the beginning to locate any of its contents.

Each object in a pack is still compressed using zlib, but an additional compression method is employed. If two objects are very similar, as is the case when commits make small edits to existing files, then Git will designate one of the pair as the "base" and represent the other as a binary diff against the base object. Objects so-compressed are called "deltas" and are typically over 90% smaller than the original object.

For example, given this pair of strings:

source: the quick brown fox jumps over the slow lazy dog
target: a swift auburn fox jumps over three dormant hounds

It is possible to represent the target string as the following series of instructions for rebuilding it from the source string:

insert "a swift aubur"
copy   14 19
insert "ree dormant hounds"

For files that have long substrings in common, the delta replaces those common substrings with "copy" instructions, which identify the offset and length of a common string and are only a few bytes in size.
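Applying such a delta can be sketched as below, using the example strings above. This is a toy rendering of the idea, not Git's actual binary delta format:

```python
def apply_delta(source: str, ops) -> str:
    """Rebuild the target from the source using insert/copy instructions.

    "insert" carries literal data; "copy" references an (offset, length)
    range of the source string.
    """
    out = []
    for op in ops:
        if op[0] == "insert":
            out.append(op[1])
        elif op[0] == "copy":
            _, offset, length = op
            out.append(source[offset:offset + length])
    return "".join(out)

source = "the quick brown fox jumps over the slow lazy dog"
delta = [
    ("insert", "a swift aubur"),
    ("copy", 14, 19),              # copies "n fox jumps over th"
    ("insert", "ree dormant hounds"),
]

assert apply_delta(source, delta) == \
    "a swift auburn fox jumps over three dormant hounds"
```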

This compression method greatly reduces the total size of a set of objects both on disk and over the network, at the cost of making reading these objects more expensive. To read a delta object, we must first read its base object, and then apply the delta to that base to reconstruct the original. That base object may itself be stored as a delta.

To limit this increased read expense, Git imposes a limit on the length of such delta chains when they are created, and only stores an object as a delta if it reduces its size by at least 50%. Delta compression typically compresses an object a lot (> 90%) or very little (< 10%), so in practice this threshold means Git retains deltas that produce very large reductions in size.

Network transfers

The fetch and push commands transfer objects over the network; they essentially copy portions of the commit graph between repositories. For example, say that Alice's repository contains the following commit graph:

      F   G
      o---o
     /     \
o---o---o---o---o
A   B  C \  H   J
          \
           o---o
           D   E

Bob's repository contains a subset of this graph:

o---o---o
A   B  C \
          \
           o---o
           D   E

When Bob fetches from Alice, the Git protocol first tells Bob that Alice has commits E and J -- the tips of the branches in that graph. Bob has E so he ignores that, but he does not have J. Therefore he sends a message to Alice saying he wants J but already has E; Bob saying he has commit E implies that he has all commits that are ancestors of E. From this information, Alice performs a local graph search to determine Bob must be missing commits F, G, H and J, and sends only those commits to Bob. The commits and any new trees and blobs they introduce are rolled up into a pack, which Bob streams directly to disk. Delta compression is performed on this pack as Alice generates it, but as this is a slow process it may reuse existing delta objects from her local pack files to save time.

This protocol saves time by making the "server" an active decision-maker rather than a static content repository. If Bob had to fetch just J, and then see that he did not have its parent H, and so fetch that, and so on, the protocol would require a lot more round-trips. When you consider that these commits point to graphs of tree and blob objects, and the server can quickly determine which objects the client is missing and roll all of them into a single pack, this greatly reduces the time that the protocol spends on network round-trips. It also requires the client to send very little data to the server to complete its request.
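Alice's side of the negotiation is a reachability search over the commit graph. A minimal sketch, using the example graph above (the commit labels follow the diagram; "edges" anticipates the thin-pack discussion below):

```python
# Each commit maps to its parents, per the Alice/Bob example graph.
parents = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["B"], "G": ["F"], "H": ["C", "G"], "J": ["H"],
}

def ancestors(commits):
    """All commits reachable from the given tips, including the tips."""
    seen, stack = set(), list(commits)
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def missing(want, have):
    # The server-side search: everything reachable from what the client
    # wants, minus everything reachable from what it already has.
    return ancestors(want) - ancestors(have)

def edges(miss):
    # "Edge" commits: parents of the missing range that the client
    # already has; their objects can serve as thin-pack delta bases.
    return {p for c in miss for p in parents[c]} - miss

m = missing({"J"}, {"E"})
assert m == {"F", "G", "H", "J"}
assert edges(m) == {"B", "C"}
```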

Thin packs

To save even more network bandwidth, Git can use what it calls "thin" packs. Packs stored on disk must be self-contained; any object that's used as a base for a delta must appear in the same pack as the delta. But when sending data over the network, the sender can use knowledge of what the receiver already has to avoid sending any complete base objects.

Above, Alice knows Bob has commits A, B, C, D and E, and so can use any content from them as the base for a delta in the pack she sends, without including those base objects in the pack. (In practice only B and C will be considered as they are direct parents of the commits being sent; Git calls these the "edge" of the selected range.) When Bob receives this pack, he must append any missing base objects from his own database so that the pack becomes self-contained.

Pack layout

In pack files, objects are either stored as self-contained full blob/tree/commit/tag objects, or as deltas -- a pointer to a base object, plus a binary diff that regenerates the original object from the base. The method Git uses to select objects as delta pairs, and the order in which it stores them, are informed by the likely access patterns when data is accessed on disk and sent over the network.

Storing an object as a delta saves a lot of space but makes it more expensive to read. Rather than reading an object's data straight off disk, Git must first load its base object (which may itself be a delta), and then apply the binary diff to regenerate the original object. Git limits the length of delta chains to prevent this becoming too expensive, but it also sorts objects in a particular way to reduce the read-time cost.

When working locally, Git commands more frequently access "recent" objects than older ones -- e.g. the status and diff commands compare the workspace to the current HEAD, the log command lists recent commits first. "Recent" here strictly means closely connected to the tips of branches, and in some situations Git will sort objects according to the history graph.

However, the pack compressor, in order to find pairs of objects that will form good delta pairs, sorts them by size. This is primarily because objects that are most similar to each other will tend to have similar sizes, but it's also because size is a good proxy for age: files tend to get larger over time.

Git sorts the objects largest-first, so that deltas usually have a larger object as their base. This means that recent objects tend to be stored literally, not as deltas, making them cheap to access. It also means that deltas tend to have very little inline data of their own -- they mostly consist of pointers to chunks of content from a larger object, and are extremely small as a result.

Finally, Git also orders objects in packs such that recent objects (in the graph connection sense, not by size or clock time) appear near the start of the file, making exceptions for objects that are bases that need to appear before deltas based on them. This further improves access times for recent objects on hardware where seeks are expensive.

Proposals

What follows is a list of ways in which we could use the above ideas to improve package distribution. I'm particularly interested in separating package distribution from package installation; transferring files between machines should be language-agnostic and separable from whatever it is an installer, compiler or other build tool wishes to do with those files.

That includes concerns like where files are stored and how they are laid out on disk; the package storage and distribution infrastructure should be free to choose how it stores packages, and an installer should be able to request file content without being concerned with the details of the storage design.

Within reason, package installers should be able to work offline. If an application's dependencies can be satisfied by packages that are already cached locally, and the user cannot connect to a registry, then the installer should use the cached content. See cargo --offline for an example. This means the distribution infrastructure should store enough information about packages locally to determine what they depend on, and to fully reconstruct their content.

Storage of content must be concurrency-safe. It is normal that multiple instances of a package installer run concurrently on a single machine, especially for build and CI tasks. If we reuse distribution infra for multiple installers, this will become more common. Concurrent requests to the distribution infrastructure that result in additions to locally cached content should be serialisable -- they should have the same result as if all requests were made sequentially. It may be acceptable for the physical representation of the content to differ, but not its logical interpretation. For example, the order requests are made in may change how the registry batches up content into different packs, but those packs viewed in total must represent the same logical set of objects once all requests are complete.
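One way to get this serialisability property is content-addressed storage with atomic renames, a sketch of which follows. The store layout and function name are invented for illustration; the key point is that two concurrent writers of the same object derive the same filename, so the logical result is identical whichever wins:

```python
import hashlib
import os
import tempfile

def write_object(store_dir: str, data: bytes) -> str:
    """Add an object to a local store safely under concurrent writers.

    Write to a temporary file, then atomically rename it into place at
    a content-addressed path. os.replace is atomic on POSIX
    filesystems, so readers never observe a partially written object.
    """
    oid = hashlib.sha1(data).hexdigest()
    path = os.path.join(store_dir, oid)
    if os.path.exists(path):
        return oid                       # already stored; nothing to do
    fd, tmp = tempfile.mkstemp(dir=store_dir)
    try:
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)
    os.replace(tmp, path)                # atomic publish
    return oid

store = tempfile.mkdtemp()
a = write_object(store, b"some object")
b = write_object(store, b"some object")  # duplicate write is a no-op
assert a == b
assert len(os.listdir(store)) == 1
```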

Terminology

I will be using the following definitions:

  • package: a conceptual namespace that refers to a library and its codebase across all its releases, for example, react is a package
  • release: the content comprising a specific version of a package, that is, some particular tree of files, plus some metadata e.g. a package.json file; this tree and metadata is a logical structure, not a physical one, and is independent of the manner in which the release is stored on disk or sent over a network
  • version: a name that refers to a specific release, usually made up of a package name and a version number, e.g. react@2.1.3
  • dependency: an expression denoting a set of versions, usually made up of a package name and a semver expression that identifies a set of versions of that package; in package.json this might be expressed as "react": ">=2.1.3 <3.0.0"

I have deliberately kept the term dependency high-level here. Most package managers overload dependencies with other kinds of properties, for example:

  • optional: a dependency is not strictly required for the application to build
  • situational: a dependency is required only in some environment, for example to run the application's tests
  • relational: this describes a dependency's position relative to another dependency in a materialised tree of releases in systems where install layout matters, for example npm's peerDependencies

The last of these is a concern of the package installer, not necessarily the distributor. The others are less clear to me at this time. From the point of view of trying to separate distribution from installation, it will be clear below that some dependency information is necessary for distribution.

The simplest solution I can imagine is that a release has a "default" dependency set, and possibly other named sets where those names have meaning to installers. For example an installer requests the default and test dependency sets, and that information is given to the dependency resolver to produce the full set of required versions.
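A sketch of what such metadata and its use might look like follows. The schema and the set name "test" are assumptions for illustration, not an existing Entropic format:

```python
# Hypothetical release metadata with a "default" dependency set plus
# other named sets that installers can opt into.
release = {
    "name": "foo",
    "version": "2.2.0",
    "dependencies": {
        "default": {"bar": ">=1.0.0 <2.0.0"},
        "test":    {"spec-runner": ">=3.1.0"},
    },
}

def requested_dependencies(release, sets=("default",)):
    """Merge the dependency sets an installer asked for, ready to be
    handed to a dependency resolver."""
    merged = {}
    for name in sets:
        merged.update(release["dependencies"].get(name, {}))
    return merged

assert requested_dependencies(release) == {"bar": ">=1.0.0 <2.0.0"}
assert "spec-runner" in requested_dependencies(release, ("default", "test"))
```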

Storage and bandwidth

We can improve the spatial efficiency of distributing packages, both on disk and over the network. Most package managers store each release of a package as a physically distinct artefact, for example each release is stored as a self-contained tarball or other compressed archive format.

Releases of the same package will tend to have a lot of content in common -- they are snapshots of the state of a single codebase, usually. Especially for patch releases, many of the files in a package may be identical from release to release, while others may change only slightly. This mirrors the storage characteristics of version control, where most content does not change from commit to commit. This means it's amenable to the use of Merkle trees and delta compression to save space on disk and on the wire.

Within a single package, we can in principle realise substantial space savings by moving to a Merkle-tree form of storage. This would de-duplicate files that do not change their content across multiple releases. I believe Entropic is already using such a file-centric storage model, rather than a release-centric one. In this model, a release becomes a pointer to a particular tree, plus some metadata, analogous to a Git commit.

However, releases are not connected in a graph like commits are; a package's releases are an unordered set, even though the versions that refer to the releases may form a partial order. No particular causal relationship is assumed between releases.

Merkle trees and delta compression provide different benefits, and do so in different circumstances. Making best use of them requires consideration of how releases are stored on disk and how they are sent over the network.

Merkle trees

Merkle trees let us de-duplicate files and directories that are unchanged between releases. Whenever two or more releases are stored together, a Merkle tree lets them share storage space. This is true regardless of how the releases are physically serialised to disk.

It is also true if the set of stored releases is updated incrementally. If we are sent a release that contains a file that exists in the releases we already have, then we don't need to store a copy of it. This means that even if each release is sent in its entirety over the network, any files or directories they have in common will be stored only once on disk, rather than multiple times as they would be under a tarball storage model.

On the network, we consider two distinct situations: two releases of the same package being sent over the network together, and a single release being sent to a receiver that already has another release of the same package.

When two or more releases are sent together, the sender can analyse their content and generate a set of the unique files/trees across all the releases, and thus send each object only once. As the receiver should be storing each distinct object only once, this serialisation should be easy to map to disk storage.

When sending releases to a receiver that already has some releases of the same package, it may be possible to use knowledge of what the sender already has to avoid sending redundant objects. In the Git example above, say a file exists in commit B and remains unchanged in F, G, H and J. Since Alice knows Bob has commit B, she does not need to send this file, even though it appears in the commits she's sending, because Bob already has it. Git uses this technique to reduce the number of objects sent; any objects that exist in the "edge" commits are not sent over the network.

In Git, commits are connected in a graph. The receiver requests the commits they want and states some commits they already have, and the sender works out which commits in the graph the receiver is missing. The sender generates a unique set of files and trees for these commits, and any objects present in "edge" commits are omitted. In this workflow, the inputs are a set of commits, and the output is a range of commits between the inputs, plus their associated content.

Releases are not connected in this way, and the data a receiver is requesting is often different. In fact there are two kinds of client to consider.

Say that a registry exists that contains a package foo, with versions 1 through 6. One kind of client is a replicator: another registry that wants to mirror the complete contents of the source registry. If a replicator already has foo@1, foo@2 and foo@4, it likely wants all the releases it's missing, and so its request would take the form:

fetch foo
want: *
have: 1, 2, 4

Because releases are not connected in a graph, the client must enumerate all the releases it has if it wants to minimise the data the source sends it; possession of foo@4 does not imply possession of any other release so the sender should not assume the receiver has any release it does not explicitly mention.

The sender responds to this request by enumerating any releases the receiver does not have: 3, 5 and 6. It generates their unique content objects, omitting any objects present in releases the receiver does have, and sends the resulting set.
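The sender's computation can be sketched like this; the registry contents and object IDs are invented for illustration:

```python
# Toy registry: each release of "foo" maps to its set of content
# object IDs (in reality, the IDs reachable from its root tree).
releases = {
    1: {"obj-a", "obj-b"},
    2: {"obj-a", "obj-c"},
    3: {"obj-a", "obj-d"},
    4: {"obj-e"},
    5: {"obj-d", "obj-f"},
    6: {"obj-f", "obj-g"},
}

def serve_fetch(want, have):
    """Compute what to send for a fetch request.

    want="*" (a replicator) selects every release the receiver lacks;
    otherwise only the explicitly wanted versions are selected. Objects
    already present in any "have" release are omitted.
    """
    missing = set(releases) - set(have) if want == "*" else set(want) - set(have)
    held = set().union(*(releases[v] for v in have)) if have else set()
    objects = set().union(*(releases[v] for v in missing)) - held
    return missing, objects

missing, objects = serve_fetch("*", {1, 2, 4})
assert missing == {3, 5, 6}
assert objects == {"obj-d", "obj-f", "obj-g"}  # obj-a is held, so omitted
```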

The other kind of client is an installer: it wants a specific release of a package in order to meet a dependency (direct or transitive) of an application. So its request can still include any versions it already has, but will mention a specific version it wants:

fetch foo
want: 5
have: 1, 2, 4

In this case the registry should send only version 5, but it can use the fact that the client has 1, 2 and 4 to avoid sending redundant objects.

However, both these cases introduce a performance problem: the registry will use the list of have versions to avoid sending objects the receiver already has. The cost of doing this scales with the number of have versions the client sends, which will likely monotonically increase with time. It may be possible to avoid scanning each release completely by stopping at tree IDs that have already been visited, but to a first approximation the cost scales with the number of have versions.

In Git this is not the case: only "edge" commits are considered when removing redundant objects. In our example above:

      F   G
      o---o
     /     \
o---o---o---o---o
A   B  C \  H   J
          \
           o---o
           D   E

Bob requests commit J saying he already has E, so the missing range is the commits F, G, H, J. The "edge" commits are any immediate parents of those commits that are not selected, that is, B and C.

When Bob sends his request, he does not need to mention every commit he has; just mentioning he has E implies he has A, B, C and D. The set of edge commits does not scale either with the number of commits Bob says he has, or with the size of the range he's missing. It scales with how much the graph forks and merges, not how long the selected chain is. That means the cost of removing redundant objects does not necessarily increase with time.

What does this mean for our package manager? One answer could be that the contents of releases, and the mapping of versions to releases, are immutable, and so all computations over sets of releases are eminently cacheable. However, caching may be of limited use here, because the space of possible requests is combinatorial: any combination of versions a client may want and may already have is a distinct input.

Another possible solution is to observe that removal of redundant objects from network transfers is a performance optimisation, not a correctness requirement, and so it may be acceptable to receive some objects we already have. So, clients may be required to limit the number of have versions they send. Some strategy would be needed to select a subset of versions the client has that are most likely to minimise the size of the download. For example, two patch releases are likely to have a lot of objects in common and so their marginal information content is low; whereas, major versions will differ more and so cover off more objects.

Or, in the case of installer clients, we could prioritise versions that are close to the requested version, as they will have the most content in common and therefore have the greatest chance of shrinking the download size.
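That heuristic could be as simple as the following sketch, which assumes plain integer versions and treats numeric closeness as a proxy for shared content:

```python
def choose_haves(have_versions, want, limit=3):
    """Pick a bounded subset of "have" versions to mention in a request.

    Heuristic: versions closest to the wanted version are most likely
    to share content with it, so mention those first and drop the rest.
    """
    return sorted(have_versions, key=lambda v: abs(v - want))[:limit]

# Wanting version 5, the client mentions only its two closest versions.
assert choose_haves([1, 2, 4, 9, 12], want=5, limit=2) == [4, 2]
```

A real implementation would need an ordering over semver versions, and might weight major-version boundaries differently, as suggested above.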

Delta compression

Delta compression is somewhat harder to exploit. Merkle trees, with the implied identification of objects by their contents, means objects are trivially deduplicated across releases by comparing their IDs. Exploiting delta compression requires some analysis of objects' content to identify similar objects and then replace the originals with deltas.

Registries and replicators may be interested in delta compression of data at rest depending on how much data they need to store. Something the size of npm would certainly benefit from this, while other registries may be just fine storing all objects literally. Installer clients will typically only ever download a tiny fraction of the total registry contents and so delta compression of data at rest is less of a concern here.

Delta compression is most attractive if it can reduce bandwidth and latency on the network. This implies that either multiple releases are being sent together and so their similar objects can be compressed, or that a registry is sending objects to a client when it knows which objects the client already has. In the latter case it can use an approach similar to Git's "thin packs" to delta-compress transferred objects against those that the receiver already has, and that are not being sent in this transfer. Whether the receiver expands these deltas or retains them as-is is a separate question that we'd explore as part of the storage design for registries and installers.

One major difference here is that I don't think a package distributor needs to prioritise cheap reads for recent objects as Git does. Git's delta compressor tends to store recent objects literally, and older objects as deltas. That's because it wants commands that act on recent data to feel instantaneous, so adding read latency would make a big difference. In package management, we are usually downloading a newer release of a package when we may already have some older releases. We'd like to send this newer release over the network as a delta against something the receiver already has. It would be fine to store this new release as a delta, as the time spent reading files will usually be a small fraction of the time an installer spends doing its work. The added latency of resolving a delta will be barely noticeable in many installers. However, the distributor should also be free to renormalise the content it has stored, and this operation should be opaque to the installer.

The main problem with delta compression is that it's computationally expensive. Identifying exact duplicate objects is very cheap as it just requires spotting identical IDs, whereas delta compression requires processing of the object contents, and possibly trying out multiple combinations of objects into delta pairs to find the best results. (Git uses a fixed-size sliding window over a sorted list of objects and looks for good delta results between nearby objects, so that this process runs in linear rather than quadratic time.)

Again, caching may provide a solution here, especially if we want to be CDN-friendly. Even with edge computing, we only have very limited time to perform on-the-fly work, and this likely rules out on-the-fly delta compression. However, it may be possible at publish or download time to identify common upgrade paths and pre-generate deltas for those releases. For example, if many people are running foo@3 and foo@4 is released, then pre-generating a delta from 3 to 4 would shrink the download for a lot of users. Such deltas could also be used when sending bundles of releases of the same package between registries -- a replicator could store the deltas it was sent and use them to service installer requests.

For example, if an installer sends the request:

fetch foo
want: 5
have: 1, 2, 4

Then the registry could search its cache for a delta from 1 to 5, from 2 to 5, or from 4 to 5, send that delta if it exists, and send complete objects otherwise. This path may be indirect, for example the registry may have deltas from 2 to 3, and from 3 to 5, but sending both of those may still be preferable to sending a complete copy of 5.
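The search for an indirect delta path is a shortest-path problem over the cache. A breadth-first sketch, with an invented cache layout keyed by (base, target) version pairs:

```python
from collections import deque

def delta_path(deltas, have, want):
    """Find the shortest chain of cached deltas from any version the
    client has to the wanted version, or None if no chain exists (in
    which case complete objects must be sent)."""
    graph = {}
    for base, target in deltas:
        graph.setdefault(base, []).append(target)
    queue = deque((v, []) for v in have)
    seen = set(have)
    while queue:
        version, path = queue.popleft()
        if version == want:
            return path
        for nxt in graph.get(version, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(version, nxt)]))
    return None

cached = {(2, 3), (3, 5), (4, 6)}
# No single cached delta reaches 5 from {1, 2, 4}, but 2 -> 3 -> 5 does.
assert delta_path(cached, have={1, 2, 4}, want=5) == [(2, 3), (3, 5)]
assert delta_path(cached, have={1}, want=5) is None
```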

Dependencies and latency

All the above deals with trying to reduce the volume of data stored and transferred, which is a major contribution to the cost of operating a registry and a major source of latency for end users. Another source of end-user latency is round trips performed while expanding dependencies.

Before it can get to requesting releases of packages, an installer needs to determine which releases it even needs. To do this, it needs to take a set of dependencies and convert it into a set of exact versions for releases it wants to download. That is, given a set of the immediate dependencies of an application and their acceptable version ranges, it must produce a set of exact versions that meets those dependencies, including any transitive dependencies -- we'll call this set of versions the install set.

This set of versions may include multiple versions of the same package (cf. npm). Some installers will not support running multiple releases of the same package in the same project, and so also need to constraint-solve across the complete install set to find versions that are mutually compatible with everything else in the install set. Having done this, an installer will often write the install set to a lockfile, such as package-lock.json, Gemfile.lock or Cargo.lock.

Most package managers do this by fetching the metadata for each direct dependency. For example if our app depends on react, the package manager will download the metadata for react (either the package or a specific version thereof), and find out what react depends on. It does this recursively until it's discovered the full set of transitive dependencies. This process may cache package metadata locally, but it remains the responsibility of the client to walk the dependency graph and generate the install set.

As we saw above, Git's fetch command does not do this. Bob does not download commit J, discover its parent is H, download H, discover its parents as G and C, download G, etc. Instead he sends a message to Alice saying "I want J and I already have E", and Alice performs the graph search locally to determine what to send.

For end users with slow or unreliable connections, these round trips are extremely expensive, especially for large dependency trees as we tend to see in npm. Even though the amount of data transferred is small compared to the package contents, the request latency compounds with the depth of the dependency tree, and constraint-solving may require fetching metadata for many versions of the same package.

Existing approaches focus on making these metadata requests cheaper -- separating metadata from package content, bundling this metadata into indexes by package or across the registry, aggressive client-side caching. Another approach could be to move the problem away from the client altogether, and have the registry compute the install set.

In the simplest design, a client would send its list of dependencies to a registry, and the registry would send back the install set, and possibly the content for the releases identified in the install set.

A possible optimisation to this would be that the client also sends a list of versions it already has access to -- if it already has some releases that would satisfy the dependencies, it could use those even if a more recent release exists, and thereby skip downloading that package. However, as the client does not know which packages will end up being identified in the install set, it may have to send its complete list of installed versions across all packages, which would make the request very large. This could be fixed by having the registry send a set of acceptable versions for each package, rather than exact versions, when it computes the install set. The client could then decide whether to use a version it already has, or request a new one from the registry.

Proposed protocol flow

I suspect that the simplest design in terms of separation of concerns is that the client works as follows:

  • send its dependencies to a registry, which returns an install set -- a list of versions the client should fetch; these versions could point to content in multiple registries
  • group this install set by the registry each version is stored in
  • send a request to each registry stating which versions it wants, and which versions of the relevant packages it already has; the registry then streams the requested content, omitting redundant objects, and applying delta compression where possible within each package
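The client's side of this flow can be sketched as follows; the request shape is a hypothetical wire format, not a defined protocol:

```python
def plan_fetches(install_set, local_versions):
    """Group a resolved install set by registry and attach local haves.

    install_set: list of (registry, package, version) from the resolver.
    local_versions: {(registry, package): [versions already on disk]}.
    Returns one request per registry.
    """
    requests = {}
    for registry, package, version in install_set:
        req = requests.setdefault(registry, {"want": [], "have": []})
        req["want"].append(f"{package}@{version}")
        for v in local_versions.get((registry, package), []):
            req["have"].append(f"{package}@{v}")
    return requests

# The resolver returned versions spread across two registries, and we
# already have an older release of bar locally.
install_set = [
    ("npm.example.com", "foo", "2.2.0"),
    ("npm.example.com", "bar", "1.0.0"),
    ("entropic.example.com", "qux", "0.0.4"),
]
local = {("npm.example.com", "bar"): ["0.9.0"]}

reqs = plan_fetches(install_set, local)
assert reqs["npm.example.com"] == {"want": ["foo@2.2.0", "bar@1.0.0"],
                                   "have": ["bar@0.9.0"]}
assert reqs["entropic.example.com"] == {"want": ["qux@0.0.4"], "have": []}
```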

A nice property of this is that it separates the roles of the content store and the metadata store -- the package content and dependency info could be stored in different systems.

For example, say we have an app that depends on foo >= 2.1.3. To get all its dependencies, we make a request to resolver.example.com with this list of dependencies. The resolver does its work and sends us back this list of versions:

npm.example.com/foo@2.2.0
npm.example.com/bar@1.0.0
entropic.example.com/qux@0.0.4

We'd then make two requests, one to npm.example.com and one to entropic.example.com. To each we'd send the versions we want, and a list of any versions we already have. For example if we already have bar@0.9.0:

To: npm.example.com
want: foo@2.2.0, bar@1.0.0
have: bar@0.9.0

npm.example.com uses this to construct the smallest pack it can -- omitting objects that exist in bar@0.9.0, and applying delta compression of bar@1.0.0 against the objects in bar@0.9.0. (If multiple releases of bar were being sent it could also apply delta compression among them.)

To entropic.example.com we send a request for qux@0.0.4, which is sent to us in its entirety as we don't have any releases of qux stored locally.

This shows how the dependency resolver, and content stores for each package, can be nicely separated while minimising the number of connections and round-trips to each server.

In Git, we saw that Alice performing the graph search locally sped up Bob's request for the missing commits. But this does not necessarily have to happen locally, it just needs to happen with lower latency than an end user will experience. This means the resolver could be an entirely separate service from the source registries that contain package metadata, as long as it has a low-latency connection to them. In fact this may be an entirely separate service from registries per se -- just moving dependency resolution into some cloud service rather than running on end user machines would improve their experience significantly.
