1.10 Distribution Changes Design Doc

Overview

Recent security bugs exposed some big deficiencies in the current manifest format and graph store. The work done for 1.8.3/1.9 focused on short-term fixes without changes to data structures, protocols, or on-disk formats. This addressed the immediate concerns, but the current data model is poorly suited to how we use it, which makes it a maintenance burden and a potential source of future problems. Moving past 1.9, it's important to adopt cleanly content-addressable data structures that accurately represent the data.

The solution for 1.10 includes the following components:

New manifest format

The current manifest format has two big problems which contributed to the security issues. First, it is not truly content addressable, since the digest which identifies it is only taken over a portion of the manifest. Second, it includes a "v1compatibility" string for each FS layer. This ties the format to v1 identifiers and to a one-to-one mapping between layers and configurations, both of which are problematic.

Docker 1.10 adds a new manifest format that corrects these problems. The manifest consists of a list of layers and a single configuration. The digest of the manifest is simply the hash of the serialized manifest. We add an image configuration object that completely covers both the image's configuration and its root filesystem, making it possible to use the hash of the configuration as a content-addressable ID of the image.

Cleanliness and security are not the only motivations for introducing a new manifest format - it's also a prerequisite for multi-architecture support. The new manifest format includes a "manifest list" schema, which can specify a separate manifest for each architecture an image supports. Multi-architecture support probably won’t be ready for Docker 1.10, but including this feature in the new manifest version is much better than introducing yet another manifest format soon after.

For the latest draft of the new manifest spec, see distribution/distribution#1068.

Fixing the relationship between images and layers

In current versions of Docker, each layer has a configuration associated with it. This isn't the right data model, because a configuration should really apply to a set of layers. It should be possible for two configurations to reference the same layer stack without duplicating the data.

This is a major change to how the graph store works, and how it’s stored on disk. Layers will be referenced by multiple images, so they must be reference counted. Layers will be stored on disk in a content-addressable way, instead of being stored by image ID (see below).

Improved content-addressability of layers

Layers are currently referenced by a sha256 digest of the compressed artifact for v2 pushes and pulls. This digest is not used elsewhere in the engine; for example, "docker build" and v1 pulls do not produce content-addressable layers. In the new layer storage mechanism, we always address layers and images by their uncompressed content. This allows us to reproducibly get the same data back.

Improve push/pull reliability

We will attempt to make push and pull more reliable operations while updating this code.

Design choices

Content hashes vs. distribution hashes

One of the key decisions underlying these changes is where to use hashes of compressed data ("distribution artifacts") and where to use hashes of uncompressed data ("content hashes"). It's useful to have hashes of the distribution artifacts, because that makes it possible to download layers without requiring the registry to keep a mapping of content hashes to distribution artifact hashes. Also, it lets the client verify the files it downloaded without first uncompressing them (decompressing a file can be a risky operation).

However, the distribution artifacts may change from push to push, because we don't specify a fully deterministic approach to compression. In practice, feeding the data into the compressor in different increments may cause differences in distribution artifacts. Changes in Go's gzip libraries can also affect the hash stability of distribution artifacts. Working around this reliably would require storing a duplicate copy of each layer in the gzipped format - there is no way to reproduce a gzip pulled from an unknown origin.

Another very important consideration is that hashing a gzipped artifact requires the compression to happen in the first place. If we used distribution artifact hashes everywhere, docker build would have to compress the filesystem content involved in the build, in order to build a content-addressable layer. This would significantly slow down docker build.

Finally, distribution artifact hashes are not suitable for docker save and docker load, because these features work with uncompressed tars.

For these reasons, we chose to use content hashes and distribution hashes in the respective places where they make most sense. Image IDs are generated based on content hashes, so that docker build doesn't need to compress the filesystem, and so that image IDs don't change from push to push depending on the output of the compressor. However, the new manifest format will include a list of distribution artifact hashes alongside the reference to the image configuration. This list of hashes is essentially a set of instructions on what to download, and then the downloads can be verified against the content hashes in the image configuration after decompression.
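
As an illustration of that flow, here is a minimal sketch (not the engine's actual pull code; the function and parameter names are illustrative) of verifying a downloaded layer against both identifiers: the compressed bytes are checked against the blobsum from the manifest, and the decompressed tar is checked against the DiffID from the image configuration.

package example

import (
        "compress/gzip"
        "crypto/sha256"
        "errors"
        "fmt"
        "io"
)

// verifyLayerBlob checks a downloaded layer blob against the distribution
// artifact digest listed in the manifest (hash of the compressed bytes) and
// the DiffID listed in the image configuration (hash of the uncompressed
// tar). Digest strings are of the form "sha256:<hex>".
func verifyLayerBlob(blob io.Reader, manifestBlobsum, configDiffID string) error {
        compressedHash := sha256.New()
        // Hash the compressed bytes while streaming them into the decompressor.
        gz, err := gzip.NewReader(io.TeeReader(blob, compressedHash))
        if err != nil {
                return err
        }
        defer gz.Close()

        uncompressedHash := sha256.New()
        if _, err := io.Copy(uncompressedHash, gz); err != nil {
                return err
        }

        if fmt.Sprintf("sha256:%x", compressedHash.Sum(nil)) != manifestBlobsum {
                return errors.New("distribution artifact digest mismatch")
        }
        if fmt.Sprintf("sha256:%x", uncompressedHash.Sum(nil)) != configDiffID {
                return errors.New("layer DiffID mismatch")
        }
        return nil
}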

One tradeoff involved with this is that pull-by-digest must use a different hash namespace than the image ID namespace. The image IDs are based purely on content hashes, but manifests include distribution artifact hashes as well. It would have been nice to unify these two identifiers, but the issues with hash stability and the need for compression discussed above mean there would be major downsides to going that route. Note that this separation would probably be needed anyway for multi-arch manifest lists that are generated separately and then combined later.

Hashing the tar stream

The section above discussed distribution artifact hashes and content hashes. Here, we dive deeper into how those hashes are computed.

A distribution artifact hash is simply a hash of the compressed blob, exactly as it's transmitted between the client and server. But what about the content hash? Deciding how to compute this is less obvious. We originally considered an approach that would create a Merkle DAG of directories, files, and their associated content data and metadata, by reading this data and metadata out of the graph driver.

We quickly found that this approach would not work reliably. Various graph drivers support different feature sets - for example, some do not support extended attributes. Metadata would be lost or mutated depending on the particular graph driver in use, causing hash instability. The only way to work around this would be to store all the metadata outside the graph driver, and refer to it when computing hashes.

However, storing original metadata is exactly what tarsplit does. Rather than reimplement a clone of tarsplit, content hashes are created by hashing an uncompressed tar stream. tarsplit restores the metadata as it existed in the original tar, which means that graph driver bugs and missing features are not issues as far as hash stability is concerned. And even if we implemented a similar system, we would still need to use tarsplit to stay compatible with old versions and to keep the regenerated artifacts consistent.

Hashing the uncompressed tar also has the advantage that we're hashing the exact content that gets imported into the layer store, rather than creating a parallel rendition which could have subtle differences. This gives us more confidence that we're hashing the right data. Finally, tarsplit is far simpler and easier to reason about than creating an alternative serialization format as hash function input.
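
Since the input to the content hash is just the uncompressed tar stream that tarsplit can faithfully regenerate, computing it requires no compression step. A minimal sketch (the helper name is illustrative):

package example

import (
        "crypto/sha256"
        "fmt"
        "io"
)

// diffIDFromTar computes a layer content hash: the SHA-256 of the
// uncompressed tar stream, exactly as it is fed into the layer store.
func diffIDFromTar(tarStream io.Reader) (string, error) {
        h := sha256.New()
        if _, err := io.Copy(h, tarStream); err != nil {
                return "", err
        }
        return fmt.Sprintf("sha256:%x", h.Sum(nil)), nil
}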

We have implemented on-disk storage with a LayerStore interface that manages the creation of content-addressable hashes internally. This means that the content hashing design is well-abstracted and can be changed in the future.

Backward compatibility

The on-disk format will not be compatible with that used by previous versions. There is a migration step to import old content from an existing "graph". This will change IDs to use the new content-addressable scheme, but existing data will not be lost.

Docker 1.10 will still be capable of pulling and pushing to and from v1 registries, and v2 registries that don't support the new manifest format. It will assign new content-addressable IDs to images pulled with those mechanisms.

Pushes to old registries will involve a one-time cache bust, since a synthetic set of parent images will be created to fill in the parent chain that's part of the data model of the old manifest. These parent image configurations and their IDs will be created with a deterministic algorithm, so that if two engines push the same thing to an old registry, they will be able to share layers.

Use of the new manifest format will be controlled by an Accept header. If the engine does not send an appropriate Accept header, the registry will assume it only supports the old format, and do on-the-fly rewrites of new manifests to the old format. This has the side effect that images pushed with the new manifest using content trust will only verify correctly on engines that accept the new manifest format.
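
For illustration, a client that understands the new format would advertise it roughly like this when fetching a manifest (a sketch; the media type strings come from the manifest spec referenced above, while the function and parameters are illustrative):

package example

import "net/http"

// manifestRequest builds a GET request for a manifest, advertising support
// for the new schema2 format via the Accept header while still accepting the
// old schema1 format from registries that predate it.
func manifestRequest(registryURL, name, reference string) (*http.Request, error) {
        req, err := http.NewRequest("GET", registryURL+"/v2/"+name+"/manifests/"+reference, nil)
        if err != nil {
                return nil, err
        }
        req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v2+json")
        req.Header.Add("Accept", "application/vnd.docker.distribution.manifest.v1+json")
        return req, nil
}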

New image config

The image configuration JSON is what gets hashed to generate an image ID. There are some tweaks to the image structure to make it suitable for this.

rootfs

A new rootfs key references the layer content addresses used by the image. This makes the image config hash depend on the filesystem hash. rootfs has two subkeys:

  • type is currently always set to layers. Other types might be added in the future.
  • diff_ids is an array of layer content hashes, in order from bottom-most to top-most.

Here is an example rootfs section:

    "rootfs": {
      "diff_ids": [
        "sha256:c6f988f4874bb0add23a778f753c65efe992244e148a1d2ec2a8b664fb66bbd1",
        "sha256:5f70bf18a086007016e948b04aed3b82103a36bea41755b6cddfaf10ace3c6ef",
        "sha256:13f53e08df5a220ab6d13c58b2bf83a59cbdc2e04d0a3f041ddf4b0ba4112d49"
      ],
      "type": "layers"
    }

history

Since parent images are no longer created during a docker pull, the image configuration now contains history information. The history object is an array of objects with the following fields:

  • created: Creation time, expressed as a time.Time marshalled to JSON.
  • author: The author of the build point.
  • created_by: The command which created the layer.
  • comment: A custom message set when creating the image.
  • empty_layer: Marks whether the history item corresponds to an actual layer, for maximum compatibility with the existing v2 manifest format. It is set to true if this history item doesn't correspond to an actual layer in the rootfs section (for example, a command like ENV, which results in no change to the filesystem).

Here is an example history section:

    "history": [
      {
        "created": "2015-10-31T22:22:54.690851953Z",
        "created_by": "/bin/sh -c #(nop) ADD file:a3bc1e842b69636f9df5256c49c5374fb4eef1e281fe3f282c65fb853ee171c5 in /"
      },
      {
        "created": "2015-10-31T22:22:55.613815829Z",
        "created_by": "/bin/sh -c #(nop) CMD [\"sh\"]",
        "empty_layer": true
      }
    ]

Removed fields

The following fields are no longer used:

  • id (now computed by hashing the JSON)
  • parent (stored out-of-band)
  • Size (unneeded since layers are referenced in the rootfs section)

ID definitions and calculations

The following summarizes the different types of IDs involved and how they are calculated:

  • layer.DiffID: ID for an individual layer.
    Calculation: DiffID = SHA256hex(uncompressed layer tar data)
  • layer.ChainID: ID for a layer and its parents. This ID uniquely identifies a filesystem composed of a set of layers.
    For the bottom layer: ChainID(layer0) = DiffID(layer0)
    For other layers: ChainID(layerN) = SHA256hex(ChainID(layerN-1) + " " + DiffID(layerN))
  • image.ID: ID for an image. Since the image configuration references the layers the image uses, this ID incorporates the filesystem data as well as the rest of the image configuration.
    Calculation: image.ID = SHA256hex(imageConfigJSON)
  • "V1 ID": legacy image/layer ID; originally not content-addressable. Calculated for schema1 manifests as follows.
    For the top layer: V1ID(layerTOP) = SHA256hex(blobsum(layerTOP) + " " + V1ID(layerTOP-1) + " " + imageConfigJSON)
    For other layers: V1ID(layerN) = SHA256hex(blobsum(layerN) + " " + V1ID(layerN-1))
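
The ChainID recurrence translates directly into code. A minimal sketch, assuming digest strings carry their "sha256:" prefix as they do elsewhere in the engine:

package example

import (
        "crypto/sha256"
        "encoding/hex"
)

// createChainID folds an ordered list of DiffIDs (bottom-most first) into a
// single ChainID, following the recurrence in the list above.
func createChainID(diffIDs []string) string {
        if len(diffIDs) == 0 {
                return ""
        }
        chainID := diffIDs[0]
        for _, diffID := range diffIDs[1:] {
                sum := sha256.Sum256([]byte(chainID + " " + diffID))
                chainID = "sha256:" + hex.EncodeToString(sum[:])
        }
        return chainID
}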

Interfaces

The current graph package conflates several different concepts, and isn't well-abstracted or composable. The refactor will involve three major interfaces: the layer store, image store, and tag store.

Layer store

The layer store is the lowest-level component of the three. It manages filesystem layers in the underlying graph drivers. It does not know about image configuration.

Layers can be added to the layer store by passing a tar stream and a parent ID (see the Register function below). The return value is a layer ID ("ChainID") computed from the content and the content of the parent chain. Thus, IDs are always computed locally rather than trusting remotely-specified IDs. This ID takes into account the position in the layer chain, in contrast to the DiffID, which only identifies the data in a particular layer. The image configuration includes an array of DiffIDs instead of a single ID so it's possible to determine which individual layer artifacts need to be pushed or pulled. A ChainID can be computed from a set of DiffIDs using a one-way function.

The layer store reference counts layers so that multiple images can refer to overlapping sets of layers. When the last reference to a layer is released, the underlying data on disk is deleted.

package layer

// ChainID is the content-addressable ID of a layer.
type ChainID digest.Digest

// DiffID is the hash of an individual layer tar.
type DiffID digest.Digest

// TarStreamer represents an object which may
// have its contents exported as a tar stream.
type TarStreamer interface {
        TarStream() (io.Reader, error)
}

// Layer represents a read only layer
type Layer interface {
        TarStreamer
        ChainID() ChainID
        DiffID() DiffID
        Parent() Layer
        Size() (int64, error)
        DiffSize() (int64, error)
        Metadata() (map[string]string, error)
}

// RWLayer represents a layer which is
// read and writable
type RWLayer interface {
        TarStreamer
        Path() (string, error)
        Parent() Layer
        Size() (int64, error)
}

// Metadata holds information about a
// read only layer
type Metadata struct {
        // ChainID is the content hash of the layer
        ChainID ChainID

        // DiffID is the hash of the tar data used to
        // create the layer
        DiffID DiffID

        // Size is the size of the layer content
        Size int64

        // DiffSize is the size of the top layer
        DiffSize int64
}

// MountInit is a function to initialize a
// writable mount. Changes made here will
// not be included in the Tar stream of the
// RWLayer.
type MountInit func(root string) error

// Store represents a backend for managing both
// read-only and read-write layers.
type Store interface {
        Register(io.Reader, ChainID) (Layer, error)
        Get(ChainID) (Layer, error)
        Release(Layer) ([]Metadata, error)

        Mount(id string, parent ChainID, label string, init MountInit) (RWLayer, error)
        Unmount(id string) error
        DeleteMount(id string) ([]Metadata, error)
        Changes(id string) ([]archive.Change, error)
}
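
A hypothetical caller would interact with the store roughly as follows (a sketch written against the interfaces above, as if it lived in the layer package; the helper name is illustrative and error handling is minimal):

package layer

import "io"

// registerAndRelease sketches the lifecycle of a read-only layer: it is
// registered from an uncompressed tar stream on top of a parent chain, used,
// and then released.
func registerAndRelease(ls Store, tarStream io.Reader, parent ChainID) error {
        // Register computes the new layer's ChainID locally from the stream
        // contents and the parent chain; remotely supplied IDs are not trusted.
        l, err := ls.Register(tarStream, parent)
        if err != nil {
                return err
        }

        // ... use l.ChainID() and l.DiffID() here ...

        // Release drops our reference. The returned Metadata describes any
        // layers whose on-disk data was deleted because the last reference
        // to them went away.
        _, err = ls.Release(l)
        return err
}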

Image store

The image store manages image configurations. It holds references to the underlying layers. Images are content-addressed, and identified by the hashes of their configurations. Since the configurations include the digests of the content of each layer, this means an image ID depends on its configuration and all filesystem data.

package image

// ID is the content-addressable ID of an image.
type ID digest.Digest

// Store is an interface for creating and accessing images
type Store interface {
        Create(config []byte) (ID, error)
        Get(id ID) (*Image, error)
        Delete(id ID) ([]layer.Metadata, error)
        Search(partialID string) (ID, error)
        SetParent(id ID, parent ID) error
        GetParent(id ID) (ID, error)
        Children(id ID) []ID
        Map() map[ID]*Image
        Heads() map[ID]*Image
}
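
Because an image ID is just the digest of the configuration bytes passed to Create, computing one is trivial. A minimal sketch:

package example

import (
        "crypto/sha256"
        "encoding/hex"
)

// imageIDFromConfig derives an image ID: the SHA-256 digest of the image
// configuration JSON. Any change to the configuration, including the rootfs
// diff_ids, produces a different image ID.
func imageIDFromConfig(configJSON []byte) string {
        sum := sha256.Sum256(configJSON)
        return "sha256:" + hex.EncodeToString(sum[:])
}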

Tag store

The tag store keeps a mapping of tags and digests to image IDs. Note that the digests stored in the tag store are manifest digests, which are different from the hashes that generate image IDs.

package tag

// An Association is a tuple associating a reference with an image ID.
type Association struct {
        Ref     reference.Reference
        ImageID images.ID
}

// Store provides the set of methods which can operate on a tag store.
type Store interface {
        References(id images.ID) []reference.Named
        ReferencesByName(ref reference.Named) []Association
        Add(ref reference.Named, id images.ID, force bool) error
        Delete(ref reference.Named) (bool, error)
        Get(ref reference.Named) (images.ID, error)
}

Interoperability with old image formats

Docker 1.10 will still be able to push and pull from existing registries. This section explains how images are converted in and out of the new content-addressable model.

V2 registry (old manifest format)

The engine maintains a mapping between content-addressable layer IDs and compressed artifact blobsums. The mapping is bidirectional. For looking up a content artifact blobsum from a layer, the mapping is layer.DiffID -> []blobsum. For looking up a layer DiffID from an artifact blobsum, the mapping is blobsum -> layer.DiffID.
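
A minimal in-memory sketch of that bidirectional mapping (the engine persists it on disk; the type and field names here are illustrative):

package example

// blobsumService holds the two lookup directions described above. DiffIDs and
// blobsums are digest strings such as "sha256:<hex>".
type blobsumService struct {
        // diffIDToBlobsums maps a layer DiffID to the compressed artifacts known
        // to contain it; there may be several, since compression is not
        // deterministic from push to push.
        diffIDToBlobsums map[string][]string
        // blobsumToDiffID maps a compressed artifact digest back to the DiffID
        // of its uncompressed contents.
        blobsumToDiffID map[string]string
}

// add records that the layer with the given DiffID is available as the given
// compressed artifact.
func (s *blobsumService) add(diffID, blobsum string) {
        s.diffIDToBlobsums[diffID] = append(s.diffIDToBlobsums[diffID], blobsum)
        s.blobsumToDiffID[blobsum] = diffID
}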

Pulling

When we pull something using the old manifest format, we first deduplicate adjacent layers from the manifest. Then we check the blobsums-to-layer-DiffID mappings for each layer, in order from bottom to top, so we can skip downloading layers we already have. If a particular blobsum isn't found in the mapping table, or the layer referred to by the set of DiffIDs so far doesn't exist on disk, we download all subsequent layers.

We download the needed layers, and register them in order. Each call to Register produces a layer DiffID that we associate with its blobsum in the mapping table for future reference. Note that layers with the throwaway key set to true in the v1compatibility object are treated as empty layers and not registered with the layer store or included in the rootfs section. These were included in the manifest only to preserve the history information for that layer.

Once all the layers are present, the image object can be created. The configuration is converted from the first image in the History list by removing the following fields:

  • id
  • parent
  • Size
  • parent_id
  • layer_id
  • throwaway

... and adding these new fields:

  • rootfs (a list of layer DiffIDs for the image)
  • history (image history generated from other configs in the History list)

Aside from being used to generate the history field in the image, the other V1Compatibility items in History are ignored. The pull only results in one image, not a runnable image for each layer. V1 IDs in the manifest are not preserved; instead the engine uses content-addressable image and layer IDs.
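
A rough sketch of that conversion, treating the v1compatibility payload as a generic JSON object (the rootfs and history values are assumed to have been built separately; the function name is illustrative):

package example

import "encoding/json"

// configFromV1Compatibility converts the top v1compatibility JSON into a
// new-style image configuration: legacy fields are dropped and the
// content-addressable rootfs plus history sections are added.
func configFromV1Compatibility(v1JSON []byte, rootfs, history interface{}) ([]byte, error) {
        var config map[string]interface{}
        if err := json.Unmarshal(v1JSON, &config); err != nil {
                return nil, err
        }
        for _, field := range []string{"id", "parent", "Size", "parent_id", "layer_id", "throwaway"} {
                delete(config, field)
        }
        config["rootfs"] = rootfs   // list of layer DiffIDs for the image
        config["history"] = history // generated from the other V1Compatibility items
        return json.Marshal(config)
}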

Pushing

Pushing also makes use of the mapping tables described above. If a DiffID is known to correspond to a certain artifact digest, and the registry already has a blob with that digest, we can skip pushing the layer.

We push the layers that need to be pushed, and then construct an appropriate manifest for the image. The manifest contains a runnable configuration for the top image, and synthetic configurations for the layers below it. These synthetic configurations don't result in runnable images, so if an older engine pulls something pushed by 1.10, it can run the image it pulled, but not any of the parent images.

The synthetic configurations for lower layers only have an id field, a parent field (for all layers except the bottom-most), and fields involved in storing history information. Their V1 IDs are generated using a hash chain:

V1ID(layerN) = SHA256hex(blobsumN + " " + V1ID(layerN-1))

This ensures that V1 IDs will not collide between manifests that have different filesystem contents or different layer order.

The top image configuration is created from the 1.10 image configuration by removing the rootfs and history keys, and adding appropriate id and parent keys. The v1 ID for the top-level image is generated by extending the hash chain used for the other layers:

V1ID(layerTOP) = SHA256hex(blobsumTOP + " " + V1ID(layerTOP-1) + " " + imageConfigJSON)

Including the image config JSON in the V1 ID calculation for this top layer ensures that top layers in different manifests which have the same filesystem contents but different configurations won't have colliding IDs.
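
Putting the two formulas together, the synthetic V1 IDs for a push form a simple hash chain. A sketch (it assumes the bottom-most layer, which has no parent, is hashed over its blobsum alone):

package example

import (
        "crypto/sha256"
        "encoding/hex"
)

// v1IDChain sketches the synthetic V1 IDs generated for a push to an old
// registry. blobsums are ordered bottom-most first; the top (last) layer also
// mixes in the runnable image configuration JSON.
func v1IDChain(blobsums []string, imageConfigJSON []byte) []string {
        ids := make([]string, len(blobsums))
        parent := ""
        for i, blobsum := range blobsums {
                input := blobsum
                if parent != "" {
                        input = blobsum + " " + parent
                }
                if i == len(blobsums)-1 {
                        // Top layer: include the image config so that images with the
                        // same layers but different configurations get different IDs.
                        input += " " + string(imageConfigJSON)
                }
                sum := sha256.Sum256([]byte(input))
                ids[i] = hex.EncodeToString(sum[:])
                parent = ids[i]
        }
        return ids
}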

If there are any empty layers in the image, those get entries in the manifest so that their history information is not lost. However, a special throwaway key is added to their v1compatibility string to mark them as empty layers, so the image comes out the same way after pulling it back.

After pushing layers, the mappings between layer DiffIDs and content artifact digests are updated. This means future pushes can skip layers that were already uploaded, and future pulls can skip downloading layers that already exist on disk.

V1 registry

The strategy with the V1 registry is similar, but the V1 ID calculation scheme is different because there are no blob digests available in the V1 protocol.

Instead of keeping a mapping between blob digests and layer IDs, we keep a unidirectional mapping from (registry, V1 ID) tuples to layer IDs.

Pulling

For each layer V1 ID involved in a pull, the engine first consults the V1 ID mapping to see if it already has the layer. If it doesn't, it downloads each missing layer in order and registers it. The mapping is updated as necessary.

The image configuration is constructed using the same approach as v2 pull.

Pushing

When pushing, the engine makes a list of layers involved in the push operation, and precomputes V1 IDs for them.

Similarly to V2, top layers are treated differently from other layers. For top layers, we use the image ID as the V1 ID. The image ID is based on a hash over the image configuration (which references the filesystem layers by digest), so it uniquely identifies the entire image.

Other layers use the layer ChainID for the V1 ID. The layer ChainID is based on the hash of the filesystem data for that layer. Using the hash of only the filesystem data lets layers (other than the top layer) be shared between different images. Configuration differences are not a concern for these layers, because they always have synthetic, deterministically generated configurations. The synthetic configurations are generated using the same scheme as for V2 push (except that history is not currently preserved on V1 pushes).
