@alfredkrohmer
Created March 19, 2017 22:25
Proposal: Deduplicated storage and transfer of container images

Docker uses layers to build containers and images on top of other images. This avoids storing and transferring layers that already exist on the system. Rkt uses no layers and needs to fully download any new image.

This document proposes to use deduplication to store and transfer container images. This has multiple advantages over layering:

  • saves even more space than layering, because files with the same content are stored only once, even when they appear under different paths, in different layers or in completely different images
  • transfers only the files that are not already known at the destination; this takes the layering approach one step further: with layers the order matters, whereas with deduplication there is no fixed ordering
  • removes unneeded files and strips down container images: in Docker, each layer can only add or modify files; removed files are merely masked and are still transferred with the previous layers

Data Blocks

An image can be stored and transferred as a set of blocks. The block size can vary between 256 B and 128 KiB. The block size can also be unbounded, in which case each file is a single block.

  • when using a deduplicating file system as the storage backend, it makes sense to use the same block size as the file system – this way, each block can be written as a file into an index directory and again as part of the target file while only having to store the block once on the file system
  • when not using a deduplicating file system as the storage backend, having one block per file makes more sense – each file can be written into an index directory and hard linked at the target path

The content of each block is hashed with a hash function, let's say SHA256.
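The splitting and hashing described above can be sketched as follows. This is a minimal illustration, not part of the proposal: the function names `split_blocks` and `block_hashes` are hypothetical, and a `block_size` of `None` stands in for the "each file is a block" case.

```python
import hashlib
from typing import Iterator, List, Optional


def split_blocks(data: bytes, block_size: Optional[int] = 4096) -> Iterator[bytes]:
    """Yield consecutive blocks of `data`.

    block_size=None is an assumed convention for "one block per file".
    """
    if block_size is None:
        yield data
        return
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def block_hashes(data: bytes, block_size: Optional[int] = 4096) -> List[bytes]:
    """Return the SHA-256 digest of every block, in file order."""
    return [hashlib.sha256(block).digest() for block in split_blocks(data, block_size)]
```

Blocks with identical content produce identical digests regardless of which file or image they come from, which is what makes the deduplication possible.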

Meta blocks

Meta blocks are blocks which contain metadata about directories, files, symlinks, etc. Each meta block consists of the following items:

  • type + mode
  • UID
  • GID
  • path
  • xattrs
  • ACLs
  • either of:
    • link target
    • device numbers
    • array of data block hashes + their offset in data index (see below)

Meta blocks are not subject to a maximum block size.
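One possible in-memory shape for a meta block is sketched below. The proposal does not define an encoding, so the canonical-JSON serialization, the field names and the `MetaBlock` class itself are assumptions made only for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MetaBlock:
    type_mode: int                      # file type + permission bits
    uid: int
    gid: int
    path: str
    xattrs: Dict[str, str] = field(default_factory=dict)
    acls: List[str] = field(default_factory=list)
    # Exactly one of the following three is expected to be set:
    link_target: Optional[str] = None                    # symlinks
    device: Optional[Tuple[int, int]] = None             # (major, minor)
    data_blocks: Optional[List[Tuple[str, int]]] = None  # (hex hash, offset in data index)

    def encode(self) -> bytes:
        # Assumed encoding: JSON with sorted keys, so equal blocks serialize identically.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":")).encode("utf-8")

    def digest(self) -> str:
        """SHA-256 of the encoded block, used as its identity."""
        return hashlib.sha256(self.encode()).hexdigest()
```

Whatever encoding is chosen, it has to be deterministic: two machines describing the same file must arrive at the same meta block hash.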

Block indices

Each image would store two additional index files:

  • a meta file which contains an array of meta block hashes
  • a data file which contains an array of data block hashes
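A minimal sketch of such an index file, assuming it is nothing more than the raw 32-byte SHA-256 digests concatenated in order (no header, no separators). The helper names are hypothetical.

```python
import hashlib
from typing import List

HASH_LEN = hashlib.sha256().digest_size  # 32 bytes


def encode_index(hashes: List[bytes]) -> bytes:
    """Concatenate fixed-length digests; entry i starts at byte i * HASH_LEN."""
    for h in hashes:
        if len(h) != HASH_LEN:
            raise ValueError("every entry must be a raw SHA-256 digest")
    return b"".join(hashes)


def decode_index(raw: bytes) -> List[bytes]:
    """Split an index file back into its digests."""
    if len(raw) % HASH_LEN != 0:
        raise ValueError("index length is not a multiple of the hash length")
    return [raw[i:i + HASH_LEN] for i in range(0, len(raw), HASH_LEN)]
```

With this layout an offset is simply a position in the array, and an index of n blocks occupies 32 · n bytes. That is consistent with the index sizes reported under Savings (for example 1011 data blocks ≈ 31.59 KiB).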

Data blocks

Data blocks contain the actual data of files.

Block storage

Data blocks are stored in the blocks directory. The depth of this directory structure is configurable (even at runtime).

Example: A block has a SHA256 hash of e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.

The depth of the directory structure equals the number of leading hash bytes used as directory names. With depth = 2, the block would be saved with the following path: e3/b0/c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.
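The mapping from hash and depth to a relative path can be sketched as below; `block_path` is a hypothetical helper name, and each directory level consumes one byte (two hex characters) of the hash.

```python
from pathlib import PurePosixPath


def block_path(hex_hash: str, depth: int) -> PurePosixPath:
    """Return the path of a block relative to the blocks directory."""
    if not 0 <= depth < len(hex_hash) // 2:
        raise ValueError("depth must leave at least one byte for the file name")
    prefixes = [hex_hash[2 * i:2 * i + 2] for i in range(depth)]
    return PurePosixPath(*prefixes, hex_hash[2 * depth:])
```

With depth = 0 all blocks sit directly in the blocks directory; every additional level multiplies the number of possible prefix directories by 256.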

Depending on the number of blocks in the system, the depth value can easily be adjusted by creating the appropriate directories, hard-linking the files at their new locations and removing the old links and directories.

  • for client machines with a low number of stored images: depth = 1 (256 prefix folders = # of inodes max)
  • for registry machines with a high number of stored images: depth = 2 (65536 prefix folders = # of inodes max)
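Changing the depth of an existing blocks directory could look roughly like this. It is a sketch under assumptions: the directory contains only block files, everything lives on one file system (hard links cannot cross file systems), and `shard_path` and `reshard` are hypothetical names.

```python
import os
from pathlib import Path


def shard_path(root: Path, hex_hash: str, depth: int) -> Path:
    """Absolute location of a block for the given depth."""
    parts = [hex_hash[2 * i:2 * i + 2] for i in range(depth)]
    return root.joinpath(*parts, hex_hash[2 * depth:])


def reshard(root: Path, new_depth: int) -> int:
    """Re-link every block under `root` to its location for `new_depth`.

    Returns the number of blocks that were moved.
    """
    # Snapshot the file list first, because the tree is modified while iterating.
    old_files = [p for p in root.rglob("*") if p.is_file()]
    moved = 0
    for old in old_files:
        # Joining the path components restores the full hex hash.
        hex_hash = "".join(old.relative_to(root).parts)
        new = shard_path(root, hex_hash, new_depth)
        if new == old:
            continue
        new.parent.mkdir(parents=True, exist_ok=True)
        os.link(old, new)   # the new name appears first ...
        old.unlink()        # ... then the old name is removed
        moved += 1
    # Remove directories that became empty, children before parents.
    for d in sorted((p for p in root.rglob("*") if p.is_dir()), reverse=True):
        if not any(d.iterdir()):
            d.rmdir()
    return moved
```

Because each block is linked at its new path before the old link is removed, a reader that knows both depths can always find the block during the migration.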

Pulling an image

  1. Download the meta index file from the image registry.
  2. Build a list of meta block hashes (and their offsets in the meta file) which are not yet in the system. Group contiguous offsets into ranges.
  3. Send the list of offset values and ranges to the registry and retrieve all missing meta blocks as a continuous stream.
  4. Iterate over all data blocks given in the meta blocks and build a list of data block hashes (and their offsets in the data file) that are not yet in the system. Group contiguous offsets into ranges.
  5. Send the list of offset values and ranges to the registry and retrieve all missing data blocks as a continuous stream.
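Steps 2 and 4 boil down to the same two operations: find the offsets whose hashes are unknown locally, then collapse neighbouring offsets into ranges. A minimal sketch, with hypothetical function names and inclusive `(first, last)` ranges as an assumed representation:

```python
from typing import Iterable, List, Set, Tuple


def missing_offsets(index: List[bytes], known: Set[bytes]) -> List[int]:
    """Offsets in `index` whose hash is not yet in the local system.

    A hash that occurs several times is requested only once
    (an assumption; the proposal does not say whether an index may repeat hashes).
    """
    requested: Set[bytes] = set()
    offsets: List[int] = []
    for offset, block_hash in enumerate(index):
        if block_hash not in known and block_hash not in requested:
            requested.add(block_hash)
            offsets.append(offset)
    return offsets


def group_ranges(offsets: Iterable[int]) -> List[Tuple[int, int]]:
    """Collapse offsets into inclusive (first, last) ranges of consecutive values."""
    ranges: List[Tuple[int, int]] = []
    for off in sorted(set(offsets)):
        if ranges and off == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], off)
        else:
            ranges.append((off, off))
    return ranges
```

For a fresh machine pulling its first image, the request collapses to a single range covering the whole index.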

Every block in the stream is prefixed with a uint32 value declaring its size.
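The length-prefixed framing can be sketched as follows. The proposal does not state a byte order, so big-endian (network order) is assumed here; `write_blocks` and `read_blocks` are hypothetical names.

```python
import struct
from typing import BinaryIO, Iterable, Iterator

SIZE_PREFIX = struct.Struct("!I")  # unsigned 32-bit, network byte order (assumed)


def write_blocks(stream: BinaryIO, blocks: Iterable[bytes]) -> None:
    """Write each block preceded by its size as a uint32."""
    for block in blocks:
        stream.write(SIZE_PREFIX.pack(len(block)))
        stream.write(block)


def read_blocks(stream: BinaryIO) -> Iterator[bytes]:
    """Yield blocks until the stream ends cleanly at a frame boundary."""
    while True:
        header = stream.read(SIZE_PREFIX.size)
        if not header:
            return
        if len(header) != SIZE_PREFIX.size:
            raise EOFError("truncated size prefix")
        (size,) = SIZE_PREFIX.unpack(header)
        block = stream.read(size)
        if len(block) != size:
            raise EOFError("truncated block")
        yield block
```

This sketch expects a buffered file-like object whose `read(n)` returns fewer than `n` bytes only at end of stream; a raw socket would need a loop around `recv`. A uint32 prefix comfortably covers the 128 KiB maximum data block size, but it caps a one-block-per-file transfer at 4 GiB per file.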

  • When using a deduplicating file system, write each block to a file in the local blocks directory named after the block hash. Write the block again to the target file. Maintain a system-wide on-disk hash map of existing block hashes. Even when not using a deduplicating file system, using this strategy can be favorable because in this way, every machine can easily act as a deduplicating image registry.
  • When not using a deduplicating file system, write whole files to the local files directory named after the file hash (= hash over all block hashes of a file). Create a hard link at the target file path. Maintain a system-wide on-disk hash map of existing file hashes. When checking whether a file is already in the system, just hash all block hashes given in the corresponding meta block.
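The second strategy (whole files plus hard links) might be sketched like this. The names `file_hash` and `materialize` and the `fetch_content` callback are hypothetical, and a plain existence check on the files directory stands in for the system-wide on-disk hash map.

```python
import hashlib
import os
from pathlib import Path
from typing import Callable, List


def file_hash(block_hashes: List[bytes]) -> str:
    """Hash over all block hashes of a file, as listed in its meta block."""
    h = hashlib.sha256()
    for block_hash in block_hashes:
        h.update(block_hash)
    return h.hexdigest()


def materialize(files_dir: Path, target: Path, block_hashes: List[bytes],
                fetch_content: Callable[[], bytes]) -> bool:
    """Hard-link `target` to the stored file, downloading it only if unknown.

    Returns True when the content was already present locally.
    """
    stored = files_dir / file_hash(block_hashes)
    present = stored.exists()
    if not present:
        files_dir.mkdir(parents=True, exist_ok=True)
        tmp = stored.with_suffix(".tmp")
        tmp.write_bytes(fetch_content())  # assumed to return the full file content
        os.replace(tmp, stored)           # publish atomically under the hash name
    target.parent.mkdir(parents=True, exist_ok=True)
    os.link(stored, target)
    return present
```

The presence check needs only the block hashes from the meta block, so a file that already exists locally is never requested from the registry.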

Pushing an image

  1. Create the meta and data index files locally.
  2. Upload the meta index file. The registry replies with a list of offsets of meta block hashes which are unknown.
  3. Send all unknown meta blocks to the registry as a continuous stream. The registry replies with a list of offsets of data block hashes which are unknown.
  4. Send all unknown data blocks to the registry as a continuous stream.
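The exchange for one kind of block (meta or data) can be illustrated with an in-memory stand-in for the registry. Everything here is hypothetical: the `Registry` class, its method names, and the hash verification on receipt, which the proposal does not mention but which seems a natural safeguard.

```python
import hashlib
from typing import Dict, Iterable, List


class Registry:
    """Toy registry that keeps blocks in a dict keyed by SHA-256 digest."""

    def __init__(self) -> None:
        self.blocks: Dict[bytes, bytes] = {}

    def unknown_offsets(self, index: List[bytes]) -> List[int]:
        """Reply to an uploaded index: offsets of hashes the registry lacks."""
        return [i for i, h in enumerate(index) if h not in self.blocks]

    def receive(self, index: List[bytes], offsets: List[int],
                blocks: Iterable[bytes]) -> None:
        """Store the streamed blocks, checking each against the index entry."""
        for offset, block in zip(offsets, blocks):
            if hashlib.sha256(block).digest() != index[offset]:
                raise ValueError(f"block at offset {offset} does not match its hash")
            self.blocks[index[offset]] = block


def push(registry: Registry, index: List[bytes], local: Dict[bytes, bytes]) -> int:
    """Upload the index, then send only what the registry asked for."""
    offsets = registry.unknown_offsets(index)
    registry.receive(index, offsets, (local[index[o]] for o in offsets))
    return len(offsets)
```

Pushing the same image twice transfers nothing the second time, and pushing a different image transfers only the blocks the two do not share.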

Every block in the stream is prefixed with a uint32 value declaring its size.

As the registry does not need to have the actual directory and file structure on disk, apply the strategy for deduplicating file systems (see Pulling an image) when saving received images.

Layering

Layering like in Docker can still be supported easily. For appending files, just append to the meta and data index files of previous layers. When modifying (or deleting) files, add the corresponding meta blocks and delete the old ones.
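On the meta index, applying a layer could look roughly like the following. This is only a sketch: `apply_layer` is a hypothetical name, and `path_of` is an assumed lookup from meta block hash to the path stored in that meta block.

```python
from typing import Dict, Iterable, List


def apply_layer(base_index: List[str], path_of: Dict[str, str],
                new_hashes: Iterable[str],
                deleted_paths: Iterable[str] = ()) -> List[str]:
    """Return the meta index of the image after applying one layer.

    Meta blocks for replaced or deleted paths are dropped from the base index,
    and the layer's own meta blocks are appended.
    """
    new_hashes = list(new_hashes)
    replaced = set(deleted_paths) | {path_of[h] for h in new_hashes}
    kept = [h for h in base_index if path_of[h] not in replaced]
    return kept + new_hashes
```

Unlike a Docker layer, the result contains no trace of a deleted file, so its blocks are never transferred to a machine that pulls only the final image.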

Savings

Examples are for the corresponding Docker images. Block size: 4096 B

| Image | size¹ (MiB) | deduped size¹ (MiB) | # data blocks | data index size (KiB) | meta blocks usage (KiB) | # meta blocks² | meta index size (KiB) |
|---|---|---|---|---|---|---|---|
| alpine:3.5 | 3.80 | 3.78 | 1011 | 31.59 | 48.85 | 446 | 13.94 |
| ubuntu1204 | 98.79 | 95.96 | 28369 | 886.53 | 1375.19 | 8377 | 261.78 |
| centos:7 | 190.90 | 180.60 | 49631 | 1550.97 | 2398.76 | 10518 | 328.69 |
| individual | 293.49 | 280.34 | 79011 | 2469.09 | 3822.80 | 19341 | 604.41 |
| combined | 293.49 | 274.03 | 77015 | 2406.72 | 3806.68 | 16481 | 515.03 |
| gain | | 2.3 % | 2.5 % | | 0.4 % | 14.8 % | |

¹ regular files ² equals the number of unique directories, files, etc.
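The gain figures can be reproduced from the "individual" and "combined" rows, assuming gain means the relative reduction from storing the three images together instead of separately. The dictionary keys below are informal labels, not names used by the proposal.

```python
def gain_percent(individual: float, combined: float) -> float:
    """Relative saving of the combined store over the sum of individual stores."""
    return (individual - combined) / individual * 100


# (individual, combined) values copied from the table above
rows = {
    "deduped size (MiB)": (280.34, 274.03),
    "# data blocks": (79011, 77015),
    "meta blocks usage (KiB)": (3822.80, 3806.68),
    "# meta blocks": (19341, 16481),
}

for name, (individual, combined) in rows.items():
    print(f"{name}: {gain_percent(individual, combined):.1f} %")
```

Since each index entry is a fixed-size hash, the two index size columns shrink by the same percentages as the corresponding block counts.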
