Proposal: Deduplicated storage and transfer of container images
Docker uses layers to build containers and images on top of other images. This saves storage and transfer of layers already existing on the system. Rkt uses no layers and needs to fully download any new image.
This document proposes to use deduplication to store and transfer container images. This has multiple advantages over layering:
- save even more space than with layering because files with same content will only be saved once – also when the files appear under different paths, in different layers or even in completely different images
- only transfer the files that are not already known at the destination. This takes the layering approach one step further: with layers, the order matters, whereas with deduplication there is no fixed ordering
- remove unneeded files and strip down container images: in Docker, each layer can only add or modify files; removed files are merely masked and are still transferred with the previous layers
Data Blocks
An image can be stored and transferred as a set of blocks. The block size can vary between 256 B and 128 KiB. The block size can also be infinite, in which case each file is a single block.
- when using a deduplicating file system as the storage backend, it makes sense to use the same block size as the file system – this way, each block can be written as a file into an index directory and again as part of the target file while only having to store the block once on the file system
- when not using a deduplicating file system as the storage backend, having one block per file makes more sense – each file can be written into an index directory and hard linked at the target path
The content of each block is hashed with a hash function, let's say SHA256.
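As a rough illustration of the block model above, the following sketch splits file content into fixed-size blocks and hashes each one with SHA256. The function names and the use of `None` for the "infinite" block size are assumptions made for this example, not part of the proposal.

```python
import hashlib
from typing import Iterator, List, Optional, Tuple


def iter_blocks(data: bytes, block_size: Optional[int]) -> Iterator[bytes]:
    """Yield the blocks of one file.

    block_size=None models the "infinite" block size: the whole file is
    one block (assumption: an empty file then yields one empty block).
    """
    if block_size is None:
        yield data
        return
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def hash_blocks(data: bytes, block_size: Optional[int] = 4096) -> List[Tuple[str, bytes]]:
    """Return (SHA256 hex digest, block content) pairs for one file."""
    return [(hashlib.sha256(b).hexdigest(), b) for b in iter_blocks(data, block_size)]


if __name__ == "__main__":
    blocks = hash_blocks(b"a" * 10000, block_size=4096)
    # The first two blocks have identical content, so they share one hash
    # and would only need to be stored once.
    print(len(blocks), len({h for h, _ in blocks}))
```

Identical blocks produce identical hashes regardless of the file or image they come from, which is what makes the deduplication possible.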
Meta blocks
Meta blocks are blocks which contain metadata about directories, files, symlinks, etc. Each meta block consists of the following items:
- type + mode
- UID
- GID
- path
- xattrs
- ACLs
- either of:
  - link target
  - device numbers
  - array of data block hashes + their offset in the data index (see below)
Meta blocks are not subject to a maximum block size.
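The item list above could be modelled roughly as follows. The proposal does not define a wire encoding for meta blocks, so the canonical JSON serialization used here is purely an assumption for illustration, as are the class and field names.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MetaBlock:
    type_mode: int                      # file type + permission bits (as in st_mode)
    uid: int
    gid: int
    path: str
    xattrs: Dict[str, str] = field(default_factory=dict)
    acls: List[str] = field(default_factory=list)
    # Type-dependent payload ("either of" in the item list):
    link_target: Optional[str] = None                     # symlinks
    device: Optional[Tuple[int, int]] = None              # (major, minor)
    data_blocks: Optional[List[Tuple[str, int]]] = None   # (hash, offset in data index)

    def encode(self) -> bytes:
        # Hypothetical encoding: JSON with sorted keys, so that equal
        # metadata always serializes to the same bytes.
        return json.dumps(asdict(self), sort_keys=True, separators=(",", ":")).encode("utf-8")

    def digest(self) -> str:
        """SHA256 over the encoded meta block."""
        return hashlib.sha256(self.encode()).hexdigest()


if __name__ == "__main__":
    empty = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    block = MetaBlock(type_mode=0o100644, uid=0, gid=0, path="/etc/hostname",
                      data_blocks=[(empty, 0)])
    print(block.digest())
```

Because the path is part of the meta block, two files with the same content under different paths share their data blocks but not their meta blocks.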
Block indices
Each image would store two additional index files:
- a meta file which contains an array of meta block hashes
- a data file which contains an array of data block hashes
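A minimal sketch of such an index file, assuming it is simply the raw 32-byte SHA256 digests written back to back with no header. This layout is an assumption, although it is consistent with the index sizes in the Savings table (for example 1011 hashes × 32 B ≈ 31.59 KiB).

```python
from typing import List

HASH_SIZE = 32  # bytes in a raw SHA256 digest


def encode_index(hex_hashes: List[str]) -> bytes:
    """Concatenate raw digests; entry i starts at byte i * HASH_SIZE."""
    return b"".join(bytes.fromhex(h) for h in hex_hashes)


def decode_index(raw: bytes) -> List[str]:
    """Split an index file back into hex digests, in order."""
    if len(raw) % HASH_SIZE != 0:
        raise ValueError("index length is not a multiple of the hash size")
    return [raw[i:i + HASH_SIZE].hex() for i in range(0, len(raw), HASH_SIZE)]


if __name__ == "__main__":
    h = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    raw = encode_index([h, h])
    print(len(raw), decode_index(raw) == [h, h])
```

With fixed-size entries, the position of a hash in the array doubles as its offset, which is what the pull and push steps refer to.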
Data blocks
Data blocks contain the actual data of files.
Block storage
Data blocks are stored in the blocks directory. The depth of this directory structure is configurable (even at runtime).
Example: A block has a SHA256 hash of e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.
The depth of the directory structure equals the number of hash bytes used as directory names. With depth = 2, the block would be saved with the following path: e3/b0/c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.
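The mapping from hash and depth to a path can be sketched as below. The function name and the `root` parameter are made up for this example.

```python
from pathlib import PurePosixPath


def block_path(hex_hash: str, depth: int, root: str = "blocks") -> PurePosixPath:
    """Use the first `depth` bytes (two hex digits each) of the hash as
    nested directory names and the remaining digits as the file name."""
    if not 0 <= depth < len(hex_hash) // 2:
        raise ValueError("depth must leave at least one byte for the file name")
    prefixes = [hex_hash[2 * i:2 * i + 2] for i in range(depth)]
    return PurePosixPath(root, *prefixes, hex_hash[2 * depth:])


if __name__ == "__main__":
    h = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    for depth in (0, 1, 2):
        print(block_path(h, depth))
```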
Depending on the number of blocks in the system, the depth value can easily be adjusted by creating the appropriate directories, hard linking files at the new locations, and removing the old links and directories.
- for client machines with a low number of stored images: depth = 1 (256 prefix folders = # of inodes max)
- for registry machines with a high number of stored images: depth = 2 (65536 prefix folders = # of inodes max)
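Changing the depth of an existing block store, as described above, might look like the following sketch. It assumes the store contains nothing but block files laid out by hash prefix, and that the store lives on one file system so hard links work. `reshard` is a hypothetical helper name.

```python
import os
from pathlib import Path


def reshard(root: Path, new_depth: int) -> None:
    """Move every block under `root` to the layout for `new_depth`."""
    # Snapshot the file list first, since the tree is modified while moving.
    old_files = [p for p in root.rglob("*") if p.is_file()]
    for old in old_files:
        # Prefix directories plus file name spell out the full hash.
        hex_hash = "".join(old.relative_to(root).parts)
        prefixes = [hex_hash[2 * i:2 * i + 2] for i in range(new_depth)]
        new = root.joinpath(*prefixes, hex_hash[2 * new_depth:])
        if new == old:
            continue
        new.parent.mkdir(parents=True, exist_ok=True)
        os.link(old, new)   # hard link at the new location: no data is copied
        old.unlink()        # drop the old link
    # Remove directories that became empty, deepest first.
    dirs = sorted((p for p in root.rglob("*") if p.is_dir()),
                  key=lambda p: len(p.parts), reverse=True)
    for d in dirs:
        try:
            d.rmdir()
        except OSError:
            pass  # still contains blocks of the new layout
```

Since each block is linked at its new path before the old link is removed, the block content stays reachable throughout the migration.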
Pulling an image
- Download the meta index file from the image registry.
- Build a list of meta block hashes (and their offsets in the meta file) which are not yet in the system. Group contiguous offsets into ranges.
- Send the list of offset values and ranges to the registry and retrieve all missing meta blocks as a continuous stream.
- Iterate over all data blocks given in the meta blocks and build a list of data block hashes (and their offsets in the data file) that are not yet in the system. Group contiguous offsets into ranges.
- Send the list of offset values and ranges to the registry and retrieve all missing data blocks as a continuous stream.
Every block in the stream is prefixed with a uint32 value declaring its size.
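The "build a list of missing hashes and group offsets into ranges" part of the steps above can be sketched as follows. It assumes an offset is the entry number of a hash within the index array (not a byte offset) and that the set of locally known hashes is available in memory. Both function names are illustrative.

```python
from typing import Iterable, List, Set, Tuple


def missing_offsets(index: List[str], known: Set[str]) -> List[int]:
    """Offsets of index entries whose hash is not yet in the system.
    A hash that occurs several times in the index is requested only once."""
    seen: Set[str] = set()
    offsets: List[int] = []
    for offset, block_hash in enumerate(index):
        if block_hash not in known and block_hash not in seen:
            seen.add(block_hash)
            offsets.append(offset)
    return offsets


def group_ranges(offsets: Iterable[int]) -> List[Tuple[int, int]]:
    """Merge consecutive offsets into inclusive (first, last) ranges."""
    ranges: List[Tuple[int, int]] = []
    for off in sorted(set(offsets)):
        if ranges and off == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], off)   # extend the current range
        else:
            ranges.append((off, off))           # start a new range
    return ranges


if __name__ == "__main__":
    index = ["a", "b", "c", "b", "d", "e"]
    print(group_ranges(missing_offsets(index, known={"a"})))
```

Sending ranges instead of individual offsets keeps the request small when a large run of consecutive blocks is missing, for example on a first pull.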
- When using a deduplicating file system, write each block to a file in the local blocks directory named after the block hash. Write the block again to the target file. Maintain a system-wide on-disk hash map of existing block hashes. Even when not using a deduplicating file system, this strategy can be favorable because every machine can then easily act as a deduplicating image registry.
- When not using a deduplicating file system, write whole files to the local files directory, named after the file hash (= hash over all block hashes of a file). Create a hard link at the target file path. Maintain a system-wide on-disk hash map of existing file hashes. To check whether a file is already in the system, just hash all block hashes given in the corresponding meta block.
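For the variant without a deduplicating file system, a sketch of the file hash and the hard-link step is shown below. It assumes the file hash is SHA256 over the concatenated raw block digests (the proposal only says "hash over all block hashes"), and it replaces the on-disk hash map with a plain existence check in the files directory. `install_file` is a hypothetical helper.

```python
import hashlib
import os
from pathlib import Path
from typing import Iterable, List


def file_hash(block_hashes: List[str]) -> str:
    """Hash over all block hashes of a file, in order."""
    h = hashlib.sha256()
    for block_hash in block_hashes:
        h.update(bytes.fromhex(block_hash))
    return h.hexdigest()


def install_file(files_dir: Path, target: Path,
                 block_hashes: List[str], blocks: Iterable[bytes]) -> bool:
    """Store the file once under its file hash and hard link it at `target`.
    Returns True if the content was already present (blocks are then unused)."""
    stored = files_dir / file_hash(block_hashes)
    existed = stored.exists()
    if not existed:
        files_dir.mkdir(parents=True, exist_ok=True)
        tmp = stored.with_suffix(".tmp")
        with open(tmp, "wb") as f:
            for block in blocks:
                f.write(block)
        os.replace(tmp, stored)   # publish the file only once it is complete
    target.parent.mkdir(parents=True, exist_ok=True)
    os.link(stored, target)
    return existed
```

The file hash can be computed from the meta block alone, so a client can decide whether a file is already present before requesting any of its data blocks.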
Pushing an image
- Create the meta and data index files locally.
- Upload the meta index file. The registry replies with a list of offsets of meta block hashes which are unknown.
- Send all unknown meta blocks to the registry as a continuous stream. The registry replies with a list of offsets of data block hashes which are unknown.
- Send all unknown data blocks to the registry as a continuous stream.
Every block in the stream is prefixed with a uint32 value declaring its size.
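The size-prefixed stream could be written and read as in the sketch below. The proposal does not specify a byte order for the uint32 prefix; big-endian (network order) is assumed here, and the function names are illustrative.

```python
import io
import struct
from typing import BinaryIO, Iterable, Iterator


def write_blocks(stream: BinaryIO, blocks: Iterable[bytes]) -> None:
    """Write each block preceded by its length as a big-endian uint32."""
    for block in blocks:
        stream.write(struct.pack(">I", len(block)))
        stream.write(block)


def read_blocks(stream: BinaryIO) -> Iterator[bytes]:
    """Yield blocks until the stream ends cleanly at a block boundary.
    Assumes read(n) returns n bytes unless the stream is exhausted, as
    io.BytesIO and buffered files do; a raw socket would need a read loop."""
    while True:
        header = stream.read(4)
        if not header:
            return
        if len(header) != 4:
            raise EOFError("truncated size prefix")
        (size,) = struct.unpack(">I", header)
        block = stream.read(size)
        if len(block) != size:
            raise EOFError("truncated block")
        yield block


if __name__ == "__main__":
    buf = io.BytesIO()
    write_blocks(buf, [b"meta block", b"", b"x" * 4096])
    buf.seek(0)
    print([len(b) for b in read_blocks(buf)])
```

A uint32 prefix allows blocks of up to 4 GiB minus one byte, which comfortably covers the 128 KiB maximum data block size but would limit whole-file blocks.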
As the registry does not need to have the actual directory and file structure on disk, apply the strategy for deduplicating file systems (see Pulling an image) when saving received images.
Layering
Layering like in Docker can still be supported easily. For appending files, just append to the meta and data index files of previous layers. When modifying (or deleting) files, add the corresponding meta blocks and delete the old ones.
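Treating an index as an ordered list of hashes, deriving a layer's index from the previous one might look like this sketch. The function name and its arguments are assumptions for illustration only.

```python
from typing import Iterable, List


def apply_layer(base_index: List[str],
                added: Iterable[str],
                removed: Iterable[str] = ()) -> List[str]:
    """Return the index of a new layer built on top of `base_index`.

    added:   hashes of new meta blocks (new or modified files)
    removed: hashes of meta blocks that were deleted or replaced
    """
    removed_set = set(removed)
    kept = [h for h in base_index if h not in removed_set]
    return kept + list(added)


if __name__ == "__main__":
    base = ["m1", "m2", "m3"]
    # Modify the file described by m2 (new meta block m2b) and add m4.
    print(apply_layer(base, added=["m2b", "m4"], removed=["m2"]))
```

A pure append leaves the existing entries untouched, so the previous layer's index remains a prefix of the new one.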
Savings
Examples are for the corresponding Docker images. Block size: 4096 B
Image | size¹ (MiB) | deduped size¹ (MiB) | # data blocks | data index size (KiB) | meta blocks usage (KiB) | # meta blocks² | meta index size (KiB) |
---|---|---|---|---|---|---|---|
alpine:3.5 | 3.80 | 3.78 | 1011 | 31.59 | 48.85 | 446 | 13.94 |
ubuntu1204 | 98.79 | 95.96 | 28369 | 886.53 | 1375.19 | 8377 | 261.78 |
centos:7 | 190.90 | 180.60 | 49631 | 1550.97 | 2398.76 | 10518 | 328.69 |
individual | 293.49 | 280.34 | 79011 | 2469.09 | 3822.80 | 19341 | 604.41 |
combined | 293.49 | 274.03 | 77015 | 2406.72 | 3806.68 | 16481 | 515.03 |
gain | | 2.3 % | 2.5 % | | 0.4 % | 14.8 % | |
¹ regular files
² equals the number of unique directories, files, etc.
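The index sizes and the gain row can be reproduced from the other columns with the short calculation below. It assumes 32 bytes per index entry (one raw SHA256 digest) and that the gain compares the "combined" row against the "individual" row; both assumptions match the published numbers.

```python
HASH_SIZE = 32  # bytes per index entry (raw SHA256 digest)


def index_size_kib(num_blocks: int) -> float:
    """Size of an index file holding `num_blocks` hashes, in KiB."""
    return num_blocks * HASH_SIZE / 1024


def gain_percent(individual: float, combined: float) -> float:
    """Relative saving of storing the images together, in percent."""
    return (individual - combined) / individual * 100


if __name__ == "__main__":
    print(f"alpine data index:  {index_size_kib(1011):.2f} KiB")
    print(f"alpine meta index:  {index_size_kib(446):.2f} KiB")
    print(f"deduped size gain:  {gain_percent(280.34, 274.03):.1f} %")
    print(f"data block gain:    {gain_percent(79011, 77015):.1f} %")
    print(f"meta usage gain:    {gain_percent(3822.80, 3806.68):.1f} %")
    print(f"meta block gain:    {gain_percent(19341, 16481):.1f} %")
```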