alfredkrohmer/README.md

## README.md

      
Proposal: Deduplicated storage and transfer of container images

Docker uses layers to build containers and images on top of other images. This
avoids storing and transferring layers that already exist on the system. Rkt
uses no layers and has to download every new image in full.

This document proposes to use deduplication to store and transfer container
images. This has multiple advantages over layering:

- Save even more space than with layering, because files with the same content
  are stored only once, even when they appear under different paths, in
  different layers, or in completely different images.
- Transfer only the files that are not already known at the destination. This
  takes the layering approach one step further: with layers the order matters,
  whereas with deduplication there is no fixed ordering.
- Remove unneeded files and strip down container images. In Docker, each layer
  can only add or modify files; removed files are merely masked and are still
  transferred with the previous layers.

Data Blocks

An image can be stored and transferred as a set of blocks. The block size can
vary between 256 B and 128 KiB. It can also be unbounded, in which case each
file is a single block.

- When using a deduplicating file system as the storage backend, it makes sense
  to use the same block size as the file system. This way, each block can be
  written as a file into an index directory and again as part of the target
  file, while the file system has to store the block only once.
- When not using a deduplicating file system as the storage backend, having one
  block per file makes more sense: each file can be written into an index
  directory and hard linked at the target path.

The content of each block is hashed with a hash function, let's say SHA256.
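The block splitting and hashing described above can be sketched as follows. This is a minimal illustration, not part of the proposal: the function names `iter_blocks` and `hash_blocks` are hypothetical, and `block_size=None` is used here to model the "one block per file" case.

```python
import hashlib
from typing import Iterator, List, Optional


def iter_blocks(data: bytes, block_size: Optional[int] = 4096) -> Iterator[bytes]:
    """Yield consecutive blocks of `data`.

    block_size=None models the unbounded case: the whole file is one block.
    """
    if block_size is None:
        yield data
        return
    for start in range(0, len(data), block_size):
        yield data[start:start + block_size]


def hash_blocks(data: bytes, block_size: Optional[int] = 4096) -> List[str]:
    """Return the SHA256 hex digest of every block, in file order."""
    return [hashlib.sha256(block).hexdigest()
            for block in iter_blocks(data, block_size)]
```

Two blocks with identical content get the same hash regardless of the file or image they come from, which is what allows them to be stored and transferred only once.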

Meta blocks

Meta blocks are blocks which contain metadata about directories, files,
symlinks, etc. Each meta block consists of the following items:

- type + mode
- UID
- GID
- path
- xattrs
- ACLs
- one of:
  - link target
  - device numbers
  - array of data block hashes + their offsets in the data index (see below)


Meta blocks are not subject to a maximum block size.
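As an illustration, a meta block could be modeled like this. The proposal does not define a serialization format, so the JSON encoding, the field names, and the `MetaBlock` class itself are assumptions made only for this sketch. Entries without a payload (for example directories) are assumed to leave all three optional fields unset.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MetaBlock:
    type: str                      # e.g. "file", "dir", "symlink", "device"
    mode: int
    uid: int
    gid: int
    path: str
    xattrs: Dict[str, str] = field(default_factory=dict)
    acls: List[str] = field(default_factory=list)
    # At most one of the following three payloads is set.
    link_target: Optional[str] = None
    device: Optional[Tuple[int, int]] = None              # (major, minor)
    data_blocks: Optional[List[Tuple[str, int]]] = None   # (hash, index offset)

    def __post_init__(self) -> None:
        payloads = (self.link_target, self.device, self.data_blocks)
        if sum(p is not None for p in payloads) > 1:
            raise ValueError("a meta block carries at most one payload")

    def encode(self) -> bytes:
        # Assumed encoding: canonical JSON with sorted keys, so that equal
        # metadata always yields equal bytes and therefore an equal hash.
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    def digest(self) -> str:
        return hashlib.sha256(self.encode()).hexdigest()
```

Whatever encoding is chosen, it has to be deterministic: otherwise two identical entries would hash differently and would not be deduplicated.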

Block indices

Each image would store two additional index files:

- a meta file, which contains an array of meta block hashes
- a data file, which contains an array of data block hashes
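A minimal sketch of such an index file, assuming it is simply the raw 32-byte SHA256 digests concatenated in order and that an "offset" is the position of an entry in that array. Both points, and the helper names, are assumptions of this example.

```python
from typing import List

HASH_SIZE = 32  # size of a raw SHA256 digest in bytes


def build_index(hashes: List[bytes]) -> bytes:
    """Concatenate raw digests into one index blob."""
    for h in hashes:
        if len(h) != HASH_SIZE:
            raise ValueError("expected a raw 32-byte SHA256 digest")
    return b"".join(hashes)


def hash_at(index: bytes, offset: int) -> bytes:
    """Return the digest stored at entry number `offset`."""
    start = offset * HASH_SIZE
    if offset < 0 or start + HASH_SIZE > len(index):
        raise IndexError("offset outside of index")
    return index[start:start + HASH_SIZE]


def index_entries(index: bytes) -> List[bytes]:
    """Split an index blob back into its digests."""
    return [index[i:i + HASH_SIZE] for i in range(0, len(index), HASH_SIZE)]
```

Under this layout an index with 1011 entries occupies 1011 × 32 B = 32352 B, or about 31.59 KiB.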

Data blocks

Data blocks contain the actual data of files.

Block storage

Data blocks are stored in the blocks directory. The depth of this directory
structure is configurable (even at runtime).

Example: A block has a SHA256 hash of
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.
The depth of the directory structure equals the number of leading hash bytes
used as directory names. With depth = 2, the block would be saved under the
following path: e3/b0/c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.

Depending on the number of blocks in the system, the depth value can easily be
adjusted by creating the appropriate directories, hard linking the files at
their new locations, and removing the old links and directories.

- for client machines with a low number of stored images: depth = 1 (at most
  256 prefix folders, which bounds the number of directory inodes)
- for registry machines with a high number of stored images: depth = 2 (at most
  65536 prefix folders, which bounds the number of directory inodes)
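The mapping from a block hash to its storage path can be sketched like this. The function name `block_path` and the `root` argument are hypothetical; the path layout follows the example above.

```python
from pathlib import PurePosixPath


def block_path(root: str, hex_hash: str, depth: int) -> PurePosixPath:
    """Use the first `depth` bytes of the hash as nested directory names."""
    if not 0 <= depth < len(hex_hash) // 2:
        raise ValueError("depth must leave at least one byte for the file name")
    # One hash byte corresponds to two hex characters.
    prefixes = [hex_hash[2 * i:2 * i + 2] for i in range(depth)]
    return PurePosixPath(root, *prefixes, hex_hash[2 * depth:])
```

Changing the depth only changes the path returned for a given hash, so a migration amounts to hard linking each block at its new path and removing the old link.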

Pulling an image


1. Download the meta index file from the image registry.
2. Build a list of meta block hashes (and their offsets in the meta file) which
   are not yet in the system. Group contiguous offsets into ranges.
3. Send the list of offset values and ranges to the registry and retrieve all
   missing meta blocks as a continuous stream.
4. Iterate over all data blocks given in the meta blocks and build a list of
   data block hashes (and their offsets in the data file) that are not yet in
   the system. Group contiguous offsets into ranges.
5. Send the list of offset values and ranges to the registry and retrieve all
   missing data blocks as a continuous stream.
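Finding the missing offsets and grouping them into ranges, as in the steps above, might look like this. It is a sketch under the assumption that offsets are entry numbers in the index and that a range is an inclusive `(first, last)` pair; the function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def missing_offsets(index_hashes: Sequence[bytes], known: Set[bytes]) -> List[int]:
    """Offsets of all index entries whose hash is not yet in the system."""
    return [i for i, h in enumerate(index_hashes) if h not in known]


def group_ranges(offsets: Iterable[int]) -> List[Tuple[int, int]]:
    """Merge consecutive offsets into inclusive (first, last) ranges."""
    ranges: List[Tuple[int, int]] = []
    for off in sorted(set(offsets)):
        if ranges and off == ranges[-1][1] + 1:
            # Extends the current range by one entry.
            ranges[-1] = (ranges[-1][0], off)
        else:
            ranges.append((off, off))
    return ranges
```

For example, the offsets `[0, 1, 2, 5, 7, 8]` are grouped into `[(0, 2), (5, 5), (7, 8)]`, which keeps the request small when long runs of blocks are missing.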

Every block in the stream is prefixed with a uint32 value declaring its size.
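The framing of the block stream can be sketched as follows. The proposal only states that each block is prefixed with a uint32 size; the big-endian byte order used here is an assumption, as are the function names.

```python
import struct
from typing import BinaryIO, Iterable, Iterator


def write_blocks(stream: BinaryIO, blocks: Iterable[bytes]) -> None:
    """Write each block prefixed with its size as a uint32."""
    for block in blocks:
        stream.write(struct.pack(">I", len(block)))  # assumed: big-endian
        stream.write(block)


def read_blocks(stream: BinaryIO) -> Iterator[bytes]:
    """Yield blocks from a size-prefixed stream until it ends."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of stream
        if len(header) != 4:
            raise EOFError("truncated size prefix")
        (size,) = struct.unpack(">I", header)
        block = stream.read(size)
        if len(block) != size:
            raise EOFError("truncated block")
        yield block
```

Because the receiver knows the order in which it requested the blocks, the stream does not need to carry hashes; the receiver can verify each block by hashing it and comparing against the index.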

- When using a deduplicating file system, write each block to a file in the
  local blocks directory named after the block hash. Write the block again to
  the target file. Maintain a system-wide on-disk hash map of existing block
  hashes. Even when not using a deduplicating file system, this strategy can be
  favorable, because every machine can then easily act as a deduplicating image
  registry.
- When not using a deduplicating file system, write whole files to the local
  files directory named after the file hash (= hash over all block hashes of a
  file). Create a hard link at the target file path. Maintain a system-wide
  on-disk hash map of existing file hashes. To check whether a file is already
  in the system, just hash all block hashes given in the corresponding meta
  block.
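The hard-link strategy could be sketched as below. Whether the file hash is computed over raw digests or hex strings is not specified; this example assumes raw 32-byte digests concatenated in order. The helper names and directory arguments are hypothetical, and the on-disk hash map is replaced by a simple existence check on the store path.

```python
import hashlib
import os
from typing import List


def file_hash(block_hashes: List[bytes]) -> str:
    """Hash over all block hashes of a file (assumed: raw digests, in order)."""
    h = hashlib.sha256()
    for block_hash in block_hashes:
        h.update(block_hash)
    return h.hexdigest()


def store_and_link(files_dir: str, fhash: str, target: str, content: bytes) -> bool:
    """Store `content` once under its file hash and hard link it at `target`.

    Returns True if the content had to be written, False if it already existed.
    """
    os.makedirs(files_dir, exist_ok=True)
    store_path = os.path.join(files_dir, fhash)
    created = not os.path.exists(store_path)
    if created:
        with open(store_path, "wb") as f:
            f.write(content)
    os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
    os.link(store_path, target)  # both paths now share one inode
    return created
```

Since the file hash is derived from the block hashes in the meta block, a client can decide whether it already has a file before downloading any of its data blocks.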

Pushing an image


1. Create the meta and data index files locally.
2. Upload the meta index file. The registry replies with a list of offsets of
   meta block hashes which are unknown.
3. Send all unknown meta blocks to the registry as a continuous stream. The
   registry replies with a list of offsets of data block hashes which are
   unknown.
4. Send all unknown data blocks to the registry as a continuous stream.

Every block in the stream is prefixed with a uint32 value declaring its size.
As the registry does not need to have the actual directory and file structure on
disk, apply the strategy for deduplicating file systems (see Pulling an image)
when saving received images.
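One round of this exchange can be sketched with an in-memory stand-in for the registry. The class and function names are hypothetical, networking is left out, and only a single index/block round is shown, on the assumption that the meta and data rounds work the same way.

```python
import hashlib
from typing import Dict, List

HASH_SIZE = 32  # raw SHA256 digest size in bytes


class InMemoryRegistry:
    """Toy registry that keeps blocks in a dict keyed by their digest."""

    def __init__(self) -> None:
        self.blocks: Dict[bytes, bytes] = {}

    def unknown_offsets(self, index: bytes) -> List[int]:
        """Offsets of index entries whose block the registry does not have."""
        count = len(index) // HASH_SIZE
        return [i for i in range(count)
                if index[i * HASH_SIZE:(i + 1) * HASH_SIZE] not in self.blocks]

    def store(self, block: bytes) -> None:
        self.blocks[hashlib.sha256(block).digest()] = block


def push(registry: InMemoryRegistry, blocks: List[bytes]) -> List[int]:
    """Upload the index, then send only the blocks the registry asked for."""
    index = b"".join(hashlib.sha256(b).digest() for b in blocks)
    wanted = registry.unknown_offsets(index)
    for offset in wanted:
        registry.store(blocks[offset])
    return wanted
```

Pushing a second image that shares blocks with the first one transfers only the blocks the registry has not seen before.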

Layering

Layering like in Docker can still be supported easily. For appending files, just
append to the meta and data index files of previous layers. When modifying
(or deleting) files, add the corresponding meta blocks and delete the old ones.

Savings

Examples are for the corresponding Docker images. Block size: 4096 B


| Image | size¹ (MiB) | deduped size¹ (MiB) | # data blocks | data index size (KiB) | meta blocks usage (KiB) | # meta blocks² | meta index size (KiB) |
|---|---|---|---|---|---|---|---|
| alpine:3.5 | 3.80 | 3.78 | 1011 | 31.59 | 48.85 | 446 | 13.94 |
| ubuntu1204 | 98.79 | 95.96 | 28369 | 886.53 | 1375.19 | 8377 | 261.78 |
| centos:7 | 190.90 | 180.60 | 49631 | 1550.97 | 2398.76 | 10518 | 328.69 |
| individual | 293.49 | 280.34 | 79011 | 2469.09 | 3822.80 | 19341 | 604.41 |
| combined | 293.49 | 274.03 | 77015 | 2406.72 | 3806.68 | 16481 | 515.03 |
| gain | | 2.3 % | 2.5 % | | 0.4 % | 14.8 % | |


¹ regular files
² equals the number of unique directories, files, etc.
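The gain row can be reproduced from the "individual" and "combined" rows. The helper below is only an illustration of that arithmetic; its name is hypothetical.

```python
def gain_percent(individual: float, combined: float) -> float:
    """Relative saving of storing the images combined instead of individually."""
    return round(100 * (individual - combined) / individual, 1)


# Values taken from the "individual" and "combined" rows of the table.
deduped_size_gain = gain_percent(280.34, 274.03)   # deduped size (MiB)
data_blocks_gain = gain_percent(79011, 77015)      # number of data blocks
meta_usage_gain = gain_percent(3822.80, 3806.68)   # meta blocks usage (KiB)
meta_blocks_gain = gain_percent(19341, 16481)      # number of meta blocks
```

The index sizes follow from the block counts in the same way: each index entry is one 32-byte SHA256 digest, so 1011 data blocks give 1011 × 32 B ≈ 31.59 KiB.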