@mapk0y
Last active March 14, 2016 05:10
Notes on Docker storage drivers

Contents

These are my notes from reading the Docker Storage Drivers slides.

aufs

p31

With O_WRONLY or O_RDWR (write access): look it up in the top branch; if it's found there, open it

otherwise, look it up in the other branches; if we find it, copy it to the read-write (top) branch, then open the copy

That "copy-up" operation can take a while if the file is big!

p33

The AUFS mountpoint for a container is /var/lib/docker/aufs/mnt/$CONTAINER_ID/

It is only mounted when the container is running

The AUFS branches (read-only and read-write) are in /var/lib/docker/aufs/diff/$CONTAINER_OR_IMAGE_ID/

All writes go to /var/lib/docker

p34 Under the hood

To see details about an AUFS mount:

look for its internal ID in /proc/mounts

look in /sys/fs/aufs/si_.../br*

each branch (except the two top ones) translates to an image

I never looked at sysfs back when I was using aufs, so this is useful to know.
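The `si=` ID mentioned above appears in the aufs mount options, and the matching `/sys/fs/aufs/si_.../br*` files list the branches. A minimal sketch of extracting that ID — the sample `/proc/mounts` line and its `si=` value are hypothetical; on a real aufs host you would grep the actual file:

```shell
# Hypothetical /proc/mounts line for an aufs container mount.
line='none /var/lib/docker/aufs/mnt/abc123 aufs rw,relatime,si=7a9c3e5f2b1d 0 0'

# Pull the si= value out of the mount options (4th field).
si=$(printf '%s\n' "$line" | awk '{print $4}' | tr ',' '\n' | sed -n 's/^si=//p')
echo "si=$si"

# On a real host, the branches could then be listed with:
#   cat /sys/fs/aufs/si_$si/br[0-9]*
```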

p36 Performance, tuning

Read/write access has native speeds

But the initial open() is expensive in two scenarios: when writing to big files (log files, databases...), and with many layers + many directories in PATH (dynamic loading, anyone?)

When starting the same container 1000x, the data is loaded only once from disk, and cached only once in memory (but dentries will be duplicated)

So aufs is only a performance problem at open() time. The caching behavior is fine; I'll keep the dentry duplication in the back of my mind.

device mapper

p40

The mountpoint for a container is /var/lib/docker/devicemapper/mnt/$CONTAINER_ID/

It is only mounted when the container is running

The data is stored in two files, "data" and "metadata" (More on this later)

Since we are working on the block level, there is not much visibility on the diffs between images and containers

p41 Under the hood

docker info will tell you about the state of the pool (used/available space)

List devices with dmsetup ls

Device names are prefixed with docker-MAJ:MIN-INO

MAJ, MIN, and INO are derived from the block major, block minor, and inode number where the Docker data is located (to avoid conflicts when running multiple Docker instances, e.g. with Docker-in-Docker). Get more info about them with dmsetup info and dmsetup status (you shouldn't need this, unless the system is badly borked).
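A small sketch of taking that naming scheme apart — the sample device name is hypothetical; on a real host you would feed it the output of `dmsetup ls`:

```shell
# Hypothetical device-mapper name of the form docker-MAJ:MIN-INO[-suffix].
name='docker-8:1-1050934-pool'

# Split out the three components with sed.
maj=$(printf '%s\n' "$name" | sed -n 's/^docker-\([0-9]*\):.*/\1/p')
min=$(printf '%s\n' "$name" | sed -n 's/^docker-[0-9]*:\([0-9]*\)-.*/\1/p')
ino=$(printf '%s\n' "$name" | sed -n 's/^docker-[0-9]*:[0-9]*-\([0-9]*\).*/\1/p')
echo "major=$maj minor=$min inode=$ino"
```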

Snapshots have an internal numeric ID

/var/lib/docker/devicemapper/metadata/$CONTAINER_OR_IMAGE_ID is a small JSON file tracking the snapshot ID and its size

To check later.

p42

When there are no more blocks in the pool, attempts to write will stall until the pool is increased (or the write operation aborted)

Is this why Docker is said to get stuck on CentOS and the like? (I've never hit it with aufs/overlay.)
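To watch for pool exhaustion before writes stall, the `<used>/<total>` block counts in the thin-pool status line can be turned into a usage percentage. A sketch, assuming a hypothetical `dmsetup status` line for the pool (metadata usage in field 5, data usage in field 6):

```shell
# Hypothetical `dmsetup status docker-8:1-1050934-pool` output line.
status='0 209715200 thin-pool 91 179/524288 819200/1638400 - rw discard_passdown queue_if_no_space'

# Field 6 is "<used data blocks>/<total data blocks>".
used=$(printf '%s\n' "$status"  | awk '{split($6, a, "/"); print a[1]}')
total=$(printf '%s\n' "$status" | awk '{split($6, a, "/"); print a[2]}')
echo "data pool: $((used * 100 / total))% used"
```

docker info reports the same numbers in human-readable form (Data Space Used / Total).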

p43

and sparse file performance isn't great anyway

Hmm. I thought dm was somewhat better, but given the capacity issue and the fact that it's a sparse file, that's probably where the bottlenecks come from.

p44 tuning

docker -d --storage-opt dm.datadev=/dev/sdb1 --storage-opt dm.metadatadev=/dev/sdc1

Not an option for me.

btrfs

Note before reading: CoreOS is apparently dropping it too, so I don't expect much performance-wise. Skimming this part.

p48

BTRFS integrates the snapshot and block pool management features at the filesystem level, instead of the block device level

An advantage over dm?

p49

Data is not written directly, it goes to the journal first (in some circumstances, this will affect performance)

The performance will be half of the "native" performance

Didn't know that. Keeping it in the back of my mind.

p50

# btrfs filesys balance start -dusage=1 /var/lib/docker

About chunks, and how to work around the "No space left on device" error.
I don't know btrfs well, so I don't fully follow this.

p51 Performance, tuning

Not much to tune.

# btrfs filesys show

Overlay(fs)

Note before reading: probably the main contender. Be careful to distinguish overlay (3.18+) from overlayfs (the out-of-tree patches up to 3.17 carried by Ubuntu and others); only the former is usable.

p56 Under the hood

Images and containers are materialized under /var/lib/docker/overlay/$ID_OF_CONTAINER_OR_IMAGE

Images just have a root subdirectory (containing the root FS)

Containers have:

lower-id → file containing the ID of the image

merged/ → mount point for the container (when running)

upper/ → read-write layer for the container

work/ → temporary space used for atomic copy-up
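The directory layout above maps directly onto the options of an overlay mount. A minimal sketch using a hypothetical layout under a temp directory — it only prints the mount command rather than running it, since mount(8) needs root:

```shell
# Hypothetical container layout mirroring /var/lib/docker/overlay/$ID/.
d=$(mktemp -d)
mkdir -p "$d/image/root" "$d/container/upper" "$d/container/work" "$d/container/merged"

# lowerdir = the image's root FS, upperdir = read-write layer,
# workdir = scratch space for copy-up, merged/ = the mount point.
opts="lowerdir=$d/image/root,upperdir=$d/container/upper,workdir=$d/container/work"
cmd="mount -t overlay overlay -o $opts $d/container/merged"
echo "$cmd"
```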

p57 Performance, tuning

identical files are hardlinked between images

Not much to tune at this point

Performance should be slightly better than AUFS: no stat() explosion, good memory use, but still slow copy-up (nobody's perfect)

Hmm. Is performance really better than aufs?
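The "identical files are hardlinked between images" claim can be verified by comparing inode numbers, the same way you could compare files between two image directories under /var/lib/docker/overlay. A self-contained demo in a temp directory:

```shell
# Two names, one inode: that's a hardlink, so the data exists once on disk.
d=$(mktemp -d)
echo 'same content' > "$d/a"
ln "$d/a" "$d/b"

ino_a=$(stat -c %i "$d/a")
ino_b=$(stat -c %i "$d/b")
[ "$ino_a" = "$ino_b" ] && echo "hardlinked (inode $ino_a)"
```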

vfs

p59

No copy on write. Docker does a full copy each time!

Space inefficient, slow

Not an option. Does anyone actually use this?

p60

Might be useful for production setups

I sort of see the point, but then I feel like you wouldn't use Docker at all.

Extras

On discard and TRIM.

p67 Trim

A very general overview of TRIM.

Also meaningful on copy-on-write storage (if/when every snapshot has trimmed a block, it can be freed)

With CoW, blocks are not rewritten in place, so TRIM seems easy to put to use.

p68 discard

discard = a filesystem (mount-time) option

First time I've heard of the fstrim command.

p69 The discard quandary

discard works on Device Mapper + loopback devices

... but is particularly slow on loopback devices (the loopback file needs to be "re-sparsified" after container or image deletion, and this is a slow operation)

You can turn it on or off depending on your preference

Check later whether this is enabled with dm.
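The "re-sparsified" point is easy to see with any sparse file: the apparent size and the actually allocated space differ, and discard is what punches the holes back in after deletions. A self-contained illustration:

```shell
# A sparse file claims a large size while allocating (almost) no blocks.
f=$(mktemp)
truncate -s 100M "$f"

apparent=$(stat -c %s "$f")              # logical size in bytes
actual=$(( $(stat -c %b "$f") * 512 ))   # allocated 512-byte blocks
echo "apparent=$apparent actual=$actual"
```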

EOF
