Created May 6, 2014 03:08
This is the updated document for Writeboost. Please leave a comment if you notice grammatical mistakes or know a more fluent expression.
Writeboost
==========
Writeboost target provides block-level log-structured caching.
Accepted bios are put into a big "log" and the log is written to the cache
device sequentially.
Mechanism
=========
Writeboost caches only writes; reads are not cached.
However, this doesn't necessarily mean that it fails to improve the read
performance of the whole system.

For most storage systems, writes are more burdensome than reads
(cf. RAID penalty).
If the write load on the backing device gets lower, read performance improves
because the backing device can focus on processing reads.

There are two mechanisms that reduce the write load on the backing device:
1. Writeboost can cut the writes to the backing device by processing them on the
   cache device.
2. In Writeboost's writeback, the data are sorted by destination address and
   then submitted asynchronously. Therefore, the average write load on the
   backing device is lower than it would be without Writeboost.

Additionally, the cached write data, which are typically what is written back
from the page cache, are likely to be hit again on read. This also improves
read performance.

For these reasons, Writeboost can improve not only writes but also reads.

In real-world operation, the lifetime of the NAND SSD used as the cache device
is as great a concern as performance. Caching on read
1. shortens the lifetime of the cache device, and
2. sometimes has no effect because the data are duplicated in the page cache.

For the sake of both the performance and the lifetime of the cache device,
Writeboost doesn't stage on read. Its value therefore lies in operating as an
optimized, pure write cache.
Basic Mechanism
---------------
Writeboost controls three different layers: the RAM buffer (rambuf), the cache
device (cache_dev, e.g. SSD) and the backing device (backing_dev, e.g. HDD).
Write data are first stored in the RAM buffer. When the buffer is full,
Writeboost adds a metadata block to the RAM buffer to create a "log".
Afterward, the log is written to the cache device sequentially in the
background, and thereafter written back to the backing device, also in the
background.
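
The buffering step described above can be pictured with a small Python sketch.
It is only an illustration, not the driver's code: the names `RamBuffer`, `Log`
and `capacity` are hypothetical, and the buffer here counts blocks rather than
sectors to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Log:
    log_id: int
    metadata: list  # destination sector of each block, in arrival order
    blocks: list    # the payloads, in the same order as the metadata


@dataclass
class RamBuffer:
    capacity: int  # number of blocks one log can hold (assumed unit)
    pending: list = field(default_factory=list)
    next_id: int = 1

    def write(self, sector, data):
        """Buffer one write; return a sealed Log once the buffer is full."""
        self.pending.append((sector, data))
        if len(self.pending) < self.capacity:
            return None
        # The buffer is full: build the metadata and seal the log.
        log = Log(self.next_id,
                  [s for s, _ in self.pending],
                  [d for _, d in self.pending])
        self.pending = []
        self.next_id += 1
        return log


buf = RamBuffer(capacity=3)
writes = [(80, b"a"), (8, b"b"), (40, b"c"), (16, b"d")]
results = [buf.write(sector, data) for sector, data in writes]
sealed = results[2]  # the third write fills the buffer and yields a log
```

In this sketch the fourth write starts filling a fresh buffer, while the sealed
log would be handed over to be written to the cache device.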
Persistent Logging
------------------
Writeboost can extend its functionality by specifying "type" at initialization.
Type 0 provides only the basic mechanism, and type 1 additionally provides
"Persistent Logging" (or plog).
Plog aims to reduce the penalty of the FLUSH operation by storing the write
data in both the RAM buffer and a persistent device (plog_dev).
This extended functionality is similar to full-data journaling in filesystems.
As of now, only a block device is supported as plog_dev, but other media will
be supported in the future.
Log Replay
----------
On reboot, Writeboost replays the logs written on the cache device to restore
the in-memory metadata.
Logs are chronologically ordered, so it is theoretically possible to restore
the state of the storage system at a moment of your choice.
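
A minimal Python sketch of the replay idea follows. It assumes, purely for
illustration, that each log can be reduced to a pair of a log ID and the list
of backing-device sectors it caches. The function `replay` and its `upto_id`
parameter are hypothetical names, not part of Writeboost.

```python
def replay(logs, upto_id=None):
    """Rebuild a {backing sector: (log id, slot)} map by replaying logs in ID order."""
    table = {}
    for log_id, sectors in sorted(logs):
        if upto_id is not None and log_id > upto_id:
            break  # stop at the chosen moment
        for slot, sector in enumerate(sectors):
            table[sector] = (log_id, slot)  # a newer log overrides an older one
    return table


# Logs may be discovered in any order; replay sorts them chronologically.
logs = [(2, [8, 24]), (1, [8, 16]), (3, [16])]
latest = replay(logs)
as_of_2 = replay(logs, upto_id=2)
```

Here `latest` maps sector 16 to log 3, while `as_of_2` still maps it to log 1,
which is the sense in which an earlier state can be restored.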
Processings
===========
Writeboost consists of one foreground process and six background processes.

Foreground Processing
---------------------
The driver accepts a bio and handles it according to the result of the cache
lookup.
All write data are stored in the RAM buffer. Later, when the buffer is full, a
log is created and queued as a flush job.
Background Processings
----------------------
(1) Flusher Daemon
This daemon dequeues a flush job from the queue and writes the log to the cache
device.

(2) Migrate Daemon
This daemon writes back the dirty data on the cache device to the backing
device. Writeboost calls writeback "Migration".
If `allow_migrate` is false, it never starts writeback unless the situation is
imminent. Here, an imminent situation is one in which there is no room to
append any more logs without writing back some segments to clean them up.
There are two major optimizations in writeback:
1. Multiple segments are written back at a time. `nr_max_batched_migration` is
   the maximum number of segments to write back at a time.
2. The blocks to write back are sorted by the destination address on the backing
   device.
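
The two writeback optimizations can be sketched in Python as below. This is a
simplified model under stated assumptions: a segment is represented as a list
of (destination sector, data) pairs, the oldest segments come first, and
`plan_writeback` is a hypothetical helper name.

```python
def plan_writeback(segments, nr_max_batched_migration):
    """Take the oldest segments up to the batch limit and sort their blocks."""
    batch = segments[:nr_max_batched_migration]
    blocks = [block for segment in batch for block in segment]
    # Sort by destination sector so the backing device sees ascending addresses.
    return sorted(blocks, key=lambda block: block[0])


segments = [
    [(800, "a"), (16, "b")],  # oldest segment
    [(400, "c"), (8, "d")],
    [(0, "e")],               # beyond the batch limit in this example
]
order = plan_writeback(segments, nr_max_batched_migration=2)
```

With a batch limit of 2, the blocks of the first two segments are submitted in
the order 8, 16, 400, 800, and the third segment waits for the next round.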
(3) Migration Modulator
Writeback should be suppressed when the backing device is under high load.
This daemon monitors the load of the backing device and stops writeback under
high load by turning `allow_migrate` to false.
This daemon is enabled only when `enable_migration_modulator` is true, and the
threshold for turning the switch on or off is determined by `migrate_threshold`.
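
The interplay between the modulator and the writeback decision might look
roughly like the Python sketch below. How the load is actually measured is not
described here, so the sketch assumes a utilization percentage and a strict
"less than" comparison. The function names and `cache_is_full` are
hypothetical.

```python
def modulate(load_percent, migrate_threshold=70,
             enable_migration_modulator=True, allow_migrate=False):
    """Return the new value of allow_migrate for the observed load."""
    if not enable_migration_modulator:
        return allow_migrate  # modulator disabled: keep the current setting
    # Assumed rule: allow writeback only while the load is below the threshold.
    return load_percent < migrate_threshold


def should_writeback(allow_migrate, cache_is_full):
    """Writeback runs if allowed, or if no room is left for new logs."""
    return allow_migrate or cache_is_full


busy = modulate(90)
idle = modulate(30)
```

Under these assumptions, a busy backing device pauses writeback unless the
cache device has run out of room, in which case writeback proceeds anyway.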
(4) Superblock Recorder
This daemon periodically records on the superblock the ID of the last segment
that was written back. The interval is specified by `update_record_interval`.
Doing so makes it possible to skip unnecessary restoring in log replay and thus
shortens the reboot time.

(5) Sync Daemon
The data in the RAM buffer are lost in case of power failure.
Additionally, the data in the RAM cache of the cache device (an SSD typically
has such a small cache) are also lost in such a failure.
This daemon flushes them all periodically. The interval is specified by
`sync_interval`.

(6) Barrier Deadline (enabled in type 0 only)
Without Persistent Logging, the flush operation carries a high penalty. It
sometimes results in making a log that is not full.
To mitigate this penalty, Writeboost has an optimization that delays the ack to
such an operation by at most `barrier_deadline_ms` (ms).
By doing this, the log can be filled up when multiple processes share the
storage and submit writes in the meantime.
Target Interfaces
=================
Use the dmsetup command for operations.

Initialization (Constructor)
----------------------------
<type>
<essential args>
<#optional args> <optional args>
<#tunable args> <tunable args>
- For <type>, see `Mechanism`
- <essential args> differ by <type>
- <optional args> and <tunable args> are unordered lists of key-value pairs.
type 0 (applied to all <type>):
<essential args>
  backing_dev : A block device having the original data (e.g. HDD)
  cache_dev   : A block device having the caches (e.g. SSD)
<optional args>
  segment_size_order : Determines the size of a RAM buffer.
                       The RAM buffer size will be 1 << n (sectors).
                       4 <= n <= 10
                       default 10
  nr_rambuf_pool     : The number of RAM buffers to allocate
                       default 8
<tunable args>
  see `Messages`
E.g.
dmsetup create wbdev --table "0 $sz writeboost 0 $BACKING $CACHE"
dmsetup create wbdev --table "0 $sz writeboost 0 $BACKING $CACHE \
                              4 nr_rambuf_pool 32 segment_size_order 8 \
                              2 allow_migrate 1"
dmsetup create wbdev --table "0 $sz writeboost 0 $BACKING $CACHE \
                              0 \
                              2 allow_migrate 1"
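
The way the argument counts are formed can be illustrated with a short Python
sketch that assembles a type 0 table line. `build_table` and `rambuf_bytes`
are hypothetical helpers, the device paths are placeholders, and the sketch
assumes that an optional-args group may be given without a tunable-args group
following it.

```python
SECTOR_BYTES = 512  # device-mapper sectors are 512 bytes


def rambuf_bytes(segment_size_order=10):
    """Size of one RAM buffer in bytes: (1 << n) sectors."""
    if not 4 <= segment_size_order <= 10:
        raise ValueError("segment_size_order must be within 4..10")
    return (1 << segment_size_order) * SECTOR_BYTES


def build_table(size, backing, cache, optional=None, tunable=None):
    """Assemble a type 0 table line; each count is the number of words that follow."""
    def flatten(pairs):
        return [str(word) for pair in (pairs or {}).items() for word in pair]

    opt, tun = flatten(optional), flatten(tunable)
    rambuf_bytes((optional or {}).get("segment_size_order", 10))  # range check
    words = ["0", str(size), "writeboost", "0", backing, cache]
    if opt or tun:
        words += [str(len(opt))] + opt  # "0" when only tunables are given
    if tun:
        words += [str(len(tun))] + tun
    return " ".join(words)


plain = build_table(1000, "/dev/sdb", "/dev/sdc")
full = build_table(1000, "/dev/sdb", "/dev/sdc",
                   optional={"nr_rambuf_pool": 32, "segment_size_order": 8},
                   tunable={"allow_migrate": 1})
tunable_only = build_table(1000, "/dev/sdb", "/dev/sdc",
                           tunable={"allow_migrate": 1})
```

Note that the count in front of each group is the number of words, not the
number of pairs, which is why two key-value pairs are announced as 4.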
type 1:
<essential args>
  backing_dev
  cache_dev
  plog_dev_desc : A string descriptor to specify the plog device
E.g.
dmsetup create wbdev --table "0 $sz writeboost 1 $BACKING $CACHE $PLOG"
Initialization (Reformatting)
-----------------------------
Reformatting of the cache device and the plog is triggered only if the first
sector of the cache device is zeroed out.
Messages
--------
Some behaviors of a Writeboost device can be tuned online.
Use dmsetup message for this purpose.

(1) Tunables
The tunables in the constructor can be altered online.
See `Background Processings` for details.
barrier_deadline_ms (ms)
  default: 10
allow_migrate (bool)
  default: 0
enable_migration_modulator (bool) and migrate_threshold (%)
  default: 0 and 70
nr_max_batched_migration
  default: 1 << (15 - segment_size_order)
update_record_interval (sec)
  default: 0
sync_interval (sec)
  default: 0
E.g.
dmsetup message wbdev 0 enable_migration_modulator 0
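
The default of `nr_max_batched_migration` depends on `segment_size_order`. The
Python sketch below only evaluates the formula given above. The helper names
are hypothetical and a 512-byte sector is assumed.

```python
SECTOR_BYTES = 512  # device-mapper sectors are 512 bytes


def default_nr_max_batched_migration(segment_size_order=10):
    """Default batch size: 1 << (15 - segment_size_order) segments."""
    return 1 << (15 - segment_size_order)


def default_batch_bytes(segment_size_order=10):
    """Data covered by one default batch: segments * sectors per segment * 512."""
    segments = default_nr_max_batched_migration(segment_size_order)
    return segments * (1 << segment_size_order) * SECTOR_BYTES


defaults = {n: default_nr_max_batched_migration(n) for n in (4, 8, 10)}
```

Because the two shifts cancel out, one default batch always covers 1 << 15
sectors (16 MiB) regardless of the segment size.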
(2) Others
clear_stats
  Clear the statistics (see `Status`).
drop_caches
  Wait for all dirty data on the cache device to be written back to the backing
  device.
E.g.
dmsetup message wbdev 0 drop_caches
Status
------
<cursor pos>
<#cache blocks>
<#segments>
<current id>
<last flushed id>
<last migrated id>
<#dirty cache blocks>
<stat (w/r) x (hit/miss) x (on buffer?) x (fullsize?)>
<#non-full flushed>
<#tunable args> <tunable args>
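
As a rough illustration of how the fields line up, the Python sketch below
splits a status string into named parts. It rests on several assumptions: the
fields are whitespace-separated integers in the order listed, the <stat> field
expands to 16 counters (2 x 2 x 2 x 2), and the string passed in is only the
target-specific part of the dmsetup status output. The sample values are made
up, and `parse_status` and the key names are hypothetical.

```python
HEAD_FIELDS = [
    "cursor_pos", "nr_cache_blocks", "nr_segments", "current_id",
    "last_flushed_id", "last_migrated_id", "nr_dirty_cache_blocks",
]
NR_STATS = 16  # (w/r) x (hit/miss) x (on buffer?) x (fullsize?), assumed


def parse_status(args):
    """Split the target-specific status words into a dictionary."""
    words = args.split()
    head = len(HEAD_FIELDS)
    status = dict(zip(HEAD_FIELDS, map(int, words[:head])))
    status["stats"] = [int(w) for w in words[head:head + NR_STATS]]
    pos = head + NR_STATS
    status["nr_non_full_flushed"] = int(words[pos])
    nr_tunable_words = int(words[pos + 1])
    kv = words[pos + 2:pos + 2 + nr_tunable_words]
    status["tunables"] = dict(zip(kv[0::2], kv[1::2]))
    return status


# A made-up status string, for illustration only.
sample = ("0 1000 100 5 4 3 20 "
          + " ".join(str(i) for i in range(16))
          + " 2 4 allow_migrate 1 sync_interval 0")
parsed = parse_status(sample)
```

The tunables are kept as strings here, since the sketch does not assume
anything about their types.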