mikulely/bcache.org

## bcache.org

      
    bcache

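In our tests bcache works well for small-file random-write workloads, and it also reduces the latency of small writes.
The main problem with solutions like bcache today is that they are too complex, which makes them hard to implement and maintain.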
Intro

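Adding a cache at the block layer improves disk read/write performance significantly. There are several
  block-level cache implementations; they fall into two groups depending on whether they rely on device mapper (dm):
Implementations that do not depend on dm: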

  bcache
  enhanceio

Implementations that depend on dm:

  flashcache (promoted mainly by Facebook; the most mature)
  dm-cache

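Key features:

  SSD pooling
      Thin provisioning: pure-SSD volumes can be carved out of the SSD pool and used on their own
      Metadata does not live in a fixed region, unlike other solutions where index updates easily wear that region out
      A single cache device can serve multiple backing devices
      Backing devices can be attached/detached at runtime
  IO sensitivity (REQ_SYNC/REQ_META/REQ_FLUSH/REQ_FUA)
  Barriers/cache flushes are handled correctly.
  Detects sequential reads/writes and bypasses the cache for them
  SSD friendly (COW, less write amplification, sequential writes preserved)
  The cache mode can be changed at runtime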

Usage

format device

A single cache set can contain several cache devices and several backing devices.
Format a backing device:
# make-bcache -B /dev/sdx1

Format a cache device:
# make-bcache --block 4k --bucket 2M -C /dev/sdy2

Note that block size refers to the sector size of the backing device, while bucket size is the
  erase block size of the cache device (the SSD). Setting both values sensibly reduces
  write amplification.
attach device

After formatting, attach the devices to each other. First look up the cache set uuid of the
  cache device:
# bcache-super-show /dev/sdy2 | grep cset.uuid

Then attach (replace cset.uuid with the uuid printed by the previous command):
# echo cset.uuid > /sys/block/bcache0/bcache/attach

Then set the cache mode:
# echo writeback > /sys/block/bcache0/bcache/cache_mode
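The attach steps above can be scripted. The following is a minimal sketch, assuming the `cset.uuid` line printed by `bcache-super-show` consists of the key, whitespace, and the uuid. `parse_cset_uuid` and `attach` are hypothetical helper names, and `sysfs_root` is a parameter so the logic can be tried against a scratch directory instead of the real /sys.

```python
import os

def parse_cset_uuid(super_show_output):
    """Return the value of the cset.uuid line, or None if it is missing."""
    for line in super_show_output.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "cset.uuid":
            return fields[1]
    return None

def attach(sysfs_root, bcache_dev, cset_uuid, cache_mode="writeback"):
    """Write the cache set uuid, then the cache mode, to the bcache sysfs files."""
    base = os.path.join(sysfs_root, "block", bcache_dev, "bcache")
    with open(os.path.join(base, "attach"), "w") as f:
        f.write(cset_uuid)
    with open(os.path.join(base, "cache_mode"), "w") as f:
        f.write(cache_mode)
```

On a real system this would be called as `attach("/sys", "bcache0", uuid)` with root privileges.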

Note that a bcache device is controlled through two entry points:

  /sys/block/bcache<N>/bcache
  /sys/fs/bcache/<cset-uuid>

cache status

Check the cache state:
# cat /sys/block/bcache0/bcache/state

There are four states:

  no cache - the backing device has not been attached to a cache device
  clean - there is no dirty data
  dirty - writeback is enabled and there is dirty data
  inconsistent - something went wrong; the backing device and the cache device are out of sync
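A monitoring script could map the contents of the `state` file onto the four meanings listed above. This is only a sketch: `STATE_MEANING`, `describe_state`, and `has_pending_writeback` are hypothetical names, and the input is assumed to be the raw string read from the file.

```python
STATE_MEANING = {
    "no cache": "backing device is not attached to a cache device",
    "clean": "attached, no dirty data",
    "dirty": "writeback enabled and dirty data present",
    "inconsistent": "backing device and cache device are out of sync",
}

def describe_state(raw):
    """Translate the raw contents of the state file into a description."""
    state = raw.strip()
    if state not in STATE_MEANING:
        raise ValueError(f"unexpected bcache state: {state!r}")
    return STATE_MEANING[state]

def has_pending_writeback(raw):
    """True only for 'dirty', the one state that implies unflushed data."""
    return raw.strip() == "dirty"
```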

writeback control

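If writeback_percent is nonzero, bcache throttles writeback so that roughly that percentage of the
  cache stays dirty, and uses a PD controller to adjust the rate smoothly.
The following command flushes all dirty data to the backing device: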
echo 0 > /sys/block/bcache0/bcache/writeback_percent

writeback_delay controls how long (in seconds) data newly written to the cache device waits before it is written back:
cat  /sys/block/bcache0/bcache/writeback_delay
30

sequential bypass

HDDs already handle large sequential writes well, so those bypass the cache:
cat  /sys/block/bcache0/bcache/sequential_cutoff
4.0M

Sequential writes larger than 4M do not need to be cached.

  I personally like to take that down to 1MB judging by the fact that
    files larger than 1MB do read pretty fast directly from the disk !

Bypassing large sequential writes also makes sense from the SSD's point of view:
large write reduces internal GCs
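To make the cutoff concrete, here is a simplified model of sequential detection. It is not the kernel implementation: `SequentialDetector` is a hypothetical class that remembers where each stream ended and how many bytes it has carried, and it bypasses the cache once a stream grows past the cutoff (4M by default, matching the value shown above).

```python
class SequentialDetector:
    def __init__(self, cutoff_bytes=4 * 1024 * 1024):
        self.cutoff = cutoff_bytes
        # end offset of a stream -> bytes seen so far in that stream
        # (never pruned here; a real implementation would bound this)
        self.streams = {}

    def should_bypass(self, offset, length):
        """Return True when this IO extends a stream past the cutoff."""
        # An IO that starts exactly where an earlier one ended continues that stream.
        run = self.streams.pop(offset, 0) + length
        self.streams[offset + length] = run
        return run > self.cutoff
```

With 1M writes issued back to back, the first four are cached and the fifth and later ones bypass the cache, while an unrelated small write elsewhere is still cached.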

gc control


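  Metadata GC - walks the btree and uses the bkey information to mark invalid and valid cache data
    (dirty/clean) as well as metadata, then cleans up
  Cache data GC (Move GC) - uses the marks from the metadata GC pass to find buckets that contain a lot of
    invalid cache data and moves the remaining data into new buckets

Trigger GC manually: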
echo 1 > /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/internal/trigger_gc
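The Move GC selection step can be pictured as follows. This is a sketch under assumptions, not bcache's actual policy: `pick_buckets_to_move` is a hypothetical function, and the live-sector counts are assumed to come from the metadata GC pass.

```python
def pick_buckets_to_move(live_sectors, bucket_sectors, max_buckets):
    """Pick the buckets with the least live data as move candidates.

    live_sectors: dict of bucket id -> sectors still valid after metadata GC.
    Empty buckets need no move and full buckets free nothing, so both are skipped.
    """
    candidates = [
        (live, bucket)
        for bucket, live in live_sectors.items()
        if 0 < live < bucket_sectors
    ]
    candidates.sort()  # fewest live sectors first: cheapest to move
    return [bucket for _, bucket in candidates[:max_buckets]]
```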

find cache/backing device of bcache device

root@ip-172-31-30-23:~# lsblk
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda      202:0    0    8G  0 disk
└─xvda1   202:1    0    8G  0 part /
xvdb      202:16   0   20G  0 disk <- cache
└─bcache0 251:0    0  500G  0 disk
xvdd      202:48   0  500G  0 disk <- backing
└─bcache0 251:0    0  500G  0 disk

The backing device of the bcache device:
root@ip-172-31-30-23:~# ls -l /sys/block/bcache0/bcache
lrwxrwxrwx 1 root root 0 Nov 24 10:37 /sys/block/bcache0/bcache -> ../../../vbd-51760/block/xvdd/bcache

The cache device of the bcache device:
root@ip-172-31-30-23:~# ls -l /sys/block/bcache0/bcache/cache
lrwxrwxrwx 1 root root 0 Nov 24 10:41 /sys/block/bcache0/bcache/cache -> ../../../../../fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f
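The two symlinks shown above can also be resolved programmatically. A minimal sketch, assuming the sysfs layout from the listings; `backing_device` and `cache_set_uuid` are hypothetical helper names, and `sysfs_root` would be "/sys" on a real machine.

```python
import os

def backing_device(sysfs_root, bcache_dev="bcache0"):
    """Name of the block device that owns <bcache_dev>/bcache (the backing device)."""
    link = os.path.join(sysfs_root, "block", bcache_dev, "bcache")
    # The link resolves to .../block/<backing>/bcache; take the directory above it.
    return os.path.basename(os.path.dirname(os.path.realpath(link)))

def cache_set_uuid(sysfs_root, bcache_dev="bcache0"):
    """uuid of the cache set the device is attached to (target of bcache/cache)."""
    link = os.path.join(sysfs_root, "block", bcache_dev, "bcache", "cache")
    return os.path.basename(os.path.realpath(link))
```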

cache set statistics

Under /sys/fs/bcache/<cset-uuid>:

  average_key_size - average amount of data per key in the btree

root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/average_key_size
4.4k


  bdev<0..n> - a symlink to each attached backing device

root@ip-172-31-30-23:~# ls -l  /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/bdev0
lrwxrwxrwx 1 root root 0 Nov 24 11:21 /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/bdev0 -> ../../../devices/vbd-51760/block/xvdd/bcache


  block_size - block size of the cache device

root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/block_size
4.0k


  btree_cache_size - amount of memory used by the btree cache

root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/btree_cache_size
108.5M


  bucket_size

root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/bucket_size
2.0M


  cache<0..n> - symlinks to the cache devices in the cache set

root@ip-172-31-30-23:~# ls -l  /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/cache0
lrwxrwxrwx 1 root root 0 Nov 24 11:21 /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/cache0 -> ../../../devices/vbd-51728/block/xvdb/bcache


  cache_available_percent - percentage of the cache device that does not contain dirty data

root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/cache_available_percent
28

disable cache

root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/io_error_halflife
0
root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/io_error_limit
8

backing device config

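Check the cache mode:

  writethrough - a write returns only after both the cache and the backing device have been written
  writeback
  writearound
  passthrough (listed as none in sysfs) - used once the cache device has failed; writes go straight to the backing device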

# cat /sys/block/bcache0/bcache/cache_mode
[writethrough] writeback writearound none

How much dirty data the backing device has on the cache device:
root@ip-172-31-30-23:~# cat /sys/block/bcache0/bcache/dirty_data
6.8G

readahead size:
cat  /sys/block/bcache0/bcache/readahead
0.0k
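Values such as `6.8G`, `4.0M` and `0.0k` are printed in human-readable form, so scripts need to convert them back. The sketch below assumes the suffixes are powers of 1024 (k, M, G, ...); `parse_size` is a hypothetical helper. Because the printed value is rounded to one decimal, the result is only approximate.

```python
UNITS = "kMGTPE"

def parse_size(text):
    """Convert a string like '6.8G' or '30' into an approximate byte count."""
    text = text.strip()
    if text and text[-1] in UNITS:
        power = UNITS.index(text[-1]) + 1  # k -> 1024**1, M -> 1024**2, ...
        return int(float(text[:-1]) * 1024 ** power)
    return int(float(text))
```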

backing device statistics

Usage statistics for the backing device live under the following paths:
/sys/devices/vbd-51760/block/xvdd/bcache/stats_five_minute/cache_miss_collisions
/sys/devices/vbd-51760/block/xvdd/bcache/stats_total/cache_miss_collisions
/sys/devices/vbd-51760/block/xvdd/bcache/stats_day/cache_miss_collisions
/sys/devices/vbd-51760/block/xvdd/bcache/stats_hour/cache_miss_collisions

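Metrics:

  bypassed - total amount of IO (both reads and writes) that bypassed the cache
  cache_hits - counted per individual IO as bcache sees it; a partial hit is counted as a miss
  cache_misses
  cache_hit_ratio
  cache_bypass_hits
  cache_bypass_misses
  cache_miss_collisions - number of times a race occurred in which data was about to be inserted
    into the cache (because of a cache miss) but had already appeared in the cache
  cache_readaheads - number of times readahead occurred

Time windows:

  stats_five_minute - the last five minutes
  stats_hour - the last hour
  stats_day - the last day
  stats_total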

Sample output, one metric at a time:
root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/stats_total/cache_hits
23440
root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/stats_total/cache_hit_ratio
99

root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/stats_total/cache_bypass_hits
145427
root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/stats_total/cache_bypass_misses
0

root@ip-172-31-30-23:~# cat /sys/fs/bcache/34e3bd83-cdab-466d-84d6-780f7c0d538f/stats_total/cache_miss_collisions
0
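Instead of reading each file by hand, the counters can be collected in one pass. A minimal sketch: `read_stats` and `hit_ratio` are hypothetical helpers, `bcache_dir` is whichever directory holds the `stats_*` subdirectories, and the integer percentage formula is an assumption about how `cache_hit_ratio` relates to the hit and miss counters.

```python
import os

WINDOWS = ("stats_five_minute", "stats_hour", "stats_day", "stats_total")

def read_stats(bcache_dir):
    """Return {window: {metric: raw string}} for every stats_* directory present."""
    stats = {}
    for window in WINDOWS:
        path = os.path.join(bcache_dir, window)
        if not os.path.isdir(path):
            continue
        stats[window] = {}
        for name in sorted(os.listdir(path)):
            with open(os.path.join(path, name)) as f:
                stats[window][name] = f.read().strip()
    return stats

def hit_ratio(hits, misses):
    """Integer hit percentage; 0 when there was no IO at all."""
    total = hits + misses
    return hits * 100 // total if total else 0
```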

cleanup

Stop the bcache device (this releases the backing device):
echo 1 > /sys/block/bcache0/bcache/stop

Stop the cache set:
echo 1 > /sys/block/xvdf/bcache/set/stop

cat /sys/block/bcache0/bcache/state

Implement

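bcache divides the cache device into buckets of bucket_size. bucket_size is typically
  512KB (the same as the SSD erase block size). Apart from the SB (superblock) bucket, every update
  can be done by appending.
Core fields of a bucket:

  priority - 16 bits; a priority number that increases on every hit and decides whether the bucket gets evicted
  generation - 8 bits; decides whether the bucket is still valid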
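The two fields can be modelled in a few lines. This is an illustrative sketch, not the kernel structure: `Bucket` and `pointer_is_stale` are hypothetical names, and the rule that a pointer remembers the generation it was created under is an assumption used to show how an 8-bit generation can invalidate old references cheaply.

```python
class Bucket:
    def __init__(self):
        self.prio = 0  # 16-bit priority
        self.gen = 0   # 8-bit generation

    def hit(self, step=1):
        # Priority grows on a hit and saturates at the 16-bit maximum.
        self.prio = min(self.prio + step, 0xFFFF)

    def invalidate(self):
        # Bumping the generation (mod 256) makes every pointer that still
        # carries the old generation stale, without touching those pointers.
        self.gen = (self.gen + 1) & 0xFF
        self.prio = 0

def pointer_is_stale(bucket, ptr_gen):
    """A pointer is valid only while its generation matches the bucket's."""
    return bucket.gen != ptr_gen
```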

bcache data layout:

  Data Zone - COW allocator; contiguous extents inside buckets
  Metadata Zone - a B+tree index over the extents (stores the mapping from data on the HDD to
    the cached data on the SSD)

Allocator

invalidate bucket

Allocation inside a bucket is append-only: the current allocation offset is recorded, and the next
  allocation starts right after it.
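Append-only allocation amounts to a bump pointer. A minimal sketch, with `OpenBucket` as a hypothetical name and sizes counted in sectors:

```python
class OpenBucket:
    def __init__(self, size_sectors):
        self.size = size_sectors
        self.next_free = 0  # offset where the next allocation starts

    def alloc(self, sectors):
        """Return the start offset of the new extent, or None if it does not fit."""
        if self.next_free + sectors > self.size:
            return None
        start = self.next_free
        self.next_free += sectors
        return start
```

Nothing is ever freed inside a bucket; space is only reclaimed when the whole bucket is invalidated and reused.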
bucket allocator

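Bucket allocation principles:

  IO contiguity comes first, even when the IO comes from different producers
  Locality: data from the same process is placed in the same bucket

Index (bucket management)

All buckets are managed through a btree.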
bkey

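The B+tree maps each of its nodes to a single bucket, and the bucket holds a set of bkeys.
  A bkey describes the mapping between data on the cache device and data on the backing device:
#include <stdint.h>

struct bkey {
 uint64_t	high;	/* packed header bits (backing device id, extent size, dirty flag, ...) */
 uint64_t	low;	/* offset on the backing device */
 uint64_t	ptr[];	/* pointers into the cache device */
};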
Lookup btree

Cached data is looked up by HDD id + the LBA of the IO request.
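The lookup can be illustrated with a sorted list standing in for the B+tree. This is a sketch under assumptions: `ExtentIndex` is a hypothetical class, extents are assumed not to overlap, and keys are ordered by (HDD id, start LBA).

```python
import bisect

class ExtentIndex:
    def __init__(self):
        self.keys = []    # sorted list of (dev_id, start_lba)
        self.values = []  # parallel list of (length, ssd_offset)

    def insert(self, dev_id, start_lba, length, ssd_offset):
        i = bisect.bisect_left(self.keys, (dev_id, start_lba))
        self.keys.insert(i, (dev_id, start_lba))
        self.values.insert(i, (length, ssd_offset))

    def lookup(self, dev_id, lba):
        """Return the SSD offset caching this LBA, or None on a miss."""
        # Find the last extent whose key is <= (dev_id, lba).
        i = bisect.bisect_right(self.keys, (dev_id, lba)) - 1
        if i < 0:
            return None
        key_dev, start = self.keys[i]
        length, ssd_offset = self.values[i]
        if key_dev == dev_id and start <= lba < start + length:
            return ssd_offset + (lba - start)
        return None
```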
Metadata Memory Cache

A contiguous block of memory is allocated for each btree bucket as a metadata cache.
Update btree

B+tree updates are accelerated with a journal (WAL): a write IO can return as soon as the journal
  and the in-memory B+tree node cache have been written.
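The journal-then-acknowledge pattern looks roughly like this. It is a toy model, not bcache's code: `JournaledIndex` is a hypothetical class, and plain Python containers stand in for the on-SSD journal, the in-memory node cache, and the on-SSD btree nodes.

```python
class JournaledIndex:
    def __init__(self):
        self.journal = []  # stand-in for the append-only journal on the SSD
        self.nodes = {}    # stand-in for the in-memory btree node cache
        self.on_disk = {}  # stand-in for btree nodes already written to the SSD

    def insert(self, key, value):
        self.journal.append((key, value))  # 1. sequential journal write
        self.nodes[key] = value            # 2. update the cached node
        return True                        # 3. the write IO can complete here

    def flush_nodes(self):
        # Later, cached nodes are written out and the journal can be trimmed.
        self.on_disk.update(self.nodes)
        self.journal.clear()

    def recover(self):
        # After a crash the memory cache is gone: rebuild it from disk + journal.
        self.nodes = dict(self.on_disk)
        for key, value in self.journal:
            self.nodes[key] = value
```

The expensive btree node write is taken off the IO path; only the sequential journal append has to reach the SSD before the write is acknowledged.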
Writeback

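bcache starts a separate flush thread for each backing device, which writes dirty data from the
  SSD back to the HDD.

  Fetch the dirty bkeys by HDD id, then sort them by the HDD LBA stored in each bkey (reading the dirty
    data from the SSD in that sorted order and writing it to the HDD gives sequential writeback)
  Following the LBA order of the dirty bkeys, read from the SSD and flush to the HDD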
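The sorting step is what turns random cached writes into a sequential stream on the HDD. A minimal sketch: `writeback_order` and `flush` are hypothetical names, dirty keys are modelled as `(dev_id, hdd_lba, length, ssd_offset)` tuples, and `read_ssd`/`write_hdd` are caller-supplied callbacks.

```python
def writeback_order(dirty_keys, dev_id):
    """Select the dirty keys of one backing device, sorted by ascending HDD LBA."""
    mine = [k for k in dirty_keys if k[0] == dev_id]
    return sorted(mine, key=lambda k: k[1])

def flush(dirty_keys, dev_id, read_ssd, write_hdd):
    """Read each dirty extent from the SSD and write it to the HDD in LBA order."""
    for _, lba, length, ssd_offset in writeback_order(dirty_keys, dev_id):
        write_hdd(lba, read_ssd(ssd_offset, length))
```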

Writeback PD/PI Controller

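Flow control:

  writeback PD controller (proportional-derivative controller): the higher the water level, the faster the flush
  Keep as much dirty data in the cache as possible
  Change the water level more quickly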

PD/PI Controller https://www.spinics.net/lists/linux-bcache/msg04954.html
PD Controller
Proportional term
  Derivative term
bcache uses a control system to attempt to keep the amount of dirty data
  in cache at a user-configured level, while not responding excessively to
  transients and variations in write rate.  Previously, the system was a
  PD controller; but the output from it was integrated, turning the
  Proportional term into an Integral term, and turning the Derivative term
  into a crude Proportional term.  Performance of the controller has been
  uneven in production, and it has tended to respond slowly, oscillate,
  and overshoot.
PI controller
This patch set replaces the current control system with an explicit PI
  controller and tuning that should be correct for most hardware.  By
  default, it attempts to write at a rate that would retire 1/40th of the
  current excess blocks per second.  An integral term in turn works to
  remove steady state errors.
IMO, this yields benefits in simplicity (removing weighted average
  filtering, etc) and system performance.
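The PI controller described in the patch text can be sketched as below. Only the 1/40 proportional factor comes from the text; `WritebackPI` is a hypothetical class, and the integral divisor and the minimum rate are assumed values chosen for illustration. Each call to `update` represents one second.

```python
class WritebackPI:
    def __init__(self, p_inverse=40, i_inverse=10000, min_rate=8):
        self.p_inverse = p_inverse  # from the text: retire 1/40th of the excess per second
        self.i_inverse = i_inverse  # assumed value, not taken from the text
        self.min_rate = min_rate    # assumed floor so writeback never stops entirely
        self.integral = 0

    def update(self, dirty, target):
        """Return the writeback rate (same unit as dirty, per second)."""
        error = dirty - target  # excess dirty data above the configured level
        self.integral += error  # accumulated error removes steady-state offset
        rate = error / self.p_inverse + self.integral / self.i_inverse
        return max(rate, self.min_rate)
```

With 5000 dirty blocks against a target of 1000, the first step yields 4000/40 + 4000/10000 = 100.4 blocks per second; once the dirty amount reaches the target, the rate falls back to the floor.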
Release History


  Reasonably stable only on kernel 4.8 and later
  4.10 supports partitioning bcache devices https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8c0d911ac5285e6be8967713271a51bdc5a936a

Issue


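  Functionality
      SSDs and HDDs cannot be hot-plugged
      Detaching an HDD has to wait until all of its dirty data has been flushed
      When an HDD fails, its dirty data on the SSD cannot be cleared
      Unreliable write-back
          Problems
              high error rate
              worn out
              FTL bugs
              destruction
          Solution 1 - SSD RAID
              Minimize the dirty ratio (write-back -> write-back (dirty threshold) -> write-through)
              SSD RAID as Cache (Failure Recovery via Parity)
                  high performance
                  high reliability
                  on-the-fly SSD replacement
                  flexible capacity management
          Solution 2 - Log-structured Approach
              Minimize parity and metadata updates
  Performance
      When the cache fills up, all dirty data has to be flushed before it can cache write IO again
      Performance fluctuates while GC is running
      bcache uses a lot of memory; when system memory is short, metadata cannot be cached and has to be read from the SSD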
  
Things to watch out for:

  How to create multiple cache sets
  Choose a stable kernel version (CentOS 7.1 does not support bcache)

References


  https://wiki.archlinux.org/index.php/Bcache
  http://www.sysnote.org/2014/05/29/bcache-use/ (bcache usage)
  http://www.sysnote.org/2014/06/20/bcache-analysis/ (bcache implementation)
  https://people.redhat.com/mskinner/rhug/q1.2016/dm-cache.pdf
  A hands-on account of accelerating Ceph OSDs with bcache http://www.szsandstone.com/html/news/2017-3-16/209.html
  CAS https://www.intel.com/content/www/us/en/software/intel-cache-acceleration-software-performance.html
  https://raid6.com.au/posts/SSD_caching_problems/
  SSD as caching device: bcache, flashcache, enchanceio, btier
  bcache-status https://gist.github.com/damoxc/6267899
  Using bcache to reduce small-write latency https://dl.gi.de/bitstream/handle/20.500.12116/893/45.pdf?sequence=1
  Keeping the cache device reliable: SSD RAID as Cache (SRC) with Log-structured Approach for Performance and Reliability
  A concise, clear walkthrough of bcache https://datahunter.org/bcache
  Covers using Intel Smart Response on Windows http://www.tech-g.com/2017/08/10/bcache-how-to-setup/
  Lists bcache's problems at the end http://confluence.wartungsfenster.de/display/Adminspace/bcache+SSD+testing+and+tuning
  https://github.com/torvalds/linux/blob/master/Documentation/bcache.txt
  bcachefs ANN https://lkml.org/lkml/2015/8/21/22

Benchmarks:

  bcache/enhanceio/dm-cache benchmark https://github.com/stec-inc/EnhanceIO/wiki/PERFORMANCE-COMPARISON-AMONG-dm-cache,-bcache-and-EnhanceIO

Reading List


  The Programmer’s Guide to bcache
  https://bcache.evilpiepirate.org/
  https://evilpiepirate.org/git/linux-bcache.git
  https://bugs.gentoo.org/638206
  https://patchwork.kernel.org/patch/10062263/
  bcache with ceph https://github.com/blueboxgroup/ursula/blob/master/roles/ceph-osd/tasks/bcache.yml