Skip to content

Instantly share code, notes, and snippets.

Millions of Tiny Databases

Intro

Physalia 是给 EBS 存储元数据定制(EBS-specialized)的 transactional kv store

  • 13年开发,覆盖了 AWS 60 个 datacenter
  • EBS 场景对 kv store 的需求
    • highly available

Emerging Storage Technologies: A Brief Overview

传统的 SSD 之后涌现的四类新介质

  1. Zoned Namespace SSD
    • 关键设计
      • 空间分 zone,zone内顺序写,随机读
    • 关键优势
      • eliminates the costly garbage collection
      • media over-provisioning

More Than Capacity: Performance-oriented Evolution of Pangu in Alibaba

盘古演进的驱动力

  • hardware technologies
  • the business mode

pangu 2.0 设计目标

  • 存算分离下提供 100μs-level I/O latency

orch-e2

安装 vagrant

在 mac 上

brew install vagrant

Deploy Ceph

Building from scratch was great for two reasons: we were able to deeply integrate it in our architecture, and we developed a deep understanding of its internals.

Fuel

需要注意的是单个 rados-object 默认的大小为 4M。接下来一一说明:

  1. 对于大小小于 head(rgw_max_chunk_size) 的 rgw-object, rgw-object 和 rados-object 一一对应。
    • 若对应的存储桶未启用 versioning 功能,则名字为 <zone_name + '.' + bucket_instance_id + '_' + object_name> ,例如 default.123_foo
    • 若对应的存储桶启用了 versioning 功能,则名字为 =<zone_name + ‘.’ + bucket_instance_id + ‘_’ + ‘:’ + object_instance_id + ‘_’ + object_name>=,例如 default.123_:o.DaikQGHRdqn8J5nudQNxX0gYKEvdr_foo

FAST10 End-to-end Data Integrity for File Systems: A ZFS Case Study

Disk/Memory Corruption

  • disk corruption ZFS是抗住了
  • memory corruption 历来重视不足, ZFS并不能在出现 memory corruption 的情况下依然保证完整性
    • Hardware-based memory corruption occurs as both transient soft errors and repeatable hard errors due to a variety of radiation mechanisms
    • Software can also cause memory corruption; bugs can lead to “wild writes”
# 使用 14TB HDD 盘,每个节点 36 盘位,服役时间从 mission_time 从 1 年到 5 年,默认 slice 分配为 4M
# total_iterations 应该能被 n 整除
## 盘位
### 12 disks_per_node
python simedc.py -n 12 -k 10 -g 3,3,3,3 -t rs -T hie --num_processes=1 --network_setting=125,125 --total_iterations=120000 --chunk_size=4 --num_stripes=1024 --capacity_per_disk=12582912 --disks_per_node=12 --mission_time=43800 > disk_12.txt # 5 years
### 6 disks_per_node
desc: (none)
cmd: /usr/local/bin/fio ./https1.fio
time_unit: i
#-----------
snapshot=0
#-----------
time=0
mem_heap_B=0
mem_heap_extra_B=0
mem_stacks_B=0

多 site 的存储系统

对于单个文件跨多 site 存储的好处显而易见。CleverSafe 内部的实验床从 4site 改成了 8site。

多 site 存储带来的副作用就是安全,因为攻击者从多个 site 获取 k 块分片的成本 会变得更高。CleverSafe 认为这样做就没有必要再用秘钥加密了。

传统的加密算法 Shamir secret sharing 安全性不依赖于秘钥的安全是很难的一 件事情,AONT-RS 结合了 AONT 和 systematic RS 码 实现了计算