- Acting Set - the OSD instances responsible for serving a PG in the current (or a given) interval
- Up Set - the OSD instances computed by CRUSH to serve a PG in the current (or a given) interval

Normally the two are identical; they diverge only when a PG Temp entry in the OSDMap explicitly overrides the Acting Set.
// osd/PG.h
set<pg_shard_t> actingbackfill, actingset, upset;
// osd/osd_types.h
struct pg_shard_t {
  int32_t osd;
  shard_id_t shard;
};
WRITE_CLASS_ENCODER(pg_shard_t)
The Acting Set is in fact an ordered set: its first OSD is the Primary, the rest are Replicas.
Peering ultimately selects the authoritative log, which is usually the PG Log of the (temporary) primary.
epoch is the OSDMap's version number; it is generated by the OSDMonitor and always increases.
// include/types.h
typedef __u32 epoch_t;       // map epoch (32bits -> 13 epochs/second for 10 years)
class OSDMap {
private:
  epoch_t epoch;             // what epoch of the osd cluster descriptor is this
  utime_t created, modified; // epoch start time
};
To avoid burning through epochs too quickly, OSDMap changes within a given time window are folded into the same epoch.
The PG Log records the operations on a PG that the OSD has recently seen.
/**
* pg_log_t - incremental log of recent pg changes.
*
* serves as a recovery queue for recent changes.
*/
struct pg_log_t {
  /*
   * head - newest entry (update|delete)
   * tail - entry previous to oldest (update|delete) for which we have
   *        complete negative information.
   * i.e. we can infer pg contents for any store whose last_update >= tail.
   */
  eversion_t head;    // newest entry
  eversion_t tail;    // version prior to oldest

protected:
  // We can rollback rollback-able entries > can_rollback_to
  eversion_t can_rollback_to;

  // always <= can_rollback_to, indicates how far stashed rollback
  // data can be found
  eversion_t rollback_info_trimmed_to;

public:
  // the actual log; entries are always appended with push_back
  mempool::osd_pglog::list<pg_log_entry_t> log;

  // entries just for dup op detection ordered oldest to newest
  mempool::osd_pglog::list<pg_log_dup_t> dups;
};
The number of PG Log entries to retain is called the target in Ceph:

size_t target = cct->_conf->osd_min_pg_log_entries;

The target is:
- normally cct->_conf->osd_min_pg_log_entries
- cct->_conf->osd_max_pg_log_entries while the PG is degraded, so that a longer log is kept and log-based recovery remains possible
An eversion consists of an epoch and a version, where the version is generated by the Primary and increases contiguously. An eversion uniquely identifies one modification within a PG, i.e. one PG Log entry:
// include/types.h
typedef uint64_t version_t;
// osd_types.h
class eversion_t {
public:
  version_t version;
  epoch_t epoch;
  __u32 __pad;
};
struct pg_log_entry_t {
  // describes state for a locally-rollbackable entry
  ObjectModDesc mod_desc;
  bufferlist snaps;          // only for clone entries
  hobject_t soid;
  osd_reqid_t reqid;         // caller+tid to uniquely identify request
  mempool::osd_pglog::vector<pair<osd_reqid_t, version_t> > extra_reqids;
  eversion_t version;        // version of the object after this change is applied
  eversion_t prior_version;  // version of the object before this change
  eversion_t reverting_to;   // version to revert to when rolling back an unfound/unrecoverable object
  version_t user_version;    // the user version for this entry, i.e. the object version visible to clients
  utime_t mtime;             // the _user_ mtime: for client-issued ops, the client's local time when it created the op
  int32_t return_code;       // only stored for ERRORs for dup detection
};
PG Info carries the PG's summary statistics; any communication between OSDs about a PG must include its PG Info:
basic metadata about the PG’s creation epoch, the version for the most recent write to the PG, last epoch started, last epoch clean, and the beginning of the current interval. Any inter-OSD communication about PGs includes the PG info, such that any OSD that knows a PG exists (or once existed) also has a lower bound on last epoch clean or last epoch started.
PG Infos are exchanged during peering, after which the current Primary elects the authoritative log; this is the basis for the subsequent PG Log and data synchronization:
struct pg_info_t {
  spg_t pgid;
  eversion_t last_update;        ///< last object version applied to store.
  eversion_t last_complete;      ///< last version pg was complete through.
  epoch_t last_epoch_started;    ///< last epoch at which this pg started on this osd
  epoch_t last_interval_started; ///< first epoch of last_epoch_started interval
  version_t last_user_version;   ///< last user object version applied to store
  eversion_t log_tail;           ///< oldest log entry.
  hobject_t last_backfill;       ///< objects >= this and < last_complete may be missing
  bool last_backfill_bitwise;    ///< true if last_backfill reflects a bitwise (vs nibblewise) sort
  interval_set<snapid_t> purged_snaps;
  pg_stat_t stats;
  pg_history_t history;          // <- PG History
  pg_hit_set_history_t hit_set;
};
Normally last_complete and last_update point at the same log entry; they diverge only after a failure:

- last_update is the newest PG Log entry that has been recorded on this OSD
- last_complete is the newest PG Log entry through which the log has fully taken effect on this OSD (every object at or before it is locally present)
A PG Interval is a span of consecutive OSDMap epochs during which the PG's Acting Set and Up Set remain unchanged:
class PastIntervals {
public:
  struct pg_interval_t {
    vector<int32_t> up, acting;
    epoch_t first, last;
    bool maybe_went_rw;
    int32_t primary;
    int32_t up_primary;
  };
};
The first epoch of the current interval is what PG History records as same_interval_since.
PG History records the PG's recent peering/mapping history; it is itself a member of PG Info:
// osd_types.h
/**
* pg_history_t - information about recent pg peering/mapping history
*
* This is aggressively shared between OSDs to bound the amount of past
* history they need to worry about.
*/
struct pg_history_t {
epoch_t epoch_created; // epoch in which *pg* was created (pool or pg)
epoch_t epoch_pool_created; // epoch in which *pool* was created
// (note: may be pg creation epoch for
// pre-luminous clusters)
epoch_t last_epoch_started; // lower bound on last epoch started (anywhere, not necessarily locally)
epoch_t last_interval_started; // first epoch of last_epoch_started interval
epoch_t last_epoch_clean; // lower bound on last epoch the PG was completely clean.
epoch_t last_interval_clean; // first epoch of last_epoch_clean interval
epoch_t last_epoch_split; // as parent or child
epoch_t last_epoch_marked_full; // pool or cluster
/**
* In the event of a map discontinuity, same_*_since may reflect the first
* map the osd has seen in the new map sequence rather than the actual start
* of the interval. This is ok since a discontinuity at epoch e means there
* must have been a clean interval between e and now and that we cannot be
* in the active set during the interval containing e.
*/
epoch_t same_up_since; // same up set since
epoch_t same_interval_since; // same acting AND up set since
epoch_t same_primary_since; // same primary at least back through this epoch.
eversion_t last_scrub;
eversion_t last_deep_scrub;
utime_t last_scrub_stamp;
utime_t last_deep_scrub_stamp;
utime_t last_clean_scrub_stamp;
};
Recovery is driven by the Primary OSD:

- if the Primary finds that it is itself missing objects, it pulls them from a Replica
- if the Primary finds that a Replica is missing objects, it pushes them to it
// ReplicatedBackend.h
map<pg_shard_t, vector<PushOp> > pushes;
map<pg_shard_t, vector<PullOp> > pulls;
// osd/osd_types.h
using pg_missing_tracker_t = pg_missing_set<true>;
// osd/PGLog.h
pg_missing_tracker_t missing;
template <bool TrackChanges>
class pg_missing_set : public pg_missing_const_i {
  using item = pg_missing_item;
  map<hobject_t, item> missing;        // oid -> (need v, have v)
  map<version_t, hobject_t> rmissing;  // v -> oid
  ChangeTracker<TrackChanges> tracker;
};
New entries are added to missing during merge_log, via pg_missing_set::add_next_event:

void add_next_event(const pg_log_entry_t& e);
The activating PG state was introduced in commit 77bc23c3ac684516ffe4d93be91b82cfef41b4a0.
Typical scenarios for each PG state:

The ideal state for a PG is active+clean. The user-visible PG states are listed below; the first three are the ones involved when a PG is reported as stuck.
- creating
  - The PG is being created; once created, it proceeds to peering.
- peering
  - The PG is in the peering process: the OSDs carrying the PG reach agreement on the log, and thereby on the state of all objects and metadata in the PG (this does not mean every replica of every object is identical and up to date).
- active
  - After peering the PG becomes active; objects in an active PG are readable and writable.
- clean
  - All replicas of the PG's objects are consistent; the normal (non-inconsistent) state.
- unclean
  - The PG's replica count has not reached the required number; the PG should undergo recovery.
- inactive
  - The data on the primary OSD is not up to date, so the PG cannot be read or written.
- down
  - Enough OSDs are down (those listed in down_osds_we_would_probe of pg query) that the PG's data is unavailable. If those OSDs come back up, all is well; if they never will, they must be explicitly marked lost.
- replay
  - After an OSD crash, the PG is waiting for clients to replay their in-flight operations.
- scrubbing
  - The PG is being scrubbed.
- repair
  - Ceph is repairing inconsistencies found in the PG.
- degraded
  - The PG's replica count has not reached the required number. No manual intervention is needed: recovery starts once mon_osd_down_out_interval has elapsed.
- undersized
  - A special form of degraded: the PG cannot select a large enough acting set. For example, the pool's size is 3, but a CRUSH configuration problem allows only 2 OSDs to be chosen, so the PG becomes undersized.
- inconsistent
  - Replicas within the PG are inconsistent (object sizes differ, or replica contents still disagree after recovery). This is detected by scrub or deep-scrub; ceph pg repair can fix some kinds of inconsistency.
- activating
  - A PG does not become active immediately after peering: the primary OSD must first distribute the peering result to the other OSDs. While this activation is in progress the PG is activating.
- peered
  - Peering completed but the PG cannot proceed to active, because the number of OSDs in its acting set has not reached the pool's min_size.
- recovering
  - The PG is migrating/synchronizing objects and their replicas.
- backfilling
  - A special form of recovering: Ceph scans and synchronizes the entire contents of the PG instead of inferring what changed from the log of recent operations.
- backfill_wait
  - The PG is waiting in line to begin backfill.
- backfill_toofull
  - The PG is waiting to backfill because the target OSD has reached its full ratio.
- misplaced
  - The PG has the full number of replicas, but some of them are placed on the wrong OSDs. (Lets say there are 3 OSDs: 0,1,2 and all PGs map to some permutation of those three. If you add another OSD (OSD 3), some PGs will now map to OSD 3 instead of one of the others. However, until OSD 3 is backfilled, the PG will have a temporary mapping allowing it to continue to serve I/O from the old mapping. During that time, the PG is misplaced, because it has a temporary mapping, but not degraded, since there are 3 copies.)
- incomplete
  - The PG cannot elect an authoritative PG Log, either because the OSD holding it is down, or, worse, because pg_history is corrupt (in that case, as a last resort, look at osd_find_best_info_ignore_history_les). (Lets say OSD 1, 2, and 3 are the acting OSD set and it switches to OSD 1, 4, and 3, then osd.1 will request a temporary acting set of OSD 1, 2, and 3 while backfilling 4. During this time, if OSD 1, 2, and 3 all go down, osd.4 will be the only one left which might not have fully backfilled all the data. At this time, the PG will go incomplete indicating that there are no complete OSDs which are current enough to perform recovery.)
- stale
  - The monitors have not heard from the PG since its acting set changed; check the OSDs listed under last acting in ceph health detail. (Alternately, if osd.4 is not involved and the acting set is simply OSD 1, 2, and 3 when OSD 1, 2, and 3 go down, the PG would likely go stale indicating that the mons have not heard anything on that PG since the acting set changed. The reason being there are no OSDs left to notify the new OSDs.)
- stuck
  - The PG has remained in a problematic state (such as inactive, unclean, or stale) for longer than mon_pg_stuck_threshold.
- remapped
  - When the PG's acting set changes, data must be migrated from the old acting set to the new one, which takes time. While this migration is in progress the PG is remapped, and the primary OSD of the old acting set continues to serve requests; once migration completes, the remapping is no longer needed.
- stray
  - The OSD holds a copy of the PG but is no longer in the PG's Acting Set.