Core Concepts

Acting/Up Set

  • Acting Set - the set of OSD instances responsible for serving the PG in the current (or some given) Interval
  • Up Set - the set of OSD instances that CRUSH computes as responsible for serving the PG in the current (or some given) Interval

Normally the two are identical; they differ when a PG Temp is set in the OSDMap to explicitly specify the Acting Set.

// PG.h
set<pg_shard_t> actingbackfill, actingset, upset;

// osd_types.cc
struct pg_shard_t {
  int32_t osd;
  shard_id_t shard;
};
WRITE_CLASS_ENCODER(pg_shard_t)

The Acting Set is in fact an ordered set: the first OSD is the Primary, and the rest are Replicas.
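A minimal standalone sketch of that relationship, assuming invented OSD ids and a hand-rolled pg_temp map rather than the real OSDMap code: a pg_temp entry makes the Acting Set diverge from the CRUSH-computed Up Set, and index 0 of each set is the respective primary.

#include <cstdint>
#include <iostream>
#include <map>
#include <vector>

// Toy sketch: an OSDMap-like lookup in which a pg_temp entry overrides
// the CRUSH-computed mapping for one PG.
using osd_set_t = std::vector<int32_t>;  // ordered: index 0 is the Primary

osd_set_t crush_map_pg(uint64_t pgid) {
  // stand-in for the CRUSH calculation; this result is the Up Set
  return {static_cast<int32_t>(pgid % 3), 3, 4};
}

int main() {
  uint64_t pgid = 1;
  std::map<uint64_t, osd_set_t> pg_temp;   // explicit Acting Set overrides

  osd_set_t up = crush_map_pg(pgid);       // Up Set: always what CRUSH says
  osd_set_t acting = up;                   // normally identical ...

  pg_temp[pgid] = {4, 3, 1};               // ... until a pg_temp is installed
  auto it = pg_temp.find(pgid);
  if (it != pg_temp.end())
    acting = it->second;

  std::cout << "up primary:     osd." << up.front() << "\n"
            << "acting primary: osd." << acting.front() << "\n";
  return 0;
}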

Authoritative

Peering ultimately selects an authoritative log, which is usually the PG Log of the temporary primary.
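As a loose sketch of that election (not the real find_best_info, whose tie-breaking over log tails and completeness is more involved; the cand_t struct and sample values are invented), one can picture picking the candidate whose info has the greatest (last_epoch_started, last_update):

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <tuple>
#include <vector>

// Simplified candidate info exchanged during peering; the field names follow
// pg_info_t, everything else is made up for illustration.
struct cand_t {
  int32_t osd;
  uint32_t last_epoch_started;
  uint64_t last_update;
};

int main() {
  std::vector<cand_t> infos = {
    {0, 20, 105},   // osd.0
    {3, 21,  98},   // osd.3: newest last_epoch_started wins despite a smaller last_update
    {5, 20, 110},   // osd.5
  };

  // Prefer the most recent last_epoch_started, then the newest last_update.
  auto best = std::max_element(infos.begin(), infos.end(),
      [](const cand_t& a, const cand_t& b) {
        return std::tie(a.last_epoch_started, a.last_update) <
               std::tie(b.last_epoch_started, b.last_update);
      });
  std::cout << "authoritative log comes from osd." << best->osd << "\n";
  return 0;
}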

OSDMap epoch

The epoch is the version number of the OSDMap; it is generated by the OSDMonitor and only ever increases.

// include/types.h
typedef __u32 epoch_t;       // map epoch  (32bits -> 13 epochs/second for 10 years)

class OSDMap {

  private:
    epoch_t epoch;        // what epoch of the osd cluster descriptor is this
    utime_t created, modified; // epoch start time
};

To avoid burning through epochs too quickly, OSDMap changes that occur within a given time window are folded into a single epoch.
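A rough sketch of the batching idea, with invented names (MiniOSDMonitor, a pending list and a flush period) standing in for the real OSDMonitor machinery: many changes are accumulated and published together as one new epoch.

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

using epoch_t = uint32_t;

// Hypothetical monitor-side batching: many pending changes, one epoch
// bump when they are published together at the end of a flush period.
struct MiniOSDMonitor {
  epoch_t epoch = 1;
  std::vector<std::string> pending;   // changes accumulated this period

  void queue_change(const std::string& c) { pending.push_back(c); }

  void publish() {                    // called once per flush period
    if (pending.empty()) return;
    ++epoch;                          // all changes folded into one new epoch
    std::cout << "epoch " << epoch << " carries "
              << pending.size() << " changes\n";
    pending.clear();
  }
};

int main() {
  MiniOSDMonitor mon;
  mon.queue_change("osd.3 down");
  mon.queue_change("osd.7 out");
  mon.queue_change("pool rbd pg_num 128 -> 256");
  mon.publish();   // one epoch consumed, not three
  return 0;
}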

PG Log - The State of PG

The PG Log records the operations on a PG that the OSD has recently seen.

/**
 * pg_log_t - incremental log of recent pg changes.
 *
 *  serves as a recovery queue for recent changes.
 */
struct pg_log_t {
  /*
   *   head - newest entry (update|delete)
   *   tail - entry previous to oldest (update|delete) for which we have
   *          complete negative information.
   * i.e. we can infer pg contents for any store whose last_update >= tail.
   */
  eversion_t head;    // newest entry
  eversion_t tail;    // version prior to oldest

protected:
  // We can rollback rollback-able entries > can_rollback_to
  eversion_t can_rollback_to;

  // always <= can_rollback_to, indicates how far stashed rollback
  // data can be found
  eversion_t rollback_info_trimmed_to;

public:
  // the actual log
  mempool::osd_pglog::list<pg_log_entry_t> log; // entries are always appended to the end of the list via push_back

  // entries just for dup op detection ordered oldest to newest
  mempool::osd_pglog::list<pg_log_dup_t> dups;
};

PG Log Entry

Number of PG Log entries - target

The number of PG Log entries kept is called the target in Ceph:

size_t target = cct->_conf->osd_min_pg_log_entries;

The target is (see the sketch after this list):

  • normally cct->_conf->osd_min_pg_log_entries
  • cct->_conf->osd_max_pg_log_entries while the PG is degraded
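A hedged sketch of that choice (the Conf struct, its default values, and the pg_log_target helper are stand-ins for the config options, not the real trimming logic): keeping more log entries while degraded lets recovering replicas catch up from the log instead of falling back to a full backfill.

#include <cstddef>
#include <iostream>

// Placeholder config values standing in for osd_min_pg_log_entries /
// osd_max_pg_log_entries; the defaults here are made up for the example.
struct Conf {
  size_t osd_min_pg_log_entries = 3000;
  size_t osd_max_pg_log_entries = 10000;
};

// Keep more log entries while degraded, so that recovering replicas can
// catch up from the log instead of needing a full backfill.
size_t pg_log_target(const Conf& conf, bool is_degraded) {
  return is_degraded ? conf.osd_max_pg_log_entries
                     : conf.osd_min_pg_log_entries;
}

int main() {
  Conf conf;
  std::cout << "healthy:  " << pg_log_target(conf, false) << " entries\n"
            << "degraded: " << pg_log_target(conf, true)  << " entries\n";
  return 0;
}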

Unique identifier of a PG Log Entry - eversion

An eversion is composed of an epoch and a version; the version is generated by the Primary and increases consecutively. An eversion uniquely identifies one modification within the PG, i.e. one PG Log Entry:

// include/types.h
typedef uint64_t version_t;

// osd_types.h
class eversion_t {
public:
  version_t version;
  epoch_t epoch;
  __u32 __pad;
};
struct pg_log_entry_t {
  // describes state for a locally-rollbackable entry
  ObjectModDesc mod_desc;
  bufferlist snaps;   // only for clone entries
  hobject_t  soid;
  osd_reqid_t reqid;  // caller+tid to uniquely identify request
  mempool::osd_pglog::vector<pair<osd_reqid_t, version_t> > extra_reqids;
  eversion_t version, prior_version, reverting_to; // version: the object's version after this modification takes effect
                                                   // prior_version: the object's version before this modification
                                                   // reverting_to: the version to roll back to when reverting an unfound/unrecoverable object
  version_t user_version; // the user version for this entry, i.e. the object version visible to clients
  utime_t     mtime;  // this is the _user_ mtime, mind you; for client-issued ops this carries the client's local time when the op was created
  int32_t return_code; // only stored for ERRORs for dup detection
};
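A small standalone sketch of how eversions order (the ev_t struct is simplified and its member layout differs from the eversion_t above): compare by epoch first, then by version, with the Primary handing out consecutive versions within an epoch.

#include <cstdint>
#include <iostream>
#include <tuple>

using epoch_t = uint32_t;
using version_t = uint64_t;

// Simplified eversion: uniquely identifies one modification within a PG.
// (Members reordered here for brevity; the real eversion_t stores version first.)
struct ev_t {
  epoch_t epoch;
  version_t version;
  bool operator<(const ev_t& o) const {
    return std::tie(epoch, version) < std::tie(o.epoch, o.version);
  }
};

int main() {
  ev_t last{10, 42};

  // The Primary stamps each new log entry with the next consecutive version.
  ev_t next{10, last.version + 1};
  std::cout << std::boolalpha << (last < next) << "\n";     // true

  // An entry from a newer epoch orders after one from an older epoch,
  // even if its version number is smaller.
  ev_t after_peering{11, 1};
  std::cout << (next < after_peering) << "\n";              // true
  return 0;
}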

PG Info - summary of PG statistics

PG Info is the summary statistics of a PG; any inter-OSD communication about a PG must include its PG Info:

basic metadata about the PG’s creation epoch, the version for the most recent write to the PG, last epoch started, last epoch clean, and the beginning of the current interval. Any inter-OSD communication about PGs includes the PG info, such that any OSD that knows a PG exists (or once existed) also has a lower bound on last epoch clean or last epoch started.

PG Info is exchanged during peering, and the current Primary then elects the authoritative log from it; this is the basis for the subsequent PG Log and data synchronization:

struct pg_info_t {
  spg_t pgid;
  eversion_t last_update;      ///< last object version applied to store.
  eversion_t last_complete;    ///< last version pg was complete through.
  epoch_t last_epoch_started;  ///< last epoch at which this pg started on this osd
  epoch_t last_interval_started; ///< first epoch of last_epoch_started interval

  version_t last_user_version; ///< last user object version applied to store

  eversion_t log_tail;         ///< oldest log entry.

  hobject_t last_backfill;     ///< objects >= this and < last_complete may be missing
  bool last_backfill_bitwise;  ///< true if last_backfill reflects a bitwise (vs nibblewise) sort

  interval_set<snapid_t> purged_snaps;

  pg_stat_t stats;

  pg_history_t history; // <- PG History
  pg_hit_set_history_t hit_set;
};

Under normal circumstances last_complete and last_update point to the same log entry; they only diverge when a failure occurs (a toy illustration follows this list):

  • last_update marks whether a PG Log Entry has been recorded on this OSD
  • last_complete marks whether a PG Log Entry has taken effect (been applied) on this OSD
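A hedged toy illustration (simplified types and a made-up MiniPG struct, not pg_info_t itself) of why last_complete lags behind last_update while entries have been recorded but not yet applied:

#include <cstdint>
#include <deque>
#include <iostream>

using version_t = uint64_t;

// Toy PG: entries are recorded (advancing last_update) as soon as they are
// written to the log, and only later applied (advancing last_complete).
struct MiniPG {
  version_t last_update = 0;    // newest entry recorded on this OSD
  version_t last_complete = 0;  // newest entry applied on this OSD
  std::deque<version_t> to_apply;

  void record(version_t v) {
    last_update = v;
    to_apply.push_back(v);
  }
  void apply_one() {
    if (to_apply.empty()) return;
    last_complete = to_apply.front();
    to_apply.pop_front();
  }
};

int main() {
  MiniPG pg;
  pg.record(1); pg.record(2); pg.record(3);
  pg.apply_one();   // only entry 1 has taken effect so far
  std::cout << "last_update=" << pg.last_update
            << " last_complete=" << pg.last_complete << "\n";  // 3 vs 1
  return 0;
}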

PG Interval

A PG Interval is a contiguous range of OSDMap epochs during which the PG's Acting/Up Set does not change:

class PastIntervals {
public:
  struct pg_interval_t {
    vector<int32_t> up, acting;
    epoch_t first, last;
    bool maybe_went_rw;
    int32_t primary;
    int32_t up_primary;
  };
};

In PG History, the starting epoch of the current Interval is recorded as same_interval_since.
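A rough sketch of the decision (the starts_new_interval helper is invented; the real PastIntervals::check_new_interval also considers pool size, min_size and primary changes): a new interval begins whenever the up or acting set differs from the previous epoch's.

#include <cstdint>
#include <iostream>
#include <vector>

using epoch_t = uint32_t;
using osd_vec_t = std::vector<int32_t>;

// Hypothetical check: a new interval starts whenever the PG's up or
// acting set changes between consecutive epochs.
bool starts_new_interval(const osd_vec_t& old_up, const osd_vec_t& old_acting,
                         const osd_vec_t& new_up, const osd_vec_t& new_acting) {
  return old_up != new_up || old_acting != new_acting;
}

int main() {
  osd_vec_t up{1, 2, 3}, acting{1, 2, 3};

  // osd.3 goes down in the next epoch, so both sets shrink -> new interval.
  osd_vec_t new_up{1, 2}, new_acting{1, 2};
  epoch_t e = 101;
  if (starts_new_interval(up, acting, new_up, new_acting))
    std::cout << "same_interval_since becomes " << e << "\n";
  return 0;
}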

PG History

PG History records the recent peering/mapping history of a PG; it is also a member of PG Info:

// osd_types.h
/**
 * pg_history_t - information about recent pg peering/mapping history
 *
 * This is aggressively shared between OSDs to bound the amount of past
 * history they need to worry about.
 */
struct pg_history_t {
  epoch_t epoch_created;       // epoch in which *pg* was created (pool or pg)
  epoch_t epoch_pool_created;  // epoch in which *pool* was created
                               // (note: may be pg creation epoch for
                               // pre-luminous clusters)
  epoch_t last_epoch_started;  // lower bound on last epoch started (anywhere, not necessarily locally)
  epoch_t last_interval_started; // first epoch of last_epoch_started interval
  epoch_t last_epoch_clean;    // lower bound on last epoch the PG was completely clean.
  epoch_t last_interval_clean; // first epoch of last_epoch_clean interval
  epoch_t last_epoch_split;    // as parent or child
  epoch_t last_epoch_marked_full;  // pool or cluster

  /**
   * In the event of a map discontinuity, same_*_since may reflect the first
   * map the osd has seen in the new map sequence rather than the actual start
   * of the interval.  This is ok since a discontinuity at epoch e means there
   * must have been a clean interval between e and now and that we cannot be
   * in the active set during the interval containing e.
   */
  epoch_t same_up_since;       // same up set since
  epoch_t same_interval_since;   // same acting AND up set since
  epoch_t same_primary_since;  // same primary at least back through this epoch.

  eversion_t last_scrub;
  eversion_t last_deep_scrub;
  utime_t last_scrub_stamp;
  utime_t last_deep_scrub_stamp;
  utime_t last_clean_scrub_stamp;
};

Push/Pull

The recovery process is driven by the Primary OSD (see the sketch after the snippet below):

  • if the Primary finds that it itself has objects in need of synchronization, it pulls them from a Replica
  • if the Primary finds that a Replica has objects in need of synchronization, it pushes them to that Replica

// ReplicatedBackend.h
map<pg_shard_t, vector<PushOp> > pushes;
map<pg_shard_t, vector<PullOp> > pulls;
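Tying the two directions together, a simplified sketch (hand-rolled missing sets and shard names, not the real ReplicatedBackend logic) of how the Primary decides, per object, whether to pull or push:

#include <iostream>
#include <map>
#include <set>
#include <string>

using oid_t = std::string;

int main() {
  // Objects each peer is missing, keyed by a toy shard name; in Ceph this
  // information comes from the per-shard missing sets built during peering.
  std::map<std::string, std::set<oid_t>> missing = {
    {"primary", {"obj.a"}},            // the primary itself lacks obj.a
    {"osd.2",   {"obj.b", "obj.c"}},   // a replica lacks obj.b and obj.c
  };

  // Primary missing an object -> pull it from a replica that has it.
  for (const auto& oid : missing["primary"])
    std::cout << "pull " << oid << " from a replica\n";

  // Replica missing an object the primary has -> push it to that replica.
  for (const auto& oid : missing["osd.2"])
    if (!missing["primary"].count(oid))
      std::cout << "push " << oid << " to osd.2\n";
  return 0;
}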

missing

// osd/osd_types.h
using pg_missing_tracker_t = pg_missing_set<true>;

// osd/PGLog.h
pg_missing_tracker_t missing;

class pg_missing_set : public pg_missing_const_i {
  using item = pg_missing_item;
  map<hobject_t, item> missing;  // oid -> (need v, have v)
  map<version_t, hobject_t> rmissing;  // v -> oid
  ChangeTracker<TrackChanges> tracker;
};

New pg_log_entry items are added to missing via add_next_event, which is called from merge_log:

void add_next_event(const pg_log_entry_t& e) {}
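A minimal sketch of that bookkeeping (a simplified pg_missing_item and map; the real add_next_event also handles delete entries, dup ops and the rmissing index): when a log entry the OSD has not applied arrives, the object is recorded as needing the entry's version while "have" keeps the oldest known version.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

using version_t = uint64_t;
using oid_t = std::string;

// Simplified missing item: the version we need vs the version we still have.
struct missing_item {
  version_t need;
  version_t have;
};

// Roughly what add_next_event does for an update entry: record that the
// object now needs the entry's version, keeping the oldest known "have".
void add_next_event(std::map<oid_t, missing_item>& missing, const oid_t& soid,
                    version_t version, version_t prior_version) {
  auto it = missing.find(soid);
  if (it == missing.end())
    missing[soid] = {version, prior_version};  // first miss: have = prior_version
  else
    it->second.need = version;                 // already missing: only bump need
}

int main() {
  std::map<oid_t, missing_item> missing;
  add_next_event(missing, "obj.a", /*version=*/5, /*prior_version=*/4);
  add_next_event(missing, "obj.a", /*version=*/6, /*prior_version=*/5);
  std::cout << "obj.a need=" << missing["obj.a"].need
            << " have=" << missing["obj.a"].have << "\n";   // need=6 have=4
  return 0;
}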

PG State Machine

The commit that added the new PG state activating: 77bc23c3ac684516ffe4d93be91b82cfef41b4a0

Typical scenarios for a PG being in a given state:

The ideal PG state is active+clean. The user-visible PG states are listed below; the first three are states in which a PG counts as stuck.

creating
The PG is still being created; it will then go through peering.
peering
The PG is in the peering process: the OSDs hosting the PG reach agreement on the log, and thereby on all the objects and metadata in the PG (this does not mean every object in the PG is identical and up to date).
active
After peering the PG becomes active; objects in an active PG can be read and written.
clean
Every replica of every object in the PG is consistent; the normal counterpart of inconsistent.
unclean
The PG does not have the required number of replicas; it should undergo recovery.
inactive
The data on the primary OSD is not up to date, so the PG cannot serve reads or writes.
down
OSDs are down (see down_osds_we_would_probe in the ceph pg query output), making the data in the PG unavailable. If those OSDs come back up, the PG recovers; if they cannot, they have to be explicitly declared lost.
replay
After an OSD crash, the PG is waiting for clients to replay their earlier operations.
scrubbing
The PG is being scrubbed.
repair
Ceph is repairing inconsistencies found in the PG.
degraded
The PG does not have the required number of replicas. No manual intervention is needed; once mon_osd_down_out_interval elapses, recovery starts.
undersized
A special form of degraded: the PG cannot select a large enough acting set. For example, the pool size is 3 but a CRUSH configuration problem allows only 2 OSDs to be chosen, so the PG becomes undersized.
inconsistent
Replicas within the same PG disagree (object sizes differ, or after recovery an object still has the wrong number of replicas). This is detected by scrub or deep-scrub; ceph pg repair can fix some kinds of inconsistency.
activating
After peering the PG does not become active immediately: the primary OSD must distribute the peering result to the other OSDs. That distribution is the activation step, and while it is in progress the PG state is activating.
peered
Peering has finished but activation cannot proceed, because the number of OSDs in the PG's acting set has not reached the pool's min_size.
recovering
The PG is migrating/synchronizing objects or their replicas.
backfilling
A special form of recovering: Ceph scans and synchronizes the entire contents of the PG instead of inferring the changes from the log of recent operations.
backfill_wait
The PG is waiting in line to start backfill.
backfill_toofull
The PG is waiting to start backfill because the target OSD has reached its full ratio.
misplaced
The PG has the full number of replicas, but they are not placed where they should be. (Let's say there are 3 OSDs: 0, 1, 2 and all PGs map to some permutation of those three. If you add another OSD (OSD 3), some PGs will now map to OSD 3 instead of one of the others. However, until OSD 3 is backfilled, the PG will have a temporary mapping allowing it to continue to serve I/O from the old mapping. During that time, the PG is misplaced, because it has a temporary mapping, but not degraded, since there are 3 copies.)
incomplete
The PG cannot select an authoritative PG Log, possibly because the OSD holding it is down, or, more seriously, because pg_history is abnormal (in that case, as a last resort, look at osd_find_best_info_ignore_history_les). (Let's say OSD 1, 2, and 3 are the acting OSD set and it switches to OSD 1, 4, and 3; then osd.1 will request a temporary acting set of OSD 1, 2, and 3 while backfilling 4. During this time, if OSD 1, 2, and 3 all go down, osd.4 will be the only one left, which might not have fully backfilled all the data. At this time, the PG will go incomplete, indicating that there are no complete OSDs which are current enough to perform recovery.)
stale
The monitors have not heard anything about the PG since its acting set changed; check the OSDs listed under last acting in ceph health detail. (Alternately, if osd.4 is not involved and the acting set is simply OSD 1, 2, and 3 when OSD 1, 2, and 3 go down, the PG would likely go stale, indicating that the mons have not heard anything on that PG since the acting set changed. The reason is that there are no OSDs left to notify the new OSDs.)
stuck
The OSDs in the PG are not reporting their periodic heartbeats.
remapped
When the PG's acting set changes, data has to be migrated from the old acting set to the new one, which takes time. While the PG is remapped, the primary OSD of the old acting set keeps serving requests during the migration; once the migration finishes, the PG no longer needs to stay remapped.
stray
An OSD still holds a copy of the PG, but that OSD is no longer in the PG's Acting Set.

