Skip to content

Instantly share code, notes, and snippets.

@Matan-B
Last active May 21, 2023 11:38
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Matan-B/67a6c2a9b9c581600799c2c37704e913 to your computer and use it in GitHub Desktop.
Save Matan-B/67a6c2a9b9c581600799c2c37704e913 to your computer and use it in GitHub Desktop.

Tracker: https://tracker.ceph.com/issues/59165

Note: The following run had ceph/ceph#51425 applied.

Step 1: Replacing the crash by skipping

With the new patch appplied, we skip the PGAdnvaceMap ops when to is ealier than from epoch (where previously we instead crashed) .

There are few PGAdvanceMap events with a to epoch of 375 while from is already of epoch 376.

Affected pgs are:

  • 13.7 --> Investigated below
  • 54.6
  • 9.4
  • 51.12
  • 14.b

cat osd.1.log | grep nowhere:

DEBUG 2023-05-15 12:37:26,434 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
DEBUG 2023-05-15 12:37:26,857 [shard 0] osd - pg_advance_map(id=246672, detail=PGAdvanceMap(pg=54.6 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
DEBUG 2023-05-15 12:37:26,990 [shard 0] osd - pg_advance_map(id=245713, detail=PGAdvanceMap(pg=9.4 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
DEBUG 2023-05-15 12:37:27,188 [shard 0] osd - pg_advance_map(id=246603, detail=PGAdvanceMap(pg=51.12 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
DEBUG 2023-05-15 12:37:27,249 [shard 0] osd - pg_advance_map(id=245838, detail=PGAdvanceMap(pg=14.b from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping

PG 13.7 advancing map to 375/376 logs (all OSDs)

Following pg=13.7:

cat osd.*.pg.13.7.log | grep -E 'advancing map to 375|advancing map to 376':

  • osd.0
    DEBUG 2023-05-15 12:37:23,938 [shard 0] osd - pg_advance_map(id=239236, detail=PGAdvanceMap(pg=13.7 from=374 to=376)): advancing map to 375
    ..
    DEBUG 2023-05-15 12:37:24,258 [shard 0] osd - pg_advance_map(id=239236, detail=PGAdvanceMap(pg=13.7 from=374 to=376)): advancing map to 376
    
  • osd.1
    DEBUG 2023-05-15 12:37:24,291 [shard 0] osd - pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=371 to=376)): advancing map to 375
    ..
    DEBUG 2023-05-15 12:37:25,319 [shard 0] osd - pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=371 to=376)): advancing map to 376
    
  • osd.2
    None
    
  • osd.3
    DEBUG 2023-05-15 12:37:22,866 [shard 0] osd - pg_advance_map(id=215152, detail=PGAdvanceMap(pg=13.7 from=372 to=376)): advancing map to 375
    ..
    DEBUG 2023-05-15 12:37:23,755 [shard 0] osd - pg_advance_map(id=215152, detail=PGAdvanceMap(pg=13.7 from=372 to=376)): advancing map to 376
    

Step 2: How did the PG advance past to epoch

  • [timestamp: 12:37:21] The "skipped" PGAdnvaceMap was scheduled her :

    return pg_shard_manager.broadcast_map_to_pgs(osdmap->get_epoch())

    INFO  2023-05-15 12:37:21,658 [shard 0] osd - osd.1: committed_osd_maps: broadcasting osdmap.375 to pgs
    
  • [timestamp: 12:37:25] As seen in Step 2, this pg has advanced to 376.

  • [timestamp: 12:37:26] By the time the operation started, from epoch was already later than the to originally scheduled.

    from = pg->get_osdmap_epoch();

    DEBUG 2023-05-15 12:37:26,279 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=375)): initializing from: 376
    DEBUG 2023-05-15 12:37:26,434 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
    

Step 3: Bad execution timing?

How did pg_advance_map(id=247132 .. to=376) started before pg_advance_map(id=245809 .. to=375)

INFO  2023-05-15 12:37:21,664 [shard 0] osd - PerShardState::start_operation, pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=375))
DEBUG 2023-05-15 12:37:21,732 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=375)): start
**(Notice timing)**
DEBUG 2023-05-15 12:37:26,279 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=375)): initializing from: 376
INFO  2023-05-15 12:37:21,800 [shard 0] osd - PerShardState::start_operation, pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=376))
DEBUG 2023-05-15 12:37:21,913 [shard 0] osd - pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=376)): start
**(Notice timing)**
DEBUG 2023-05-15 12:37:22,544 [shard 0] osd - pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=376)): initializing from: 371

pg_advance_map(id=245809 .. to=375) Did start before pg_advance_map(id=247132 .. to=376). However, the from initialization is not in the same order(!).

  • [timestamp: 12:37:21,732] : 245809 Started.
  • [timestamp: 12:37:21,913] : 247132 Started.
  • [timestamp: 12:37:22] : 247132 initializied from.
  • [timestamp: 12:37:26] : 245809 initializied from.

Possible solution (last_broadcasted_epoch):

Correct starting timing

  1. Set from to last_broadcasted_epoch at the creation of PGAdnvaceMap
  2. Update last_broadcasted_epoch to the last PGAdnvaceMap::to epoch
  3. Start a PGAdnvaceMap operation after the PGPipeline will await the last_broadcasted_epoch epoch.
  • Abstract example:

    pg_advance_map(id=x, detail=PGAdvanceMap(pg=1 from=n to=n+2) start
    committed_osd_maps: broadcasting osdmap.n+3 to pgs
    pg_advance_map(id=x+1, detail=PGAdvanceMap(pg=1 from=n+2 to=n+3) Created
    committed_osd_maps: broadcasting osdmap.n+5 to pgs
    pg_advance_map(id=x+2, detail=PGAdvanceMap(pg=1 from=n+3 to=n+5) Created
    
    pg_advance_map(id=x+1, detail=PGAdvanceMap(pg=1 from=n+2 to=n+3) await map n+2
    pg_advance_map(id=x+1, detail=PGAdvanceMap(pg=1 from=n+2 to=n+3) start
    pg_advance_map(id=x+2, detail=PGAdvanceMap(pg=1 from=n+3 to=n+5) await map n+3
    pg_advance_map(id=x+2, detail=PGAdvanceMap(pg=1 from=n+3 to=n+5) start
    
    Exclusive epoch range for each PGAdvanceMap operation. last broadcasted epoch is updated only after creating the PGAdvanceMap op.
    

    See: ceph/ceph#48121 for an earlier version of form initialization.

WIP: Cause of delayed execution:

seastar::future<> PGAdvanceMap::start()
{
  logger().debug("{}: start", *this);
  ..
  return enter_stage<>(
    pg->peering_request_pg_pipeline.process
   ).then([this]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment