Tracker: https://tracker.ceph.com/issues/59165
Note: The following run had ceph/ceph#51425 applied.
With the new patch appplied, we skip the PGAdnvaceMap
ops when to
is ealier than from
epoch (where previously we instead crashed) .
There are few PGAdvanceMap events with a to
epoch of 375 while from
is already of epoch 376.
Affected pgs are:
- 13.7 --> Investigated below
- 54.6
- 9.4
- 51.12
- 14.b
cat osd.1.log | grep nowhere
:
DEBUG 2023-05-15 12:37:26,434 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
DEBUG 2023-05-15 12:37:26,857 [shard 0] osd - pg_advance_map(id=246672, detail=PGAdvanceMap(pg=54.6 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
DEBUG 2023-05-15 12:37:26,990 [shard 0] osd - pg_advance_map(id=245713, detail=PGAdvanceMap(pg=9.4 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
DEBUG 2023-05-15 12:37:27,188 [shard 0] osd - pg_advance_map(id=246603, detail=PGAdvanceMap(pg=51.12 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
DEBUG 2023-05-15 12:37:27,249 [shard 0] osd - pg_advance_map(id=245838, detail=PGAdvanceMap(pg=14.b from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
Following pg=13.7
:
cat osd.*.pg.13.7.log | grep -E 'advancing map to 375|advancing map to 376'
:
- osd.0
DEBUG 2023-05-15 12:37:23,938 [shard 0] osd - pg_advance_map(id=239236, detail=PGAdvanceMap(pg=13.7 from=374 to=376)): advancing map to 375 .. DEBUG 2023-05-15 12:37:24,258 [shard 0] osd - pg_advance_map(id=239236, detail=PGAdvanceMap(pg=13.7 from=374 to=376)): advancing map to 376
- osd.1
DEBUG 2023-05-15 12:37:24,291 [shard 0] osd - pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=371 to=376)): advancing map to 375 .. DEBUG 2023-05-15 12:37:25,319 [shard 0] osd - pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=371 to=376)): advancing map to 376
- osd.2
None
- osd.3
DEBUG 2023-05-15 12:37:22,866 [shard 0] osd - pg_advance_map(id=215152, detail=PGAdvanceMap(pg=13.7 from=372 to=376)): advancing map to 375 .. DEBUG 2023-05-15 12:37:23,755 [shard 0] osd - pg_advance_map(id=215152, detail=PGAdvanceMap(pg=13.7 from=372 to=376)): advancing map to 376
-
[timestamp: 12:37:21] The "skipped"
PGAdnvaceMap
was scheduled her :return pg_shard_manager.broadcast_map_to_pgs(osdmap->get_epoch())
INFO 2023-05-15 12:37:21,658 [shard 0] osd - osd.1: committed_osd_maps: broadcasting osdmap.375 to pgs
-
[timestamp: 12:37:25] As seen in Step 2, this pg has advanced to
376
. -
[timestamp: 12:37:26] By the time the operation started,
from
epoch was already later than theto
originally scheduled.from = pg->get_osdmap_epoch();
DEBUG 2023-05-15 12:37:26,279 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=375)): initializing from: 376 DEBUG 2023-05-15 12:37:26,434 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=376 to=375)): 'from' is later than 'to' epoch,nowhere to advance - skipping
How did pg_advance_map(id=247132 .. to=376)
started before pg_advance_map(id=245809 .. to=375)
INFO 2023-05-15 12:37:21,664 [shard 0] osd - PerShardState::start_operation, pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=375))
DEBUG 2023-05-15 12:37:21,732 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=375)): start
**(Notice timing)**
DEBUG 2023-05-15 12:37:26,279 [shard 0] osd - pg_advance_map(id=245809, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=375)): initializing from: 376
INFO 2023-05-15 12:37:21,800 [shard 0] osd - PerShardState::start_operation, pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=376))
DEBUG 2023-05-15 12:37:21,913 [shard 0] osd - pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=376)): start
**(Notice timing)**
DEBUG 2023-05-15 12:37:22,544 [shard 0] osd - pg_advance_map(id=247132, detail=PGAdvanceMap(pg=13.7 from=4294967295 to=376)): initializing from: 371
pg_advance_map(id=245809 .. to=375)
Did start before pg_advance_map(id=247132 .. to=376)
. However, the from
initialization is not in the same order(!).
- [timestamp: 12:37:21,732] :
245809
Started. - [timestamp: 12:37:21,913] :
247132
Started. - [timestamp: 12:37:22] :
247132
initializiedfrom
. - [timestamp: 12:37:26] :
245809
initializiedfrom
.
- Set
from
tolast_broadcasted_epoch
at the creation ofPGAdnvaceMap
- Update
last_broadcasted_epoch
to the lastPGAdnvaceMap::to
epoch - Start a
PGAdnvaceMap
operation after the PGPipeline will await thelast_broadcasted_epoch
epoch.
-
Abstract example:
pg_advance_map(id=x, detail=PGAdvanceMap(pg=1 from=n to=n+2) start committed_osd_maps: broadcasting osdmap.n+3 to pgs pg_advance_map(id=x+1, detail=PGAdvanceMap(pg=1 from=n+2 to=n+3) Created committed_osd_maps: broadcasting osdmap.n+5 to pgs pg_advance_map(id=x+2, detail=PGAdvanceMap(pg=1 from=n+3 to=n+5) Created pg_advance_map(id=x+1, detail=PGAdvanceMap(pg=1 from=n+2 to=n+3) await map n+2 pg_advance_map(id=x+1, detail=PGAdvanceMap(pg=1 from=n+2 to=n+3) start pg_advance_map(id=x+2, detail=PGAdvanceMap(pg=1 from=n+3 to=n+5) await map n+3 pg_advance_map(id=x+2, detail=PGAdvanceMap(pg=1 from=n+3 to=n+5) start Exclusive epoch range for each PGAdvanceMap operation. last broadcasted epoch is updated only after creating the PGAdvanceMap op.
See: ceph/ceph#48121 for an earlier version of
form
initialization.
seastar::future<> PGAdvanceMap::start()
{
logger().debug("{}: start", *this);
..
return enter_stage<>(
pg->peering_request_pg_pipeline.process
).then([this]