Error racing read got wrong version
on smithi11232367-65
:
2024-04-11T12:39:31.043 INFO:tasks.rados.rados.0.smithi112.stdout:65: write initial oid 65
2024-04-11T12:39:31.043 INFO:tasks.rados.rados.0.smithi112.stdout:65: seq_num 64 ranges {675648=546760,2012024=568349}
2024-04-11T12:39:31.052 INFO:tasks.rados.rados.0.smithi112.stdout:65: writing smithi11232367-65 from 675648 to 1222408 tid 1
2024-04-11T12:39:31.064 INFO:tasks.rados.rados.0.smithi112.stdout:65: writing smithi11232367-65 from 2012024 to 2580373 tid 2
2024-04-11T12:39:31.210 INFO:tasks.rados.rados.0.smithi112.stdout:65: finishing write tid 1 to smithi11232367-65
2024-04-11T12:39:31.237 INFO:tasks.rados.rados.0.smithi112.stdout:65: finishing write tid 2 to smithi11232367-65
2024-04-11T12:39:31.250 INFO:tasks.thrashosds.thrasher:Killing osd 3, live_osds are [0, 1, 2, 3]
2024-04-11T12:39:36.322 INFO:tasks.thrashosds.thrasher:Reviving osd 3
2024-04-11T12:39:43.108 INFO:tasks.rados.rados.0.smithi112.stdout: waiting on 16
2024-04-11T12:39:43.108 INFO:tasks.rados.rados.0.smithi112.stdout:65: finishing write tid 3 to smithi11232367-65
2024-04-11T12:39:43.111 INFO:tasks.rados.rados.0.smithi112.stdout:65: finishing write tid 4 to smithi11232367-65
2024-04-11T12:39:43.111 INFO:tasks.rados.rados.0.smithi112.stdout:65: oid 65 updating version 0 to 11
2024-04-11T12:39:43.111 INFO:tasks.rados.rados.0.smithi112.stdout:65: oid 65 version 11 is already newer than 0
2024-04-11T12:39:43.111 INFO:tasks.rados.rados.0.smithi112.stdout:65: oid 65 version 11 is already newer than 10
2024-04-11T12:39:43.111 INFO:tasks.rados.rados.0.smithi112.stdout:update_object_version oid 65 v 11 (ObjNum 64 snap 0 seq_num 64) dirty exists
2024-04-11T12:39:43.111 INFO:tasks.rados.rados.0.smithi112.stderr:Error: racing read on 65 returned version 12 rather than version 11
From osd3 (was killed and revived):
DEBUG 2024-04-11 12:39:40,193 [shard 0:main] osd - read_log_and_missing 18'10 (0'0) modify 3:414dc9d4:::smithi11232367-65:head by client.4281.0:266 2024-04-11T12:39:31.055978+0000 0 ObjectCleanRegions clean_offsets: [(0, 675648), (1222408, 18446744073708329207)], clean_omap: true, new_object: false
DEBUG 2024-04-11 12:39:40,193 [shard 0:main] osd - read_log_and_missing 18'11 (18'11) modify 3:414dc9d4:::smithi11232367-65:head by client.4281.0:267 2024-04-11T12:39:31.065106+0000 0 ObjectCleanRegions clean_offsets: [(0, 2012024), (2580373, 18446744073706971242)], clean_omap: true, new_object: false
DEBUG 2024-04-11 12:39:42,901 [shard 0:main] osd - merge_log log((18'11,18'12], crt=18'13) from osd.1 into log((0'0,18'11], crt=18'11)
DEBUG 2024-04-11 12:39:42,901 [shard 0:main] osd - merge_log extending head to 18'12
DEBUG 2024-04-11 12:39:42,901 [shard 2:main] osd - 0x603000387080 PeeringEvent<T>::complete_rctx: peering_event(id=115, detail=PeeringEvent(from=0 pgid=3.1 sent=21 requested=21 evt=epoch_sent: 21 epoch_requested: 21 MNotifyRec 3.1 from 0 notify: (query:21 sent:21 3.1( v 18'7 (0'0,18'7] local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) ([15,19] all_participants={0, 1, 3} intervals=[([15,18] acting={0, 1, 3})])) features: 0x3f03cffffffdffff +create_info)): submitting ctx
DEBUG 2024-04-11 12:39:42,901 [shard 2:main] osd - OSDSingletonState::dispatch_context_transaction: empty transaction
DEBUG 2024-04-11 12:39:42,902 [shard 0:main] osd - ? 18'12 (18'12) modify 3:414dc9d4:::smithi11232367-65:head by client.4281.0:268 2024-04-11T12:39:31.065171+0000 0 ObjectCleanRegions clean_offsets: [(0, 2580373), (3075648, 18446744073706475967)], clean_omap: true, new_object: false
DEBUG 2024-04-11 12:39:42,902 [shard 0:main] osd - merge_log cut point (usually last shared) is 18'11
DEBUG 2024-04-11 12:39:42,902 [shard 0:main] osd - merge_log original_crt = 18'11
DEBUG 2024-04-11 12:39:42,902 [shard 0:main] osd - update missing, append 18'12 (18'12) modify 3:414dc9d4:::smithi11232367-65:head by client.4281.0:268 2024-04-11T12:39:31.065171+0000 0 ObjectCleanRegions clean_offsets: [(0, 2580373), (3075648, 18446744073706475967)], clean_omap: true, new_object: false
INFO 2024-04-11 12:39:42,902 [shard 0:main] osd - merge_log_dups log.dups.size()=0olog.dups.size()=11
DEBUG 2024-04-11 12:39:42,902 [shard 0:main] osd - merge_log copying olog dups to log 18'1 to 18'11
DEBUG 2024-04-11 12:39:42,902 [shard 0:main] osd - merge_log removed dups overlapping log entries (0'0,18'11]
INFO 2024-04-11 12:39:42,902 [shard 0:main] osd - end of merge_log_dups changed=1 log.dups.size()=0 olog.dups.size()=11
DEBUG 2024-04-11 12:39:42,902 [shard 0:main] osd - merge_log result log((0'0,18'12], crt=18'12) missing(1 may_include_deletes = 1) changed=1
DEBUG 2024-04-11 12:39:42,902 [shard 0:main] osd - pg_epoch 21 pg[3.2( v 18'12 lc 18'11 (0'0,18'12] local-lis/les=15/16 n=4 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) [1,0,3] r=2 lpr=20 pi=[15,20)/1 crt=18'12 lcod 0'0 mlcod 0'0 unknown NOTIFY PeeringState::proc_lease pg_lease(ru 0.000000000s ub 51.259567261s int 16.000000000s)
DEBUG 2024-04-11 12:39:43,087 [shard 0:main] osd - background_recovery_sub(id=182, detail=MOSDPGPush(3.2 21/20 {PushOp(3:414dc9d4:::smithi11232367-65:head, version: 18'13, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(3:414dc9d4:::smithi11232367-65:head@18'13, size: 3075648, copy_subset: [(2580373, 495275)], clone_subset: {}, snapset: 0=[]:{}, object_exist: true), after_progress: ObjectRecoveryProgress(!first, data_recovered_to: 3075648, data_complete: true, omap_recovered_to: , omap_complete: true, error: false), before_progress: ObjectRecoveryProgress(first, data_recovered_to: 0, data_complete: false, omap_recovered_to: , omap_complete: true, error: false))})): starting start_pg_operation
DEBUG 2024-04-11 12:39:43,089 [shard 0:main] osd - background_recovery_sub(id=182, detail=MOSDPGPush(3.2 21/20 {PushOp(3:414dc9d4:::smithi11232367-65:head, version: 18'13, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(3:414dc9d4:::smithi11232367-65:head@18'13, size: 3075648, copy_subset: [(2580373, 495275)], clone_subset: {}, snapset: 0=[]:{}, object_exist: true), after_progress: ObjectRecoveryProgress(!first, data_recovered_to: 3075648, data_complete: true, omap_recovered_to: , omap_complete: true, error: false), before_progress: ObjectRecoveryProgress(first, data_recovered_to: 0, data_complete: false, omap_recovered_to: , omap_complete: true, error: false))})): have_pg
DEBUG 2024-04-11 12:39:43,090 [shard 0:main] osd - pg_epoch 21 pg[3.2( v 18'12 lc 18'11 (0'0,18'12] local-lis/les=20/21 n=4 ec=15/15 lis/c=20/15 les/c/f=21/16/0 sis=20) [1,0,3] r=2 lpr=20 pi=[15,20)/1 luod=0'0 crt=18'13 lcod 0'0 mlcod 0'0 active PeeringState::recover_got got missing 3:414dc9d4:::smithi11232367-65:head v 18'13
DEBUG 2024-04-11 12:39:43,097 [shard 0:main] osd - 0x603000346ca0 RecoverySubRequest::with_pg: background_recovery_sub(id=182, detail=MOSDPGPush(3.2 21/20 {PushOp(3:414dc9d4:::smithi11232367-65:head, version: 18'13, data_included: [], data_size: 0, omap_header_size: 0, omap_entries_size: 0, attrset_size: 3, recovery_info: ObjectRecoveryInfo(3:414dc9d4:::smithi11232367-65:head@18'13, size: 3075648, copy_subset: [(2580373, 495275)], clone_subset: {}, snapset: 0=[]:{}, object_exist: true), after_progress: ObjectRecoveryProgress(!first, data_recovered_to: 3075648, data_complete: true, omap_recovered_to: , omap_complete: true, error: false), before_progress: ObjectRecoveryProgress(first, data_recovered_to: 0, data_complete: false, omap_recovered_to: , omap_complete: true, error: false))})): complete
DEBUG 2024-04-11 12:40:00,417 [shard 0:main] osd - pg_epoch 29 pg[3.2( v 18'12 (0'0,18'12] local-lis/les=20/21 n=4 ec=15/15 lis/c=20/20 les/c/f=21/21/0 sis=20) [1,0,3] r=2 lpr=20 luod=0'0 crt=18'13 mlcod 0'0 active ScrubScan::scan_object: obj: #3:414dc9d4:::smithi11232367-65:head#
PeeringState::recover_got got missing 3:414dc9d4:::smithi11232367-65:head v 18'13
From osd.1 (serving the read):
DEBUG 2024-04-11 12:39:41,936 [shard 0:main] osd - OSD::handle_peering_op: 3.2 from 3
DEBUG 2024-04-11 12:39:41,936 [shard 0:main] osd - peering_event(id=381, detail=PeeringEvent(from=3 pgid=3.2 sent=20 requested=20 evt=epoch_sent: 20 epoch_requested: 20 MLogRec from 3 log log((0'0,18'11], crt=18'11) pi ([0,0] all_participants={} intervals=[]) +create_info)): starting start_pg_operation
DEBUG 2024-04-11 12:39:41,937 [shard 2:main] osd - peering_event(id=381, detail=PeeringEvent(from=3 pgid=3.2 sent=20 requested=20 evt=epoch_sent: 20 epoch_requested: 20 MLogRec from 3 log log((0'0,18'11], crt=18'11) pi ([0,0] all_participants={} intervals=[]) +create_info)): have_pg
DEBUG 2024-04-11 12:39:41,937 [shard 2:main] osd - 0x6030002dd5e0 PeeringEvent<T>::with_pg: 0x6030002dd5e0 peering_event(id=381, detail=PeeringEvent(from=3 pgid=3.2 sent=20 requested=20 evt=epoch_sent: 20 epoch_requested: 20 MLogRec from 3 log log((0'0,18'11], crt=18'11) pi ([0,0] all_participants={} intervals=[]) +create_info)): pg present
DEBUG 2024-04-11 12:39:41,937 [shard 2:main] osd - do_peering_event handling epoch_sent: 20 epoch_requested: 20 MLogRec from 3 log log((0'0,18'11], crt=18'11) pi ([0,0] all_participants={} intervals=[]) +create_info for pg: 3.2
DEBUG 2024-04-11 12:39:41,937 [shard 2:main] osd - pg_epoch 20 pg[3.2( v 18'12 (0'0,18'12] local-lis/les=15/16 n=4 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) [1,0,3] r=0 lpr=20 pi=[15,20)/1 crt=18'13 lcod 18'11 mlcod 0'0 peering PeeringState::proc_replica_log proc_replica_log for osd.3: 3.2( v 18'11 (0'0,18'11] local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) log((0'0,18'11], crt=18'11) missing(0 may_include_deletes = 1)
DEBUG 2024-04-11 12:39:41,937 [shard 2:main] osd - proc_replica_log for osd.3: 3.2( v 18'11 (0'0,18'11] local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) log((0'0,18'11], crt=18'11) missing(0 may_include_deletes = 1)
DEBUG 2024-04-11 12:39:41,938 [shard 2:main] osd - merge_log point (usually last shared) is 18'11 (18'11) modify 3:414dc9d4:::smithi11232367-65:head by client.4281.0:267 2024-04-11T12:39:31.065106+0000 0 ObjectCleanRegions clean_offsets: [(0, 2012024), (2580373, 18446744073706971242)], clean_omap: true, new_object: false
DEBUG 2024-04-11 12:39:41,938 [shard 2:main] osd - pg_epoch 20 pg[3.2( v 18'12 (0'0,18'12] local-lis/les=15/16 n=4 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) [1,0,3] r=0 lpr=20 pi=[15,20)/1 crt=18'13 lcod 18'11 mlcod 0'0 peering PeeringState::proc_replica_log peer osd.3 now 3.2( v 18'11 (0'0,18'11] local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) missing(0 may_include_deletes = 1)
DEBUG 2024-04-11 12:39:41,938 [shard 2:main] osd - pg_epoch 20 pg[3.2( v 18'12 (0'0,18'12] local-lis/les=15/16 n=4 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) [1,0,3] r=0 lpr=20 pi=[15,20)/1 crt=18'13 lcod 18'11 mlcod 0'0 peering state<Started/Primary/Peering/GetMissing>: still need up_thru update before going active
INFO 2024-04-11 12:39:41,938 [shard 2:main] osd - pg_epoch 20 pg[3.2( v 18'12 (0'0,18'12] local-lis/les=15/16 n=4 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=20) [1,0,3] r=0 lpr=20 pi=[15,20)/1 crt=18'13 lcod 18'11 mlcod 0'0 peering exit Started/Primary/Peering/GetMissing 0.045601 3 0.003369
INFO 2024-04-11 12:39:41,938 [shard 2:main] osd - Exiting state: Started/Primary/Peering/GetMissing, entered at 1712839181.892836, 0.003369422 spent on 3 events
DEBUG 2024-04-11 12:39:42,986 [shard 2:main] osd - do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
DEBUG 2024-04-11 12:39:42,986 [shard 2:main] osd - do_recover_missing need to wait for recovery, 3:414dc9d4:::smithi11232367-65:head
DEBUG 2024-04-11 12:39:42,986 [shard 2:main] osd - 0x6030004dd440 pg_epoch 21 pg[3.2( v 18'12 (0'0,18'12] local-lis/les=20/21 n=4 ec=15/15 lis/c=20/15 les/c/f=21/16/0 sis=20) [1,0,3] r=0 lpr=20 pi=[15,20)/1 crt=18'13 lcod 18'11 mlcod 0'0 active+degraded : UrgentRecovery::do_recovery: do_recovery
DEBUG 2024-04-11 12:39:42,986 [shard 0:main] osd - peering_event(id=430, detail=PeeringEvent(from=0 pgid=3.d sent=21 requested=21 evt=epoch_sent: 21 epoch_requested: 21 MLeaseAck epoch 21 from osd.0 pg_lease_ack(ruub 51.333808899s))): starting start_pg_operation
INFO 2024-04-11 12:39:42,986 [shard 2:main] osd - recover_missing 3:414dc9d4:::smithi11232367-65:head v 18'12
INFO 2024-04-11 12:39:42,987 [shard 2:main] osd - recover_missing 3:414dc9d4:::smithi11232367-65:head v 18'12, new recovery
DEBUG 2024-04-11 12:39:42,987 [shard 2:main] osd - recover_object: 3:414dc9d4:::smithi11232367-65:head, 18'12
DEBUG 2024-04-11 12:39:42,987 [shard 2:main] osd - maybe_pull_missing_obj: 3:414dc9d4:::smithi11232367-65:head, 18'12
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - prep_push_to_replica: 3:414dc9d4:::smithi11232367-65:head, 18'12
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - calc_head_subsets: 3:414dc9d4:::smithi11232367-65:head clone_overlap {}
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - calc_head_subsets 3:414dc9d4:::smithi11232367-65:head data_subset [(2580373, 495275)]
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - calc_head_subsets 3:414dc9d4:::smithi11232367-65:head -- no clones
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - prep_push: 3:414dc9d4:::smithi11232367-65:head, 18'12
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - prep_push: 3:414dc9d4:::smithi11232367-65:head, 18'12
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - build_push_op 3:414dc9d4:::smithi11232367-65:head @18'13
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - build_push_op 3:414dc9d4:::smithi11232367-65:head @18'13
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - read_metadata_for_push_op, 3:414dc9d4:::smithi11232367-65:head
DEBUG 2024-04-11 12:39:43,006 [shard 2:main] osd - read_metadata_for_push_op, 3:414dc9d4:::smithi11232367-65:head
DEBUG 2024-04-11 12:39:43,100 [shard 0:main] osd - background_recovery_sub(id=461, detail=MOSDPGPushReply(3.2 21/20 {PushReplyOp(3:414dc9d4:::smithi11232367-65:head)})): starting start_pg_operation
DEBUG 2024-04-11 12:39:43,102 [shard 2:main] osd - background_recovery_sub(id=461, detail=MOSDPGPushReply(3.2 21/20 {PushReplyOp(3:414dc9d4:::smithi11232367-65:head)})): have_pg
DEBUG 2024-04-11 12:39:43,102 [shard 2:main] osd - 0x6030002fbdf0 RecoverySubRequest::with_pg: RecoverySubRequest::with_pg: background_recovery_sub(id=461, detail=MOSDPGPushReply(3.2 21/20 {PushReplyOp(3:414dc9d4:::smithi11232367-65:head)}))
DEBUG 2024-04-11 12:39:43,102 [shard 2:main] osd - _handle_push_reply, soid 3:414dc9d4:::smithi11232367-65:head, from 3
DEBUG 2024-04-11 12:39:43,102 [shard 2:main] osd - on_peer_recover: 3:414dc9d4:::smithi11232367-65:head, 18'13 on 3
DEBUG 2024-04-11 12:39:43,103 [shard 2:main] osd - 0x6030002fbdf0 RecoverySubRequest::with_pg: background_recovery_sub(id=461, detail=MOSDPGPushReply(3.2 21/20 {PushReplyOp(3:414dc9d4:::smithi11232367-65:head)})): complete
INFO 2024-04-11 12:39:43,103 [shard 2:main] osd - on_global_recover 3:414dc9d4:::smithi11232367-65:head
DEBUG 2024-04-11 12:39:43,103 [shard 2:main] osd - background_recovery_sub(id=461, detail=MOSDPGPushReply(3.2 21/20 {PushReplyOp(3:414dc9d4:::smithi11232367-65:head)})): exit
DEBUG 2024-04-11 12:39:43,105 [shard 2:main] osd - do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
DEBUG 2024-04-11 12:39:43,107 [shard 2:main] osd - do_osd_ops_execute: object 3:414dc9d4:::smithi11232367-65:head - handling op read
DEBUG 2024-04-11 12:39:43,107 [shard 2:main] osd - handling op read on object 3:414dc9d4:::smithi11232367-65:head
DEBUG 2024-04-11 12:39:43,111 [shard 2:main] osd - do_osd_ops_execute: object 3:414dc9d4:::smithi11232367-65:head all operations successful
-
osd.1 calls
do_recover_missing check for recovery, 3:414dc9d4:::smithi11232367-65:head
and schedulesUrgentRecovery
with version 12 (recover_missing 3:414dc9d4:::smithi11232367-65:head v 18'12
). -
osd.1 serves the unexpected read (
handling op read on object 3:414dc9d4:::smithi11232367-65:head
). -
This may be caused due to incrementing
first_non_divergent
:from
proc_replica_log for osd.3: 3.2( v 18'11 (0'0,18'11]
which precedes bymerge_log point (usually last shared) is 18'11 (18'11) modify
:auto first_non_divergent = log.log.rbegin(); while (1) { if (first_non_divergent == log.log.rend()) break; if (first_non_divergent->version <= olog.head) { dout(20) << "merge_log point (usually last shared) is " << *first_non_divergent << dendl; break; } ++first_non_divergent; }
-
osd.3 also extends to version 12
merge_log extending head to 18'12
Reproduced few times, another similar example:
https://pulpito.ceph.com/matan-2024-04-11_11:23:27-crimson-rados-wip-matanb-crimson-replicated-testing-distro-crimson-smithi/7652230
Error racing read got wrong version on smithi09032091-55:
From osd.2 (was killed and revived):
From osd.0 (serving the read):
Change from 15 to 16: