Tracker: https://tracker.ceph.com/issues/57757

osd.9 shows the following crash backtrace:

ceph version 18.0.0-131-gb2711ca9 (b2711ca9ca68627a0666c08c8c65c3a27a2da3f2) quincy (dev)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7f06c745d420]
 2: gsignal()
 3: abort()
 4: ceph-osd(+0xd0d130) [0x55b3c6c14130]
 5: (__cxxabiv1::__terminate(void (*)())+0xc) [0x55b3c7e4770c]
 6: ceph-osd(+0x1f40777) [0x55b3c7e47777]
 7: ceph-osd(+0x1f408d9) [0x55b3c7e478d9]
 8: ceph-osd(+0xce37c1) [0x55b3c6bea7c1]
 9: (ECUtil::decode(ECUtil::stripe_info_t const&, std::shared_ptr<ceph::ErasureCodeInterface>&, std::map<int, ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::v15_2_0::list> > >&, std::map<int, ceph::buffer::v15_2_0::list*, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::v15_2_0::list*> > >&)+0x49b) [0x55b3c6f831cb]
 10: (ECBackend::handle_recovery_read_complete(hobject_t const&, boost::tuples::tuple<unsigned long, unsigned long, std::map<pg_shard_t, ceph::buffer::v15_2_0::list, std::less<pg_shard_t>, std::allocator<std::pair<pg_shard_t const, ceph::buffer::v15_2_0::list> > >, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type, boost::tuples::null_type>&, std::optional<std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, ceph::buffer::v15_2_0::list, std::less<void>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, ceph::buffer::v15_2_0::list> > > >, RecoveryMessages*)+0x431) [0x55b3c70db3e1]
 11: (OnRecoveryReadComplete::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x71) [0x55b3c7100981]
 12: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8d) [0x55b3c70cc9ad]
 13: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0xdc2) [0x55b3c70e4032]
 14: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2d4) [0x55b3c70e4894]
 15: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x56) [0x55b3c6e9c726]
 16: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x4d2) [0x55b3c6e38682]
 17: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55b3c6c93860]
 18: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x69) [0x55b3c6f87629]
 19: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xb28) [0x55b3c6cacf38]
 20: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x434) [0x55b3c73d7f14]
 21: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b3c73db004]
 22: /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7f06c7451609]
 23: clone()

ECUtil::decode failed with 'ceph::buffer::v15_2_0::end_of_buffer' when trying to substr_of the to_decode chunks returned from all the shards:

for (int i = 0; i < chunks_count; i++) {
   map<int, bufferlist> chunks;
   for (auto j = to_decode.begin();
        j != to_decode.end();
        ++j) {
     chunks[j->first].substr_of(j->second,
                                i*repair_data_per_chunk,
                                repair_data_per_chunk);
   }

The to_decode list (shard id, buffer length) looks like this: [0,262144, 1,258048, 2,258048, 3,258048, 4,258048, 5,258048]

We expect all shards to return the same chunk length, but as you can see, shard 0 returned a buffer list 4K larger than the other shards. That makes the chunks_count calculation wrong, and when we try to substr_of the corresponding range out of one of the shorter buffers we hit end_of_buffer.
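
To make the failure mode concrete, here is a minimal, self-contained sketch of that arithmetic. It is not the real ECUtil::decode code: the repair_data_per_chunk value of 4096 and the derivation of chunks_count from the first shard's buffer are assumptions made purely for illustration.

// Illustrative only: plain std::string stands in for bufferlist, and
// std::out_of_range stands in for ceph::buffer::end_of_buffer.
#include <iostream>
#include <map>
#include <stdexcept>
#include <string>

int main() {
  // Shard id -> returned buffer, mirroring the to_decode list above:
  // shard 0 came back 4K larger than the rest.
  std::map<int, std::string> to_decode = {
      {0, std::string(262144, 'x')}, {1, std::string(258048, 'x')},
      {2, std::string(258048, 'x')}, {3, std::string(258048, 'x')},
      {4, std::string(258048, 'x')}, {5, std::string(258048, 'x')}};

  const std::size_t repair_data_per_chunk = 4096;  // hypothetical chunk size
  // If chunks_count is derived from the oversized shard: 262144/4096 = 64,
  // while the shorter shards only hold 258048/4096 = 63 chunks.
  const std::size_t chunks_count =
      to_decode.begin()->second.size() / repair_data_per_chunk;

  try {
    for (std::size_t i = 0; i < chunks_count; ++i) {
      for (const auto& [shard, buf] : to_decode) {
        const std::size_t off = i * repair_data_per_chunk;
        if (off + repair_data_per_chunk > buf.size())
          throw std::out_of_range("end_of_buffer on shard " + std::to_string(shard));
        std::string chunk = buf.substr(off, repair_data_per_chunk);  // ~substr_of
        (void)chunk;
      }
    }
  } catch (const std::out_of_range& e) {
    // Reached on the 64th iteration: the 4K surplus on shard 0 pushed
    // chunks_count one past what the 258048-byte shards can provide.
    std::cout << e.what() << "\n";
  }
}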

The recovery read with tid=706 is the one that returned the wrong buffer length. In the osd.9 log we can see the read requests being sent to all the shards:

2022-09-29T23:32:10.610+0000 7f06a5796700  1 -- [v2:172.21.15.133:6810/37862,v1:172.21.15.133:6811/37862] --> [v2:172.21.15.12:6802/26796,v1:172.21.15.12:6803/26796] -- MOSDECSubOpRead(3.cs1 479/476 ECSubRead(tid=706, to_read={3:307edc38:::smithi01231941-9:head=786432,262144,0}, subchunks={3:307edc38:::smithi0
2022-09-29T23:32:10.610+0000 7f06a5796700  1 -- [v2:172.21.15.133:6810/37862,v1:172.21.15.133:6811/37862] --> [v2:172.21.15.12:6820/26798,v1:172.21.15.12:6822/26798] -- MOSDECSubOpRead(3.cs3 479/476 ECSubRead(tid=706, to_read={3:307edc38:::smithi01231941-9:head=786432,262144,0}, subchunks={3:307edc38:::smithi0
2022-09-29T23:32:10.610+0000 7f06a5796700  1 -- [v2:172.21.15.133:6810/37862,v1:172.21.15.133:6811/37862] --> [v2:172.21.15.55:6826/41091,v1:172.21.15.55:6827/41091] -- MOSDECSubOpRead(3.cs2 479/476 ECSubRead(tid=706, to_read={3:307edc38:::smithi01231941-9:head=786432,262144,0}, subchunks={3:307edc38:::smithi0
2022-09-29T23:32:10.610+0000 7f06a5796700  1 -- [v2:172.21.15.133:6810/37862,v1:172.21.15.133:6811/37862] --> [v2:172.21.15.55:6808/26878,v1:172.21.15.55:6810/26878] -- MOSDECSubOpRead(3.cs4 479/476 ECSubRead(tid=706, to_read={3:307edc38:::smithi01231941-9:head=786432,262144,0}, subchunks={3:307edc38:::smithi0
2022-09-29T23:32:10.610+0000 7f06a5796700  1 -- [v2:172.21.15.133:6810/37862,v1:172.21.15.133:6811/37862] --> [v2:172.21.15.133:6819/26834,v1:172.21.15.133:6822/26834] -- MOSDECSubOpRead(3.cs5 479/476 ECSubRead(tid=706, to_read={3:307edc38:::smithi01231941-9:head=786432,262144,0}, subchunks={3:307edc38:::smith
2022-09-29T23:32:10.610+0000 7f06a5796700 10 osd.9 pg_epoch: 479 pg[3.cs0( v 477'1140 lc 444'1131 (0'0,477'1140] local-lis/les=476/477 n=9 ec=25/25 lis/c=476/394 les/c/f=477/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=0 lpr=477 pi=[394,476)/1 crt=477'1140 mlcod 444'1131 active+recovering rops=1 m=1 mbc={0={(0+1)=1},1

The replies come back (with some delay) from shards 1-5; shard 0 is delayed a bit longer and finally returns, with the wrong buffer length.

Checking shard 0 (osd.0) for tid=706 (object name: smithi01231941-9):

2022-09-29T23:32:10.711+0000 7f80d1a88700 20 osd.0 pg_epoch: 479 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 luod=0'0 crt=452'1140 lcod 446'1136 mlcod 0'0 active mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] handle_sub_read read request=262144 r=262144 len=262144

The other shards show (here osd.6, shard 4):

2022-09-29T23:32:10.610+0000 7f38d8e41700 20 osd.6 pg_epoch: 479 pg[3.cs4( v 450'1138 (0'0,450'1138] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 luod=0'0 crt=450'1138 lcod 452'1140 mlcod 0'0 active mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] handle_sub_read read request=262144 r=258048 len=258048

The version is different: osd.0 shows 452'1140, but the other shards show 450'1138.
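
(For reference, the X'Y notation in these logs is epoch'version: a version is newer when its epoch is higher, or the epochs are equal and the version counter is higher. A tiny illustrative sketch of that ordering; the real type in Ceph is eversion_t, this struct is just for illustration.)

// Illustrative sketch of the epoch'version ordering seen in the logs.
#include <iostream>
#include <tuple>

struct EVersion {
  unsigned epoch;
  unsigned version;
};

bool operator<(const EVersion& a, const EVersion& b) {
  // Compare by epoch first, then by version counter.
  return std::tie(a.epoch, a.version) < std::tie(b.epoch, b.version);
}

int main() {
  const EVersion osd0_last{452, 1140};    // 452'1140 seen on osd.0
  const EVersion others_last{450, 1138};  // 450'1138 seen on the other shards
  std::cout << (others_last < osd0_last
                    ? "osd.0's log head is ahead of its peers\n"
                    : "osd.0 matches its peers\n");
}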

Checking osd.0 for the version change:

2022-09-29T23:32:08.143+0000 7f80d1a88700 10 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] state<Started/Stray>: got info from osd.9(0) 3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476)
2022-09-29T23:32:08.143+0000 7f80d1a88700  5 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] exit Started/Stray 1.220104 7 0.000817
2022-09-29T23:32:08.143+0000 7f80d1a88700  5 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] enter Started/ReplicaActive
2022-09-29T23:32:08.143+0000 7f80d1a88700  5 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] enter Started/ReplicaActive/RepNotRecovering
2022-09-29T23:32:08.143+0000 7f80d1a88700 10 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] state<Started/ReplicaActive>: In ReplicaActive, about to call activate
2022-09-29T23:32:08.143+0000 7f80d1a88700 10 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] activate - no missing, moving last_complete 452'1140 -> 452'1140
2022-09-29T23:32:08.143+0000 7f80d1a88700 20 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=452'1140 lcod 446'1136 mlcod 0'0 unknown mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] rollforward: entry=452'1139 (444'1132) clone    3:307edc38:::smithi01231941-9:15e by unknown.0.0:0 2022-09-29T23:31:15.093437+0000 0 snaps [15e,15c] ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0
2022-09-29T23:32:08.143+0000 7f80d1a88700 20 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=452'1140 lcod 446'1136 mlcod 0'0 unknown mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] rollforward: entry=452'1140 (444'1132) modify   3:307edc38:::smithi01231941-9:head by client.4590.0:4337 0.000000 0 ObjectCleanRegions clean_offsets: [6733824~18446744073702817791], clean_omap: 1, new_object: 0
2022-09-29T23:32:08.143+0000 7f80d1a88700 10 osd.0 pg_epoch: 477 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476) [9,3,11,7,4,2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=452'1140 lcod 446'1136 mlcod 0'0 unknown mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] state<Started/ReplicaActive>: Activate Finished
2022-09-29T23:32:08.143+0000 7f80d1a88700  6 write_log_and_missing with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615, writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: , clear_divergent_priors: 0

But the other shards show:

2022-09-29T23:32:08.250+0000 7f38d4e39700 20 _merge_object_divergent_entries: keeping 452'1140 (444'1132) modify   3:307edc38:::smithi01231941-9:head by client.4590.0:4337 0.000000 0 ObjectCleanRegions clean_offsets: [6733824~18446744073702817791], clean_omap: 1, new_object: 0
2022-09-29T23:32:08.250+0000 7f38d4e39700 10 _merge_object_divergent_entries: hoid  object_not_in_store: 0
2022-09-29T23:32:08.250+0000 7f38d4e39700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head prior_version: 444'1132 first_divergent_update: 452'1140 last_divergent_update: 452'1140
2022-09-29T23:32:08.250+0000 7f38d4e39700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head has no more recent entries in log
2022-09-29T23:32:08.250+0000 7f38d4e39700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head must be rolled back or recovered, attempting to rollback
2022-09-29T23:32:08.250+0000 7f38d4e39700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head olog_can_rollback_to: 450'1138
2022-09-29T23:32:08.250+0000 7f38d4e39700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head rolling back 452'1140 (444'1132) modify   3:307edc38:::smithi01231941-9:head by client.4590.0:4337 0.000000 0 ObjectCleanRegions clean_offsets: [6733824~18446744073702817791], clean_omap: 1, new_object: 0

and osd.0 shows, a few seconds later:

2022-09-29T23:32:13.103+0000 7f80d1a88700 20 _merge_object_divergent_entries: keeping 452'1139 (444'1132) clone    3:307edc38:::smithi01231941-9:15e by unknown.0.0:0 2022-09-29T23:31:15.093437+0000 0 snaps [15e,15c] ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0
2022-09-29T23:32:13.103+0000 7f80d1a88700 10 _merge_object_divergent_entries: hoid  object_not_in_store: 0
2022-09-29T23:32:13.103+0000 7f80d1a88700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:15e prior_version: 444'1132 first_divergent_update: 452'1139 last_divergent_update: 452'1139
2022-09-29T23:32:13.103+0000 7f80d1a88700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:15e has no more recent entries in log
2022-09-29T23:32:13.103+0000 7f80d1a88700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:15e prior_version or op type indicates creation, deleting

and only then do we roll back to 1138.

Nov 16, 2022 update: Comparing the logs of two OSDs, osd.0 and osd.1, after we get the peering for smithi01231941-9, osd.0 is a bit different:

2022-09-29T23:31:41.955+0000 7f80d6291700 20 bluestore(/var/lib/ceph/osd/ceph-0) _txc_write_nodes shared_blob 0xc93c is 12 SharedBlob(0x55e32c6aa460 loaded (sbid 0xc93c ref_map(0x11fc000~10000=3)))
2022-09-29T23:31:41.955+0000 7f80d6291700 20 bluestore(/var/lib/ceph/osd/ceph-0) _txc_write_nodes shared_blob 0xc943 is 12 SharedBlob(0x55e32cde49a0 loaded (sbid 0xc943 ref_map(0x44da000~a000=3)))
2022-09-29T23:31:41.955+0000 7f80d6291700 20 bluestore(/var/lib/ceph/osd/ceph-0) _txc_write_nodes shared_blob 0xc942 is 12 SharedBlob(0x55e32ce8f810 loaded (sbid 0xc942 ref_map(0x44ca000~10000=3)))
2022-09-29T23:31:41.955+0000 7f80d6291700 20 bluestore(/var/lib/ceph/osd/ceph-0) _txc_write_nodes shared_blob 0xc990 is 12 SharedBlob(0x55e3313838f0 loaded (sbid 0xc990 ref_map(0x1828000~4000=2)))
2022-09-29T23:31:41.955+0000 7f80d6291700 20 bluestore(/var/lib/ceph/osd/ceph-0) _txc_write_nodes shared_blob 0xc98f is 12 SharedBlob(0x55e3313839d0 loaded (sbid 0xc98f ref_map(0x1818000~10000=2)))
2022-09-29T23:31:41.955+0000 7f80d6291700 20 bluestore(/var/lib/ceph/osd/ceph-0) _txc_write_nodes shared_blob 0xc944 is 12 SharedBlob(0x55e3331938f0 loaded (sbid 0xc944 ref_map(0x44e4000~2000=3)))
2022-09-29T23:31:41.955+0000 7f80d6291700 20 bluestore(/var/lib/ceph/osd/ceph-0) _txc_finalize_kv txc 0x55e332e9a000 allocated 0x[5d40000~19b000] released 0x[]
2022-09-29T23:31:41.955+0000 7f80d6291700 10 bluestore(/var/lib/ceph/osd/ceph-0) _txc_state_proc txc 0x55e332e9a000 prepare
2022-09-29T23:31:41.955+0000 7f80d6291700 10 bluestore(/var/lib/ceph/osd/ceph-0) _txc_aio_submit txc 0x55e332e9a000
2022-09-29T23:31:41.959+0000 7f80d6291700 10 osd.0 pg_epoch: 452 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=394) [0,1,5,3,6,11]p0(0) r=0 lpr=394 luod=450'1138 lua=450'1138 crt=450'1138 lcod 446'1136 mlcod 446'1136 active+clean ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] op_applied version 452'1140
2022-09-29T23:31:41.959+0000 7f80cda80700 20 osd.0 op_wq(4) _process empty q, waiting
2022-09-29T23:31:41.959+0000 7f80d1a88700 20 osd.0 op_wq(4) _process empty q, waiting
2022-09-29T23:31:41.959+0000 7f80e6c5e700 10 bluestore(/var/lib/ceph/osd/ceph-0) _txc_state_proc txc 0x55e332e9a000 aio_wait
2022-09-29T23:31:41.959+0000 7f80e6c5e700 20 bluestore(/var/lib/ceph/osd/ceph-0) _txc_finish_io 0x55e332e9a000
2022-09-29T23:31:41.959+0000 7f80d6291700 10 osd.0 pg_epoch: 452 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=394) [0,1,5,3,6,11]p0(0) r=0 lpr=394 luod=450'1138 crt=450'1138 lcod 446'1136 mlcod 446'1136 active+clean ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] eval_repop repgather(0x55e33b5f0300 452'1140 rep_tid=326 committed?=0 r=0)
2022-09-29T23:31:41.959+0000 7f80d6291700 10 osd.0 pg_epoch: 452 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=394) [0,1,5,3,6,11]p0(0) r=0 lpr=394 luod=450'1138 crt=450'1138 lcod 446'1136 mlcod 446'1136 active+clean ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] requeue_op_blocked_by_object 3:307edc38:::smithi01231941-9:head requeuing 1 requests
2022-09-29T23:31:41.959+0000 7f80d6291700 20 osd.0 pg_epoch: 452 pg[3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=394) [0,1,5,3,6,11]p0(0) r=0 lpr=394 luod=450'1138 crt=450'1138 lcod 446'1136 mlcod 446'1136 active+clean ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] requeue_op 0x55e32ccc6600
2022-09-29T23:31:41.959+0000 7f80d6291700 20 osd.0 op_wq(4) _enqueue_front OpSchedulerItem(3.cs0 PGOpItem(op=osd_op(client.4590.0:4338 3.cs0 3:307edc38:::smithi01231941-9:head [stat] snapc 0=[] ondisk+read+rwordered+known_if_redirected+supports_pool_eio e452) v8) prio 63 cost 0 e452)

unlike osd.1:

2022-09-29T23:31:41.951+0000 7f94e5339700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_write_nodes shared_blob 0x2d8c is 12 SharedBlob(0x556e7e1144d0 loaded (sbid 0x2d8c ref_map(0x1243a000~10000=3)))
2022-09-29T23:31:41.951+0000 7f94e5339700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_write_nodes shared_blob 0x2d8b is 12 SharedBlob(0x556e7e1145b0 loaded (sbid 0x2d8b ref_map(0x1242a000~10000=3)))
2022-09-29T23:31:41.951+0000 7f94e5339700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_write_nodes shared_blob 0x2d8a is 12 SharedBlob(0x556e7e1147e0 loaded (sbid 0x2d8a ref_map(0x12420000~a000=3)))
2022-09-29T23:31:41.951+0000 7f94e5339700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_write_nodes shared_blob 0x2d89 is 12 SharedBlob(0x556e7e1148c0 loaded (sbid 0x2d89 ref_map(0x12410000~10000=3)))
2022-09-29T23:31:41.951+0000 7f94e5339700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_write_nodes shared_blob 0x2d8e is 12 SharedBlob(0x556e7e115dc0 loaded (sbid 0x2d8e ref_map(0x12454000~2000=3)))
2022-09-29T23:31:41.951+0000 7f94e5339700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_write_nodes shared_blob 0x2d88 is 12 SharedBlob(0x556e7e1b3420 loaded (sbid 0x2d88 ref_map(0x12400000~10000=3)))
2022-09-29T23:31:41.951+0000 7f94e5339700 20 bluestore(/var/lib/ceph/osd/ceph-1) _txc_finalize_kv txc 0x556e62cc6600 allocated 0x[139a0000~19b000] released 0x[]
2022-09-29T23:31:41.951+0000 7f94e5339700 10 bluestore(/var/lib/ceph/osd/ceph-1) _txc_state_proc txc 0x556e62cc6600 prepare
2022-09-29T23:31:41.951+0000 7f94e5339700 10 bluestore(/var/lib/ceph/osd/ceph-1) _txc_aio_submit txc 0x556e62cc6600
2022-09-29T23:31:41.955+0000 7f94e1331700 20 osd.1 op_wq(4) _process empty q, waiting
2022-09-29T23:31:41.955+0000 7f94e5339700 10 osd.1 pg_epoch: 452 pg[3.cs1( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=394) [0,1,5,3,6,11]p0(0) r=1 lpr=394 luod=0'0 lua=450'1138 crt=450'1138 lcod 450'1138 mlcod 450'1138 active mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] op_applied version 452'1140
2022-09-29T23:31:41.955+0000 7f94e5339700 10 osd.1 452 dequeue_op 0x556e721e7980 finish
2022-09-29T23:31:41.955+0000 7f94e5339700 20 osd.1 op_wq(4) _process empty q, waiting

On osd.1 we can see op_applied version 452'1140; on osd.0 we can see requeue_op_blocked_by_object 3:307edc38:::smithi01231941-9:head requeuing 1 requests.

The requeue_op_blocked_by_object path was introduced by ceph/ceph#40067.

Update Nov 23

osd.9 (the primary) shows that we are trying to roll back smithi01231941-9:15e to 452'1140 but cannot; the entries to roll back came from peers osd.3/7/4/1/11 (a sketch of the decision flow follows the log excerpt):

2022-09-29T23:32:08.138+0000 7f06a5796700 20 _merge_object_divergent_entries: keeping 452'1139 (444'1132) clone    3:307edc38:::smithi01231941-9:15e by unknown.0.0:0 2022-09-29T23:31:15.093437+0000 0 snaps [15e,15c] ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid  object_not_in_store: 0
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:15e prior_version: 444'1132 first_divergent_update: 452'1139 last_divergent_update: 452'1139
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:15e has no more recent entries in log
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:15e prior_version or op type indicates creation, deleting
2022-09-29T23:32:08.138+0000 7f06a5796700 20 _merge_object_divergent_entries: merging hoid 3:307edc38:::smithi01231941-9:head entries: 452'1140 (444'1132) modify   3:307edc38:::smithi01231941-9:head by client.4590.0:4337 0.000000 0 ObjectCleanRegions clean_offsets: [6733824~18446744073702817791], clean_omap: 1, new_object: 0
2022-09-29T23:32:08.138+0000 7f06a5796700 20 _merge_object_divergent_entries: keeping 452'1140 (444'1132) modify   3:307edc38:::smithi01231941-9:head by client.4590.0:4337 0.000000 0 ObjectCleanRegions clean_offsets: [6733824~18446744073702817791], clean_omap: 1, new_object: 0
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid  object_not_in_store: 0
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head prior_version: 444'1132 first_divergent_update: 452'1140 last_divergent_update: 452'1140
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head has no more recent entries in log
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head must be rolled back or recovered, attempting to rollback
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head olog_can_rollback_to: 452'1140
2022-09-29T23:32:08.138+0000 7f06a5796700 10 _merge_object_divergent_entries: hoid 3:307edc38:::smithi01231941-9:head cannot rollback 452'1140 (444'1132) modify   3:307edc38:::smithi01231941-9:head by client.4590.0:4337 0.000000 0 ObjectCleanRegions clean_offsets: [6733824~18446744073702817791], clean_omap: 1, new_object: 0
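
Putting the two _merge_object_divergent_entries excerpts side by side, the branches taken look roughly like this. This is a hedged summary of the log lines only, not the actual PGLog code; the field names and helper below are illustrative, and olog_can_rollback_to comes from the authoritative log.

// Sketch of the divergent-entry decisions visible in the logs above.
#include <iostream>

struct DivergentEntry {
  unsigned epoch, version;
  bool indicates_creation;   // "prior_version or op type indicates creation"
  bool has_rollback_info;    // rollback data still recorded for the entry
};

bool newer_than(const DivergentEntry& e, unsigned epoch, unsigned version) {
  return e.epoch > epoch || (e.epoch == epoch && e.version > version);
}

void merge_divergent(const DivergentEntry& e,
                     unsigned crt_epoch, unsigned crt_version) {
  if (e.indicates_creation) {
    std::cout << "prior_version or op type indicates creation, deleting\n";
    return;
  }
  std::cout << "must be rolled back or recovered, attempting to rollback\n";
  if (e.has_rollback_info && newer_than(e, crt_epoch, crt_version))
    std::cout << "rolling back\n";     // the replica shards: olog_can_rollback_to = 450'1138
  else
    std::cout << "cannot rollback\n";  // osd.9 above: olog_can_rollback_to = 452'1140
}

int main() {
  const DivergentEntry head{452, 1140, /*indicates_creation=*/false,
                            /*has_rollback_info=*/true};
  merge_divergent(head, 450, 1138);  // replica case: rolls back to 450'1138
  merge_divergent(head, 452, 1140);  // osd.9 case: entry == olog_can_rollback_to, cannot
}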

Update Nov 27

As we can see, osd.0 asked for the log since 452'1140 ('sending info+missing+log since 452'1140') and got: sending log((452'1140,452'1140], crt=450'1138) missing(0 may_include_deletes = 1)

2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] state<Started/Stray>: got info from osd.9(0) 3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476)
2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] exit Started/Stray 1.220104 7 0.000817
2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] enter Started/ReplicaActive
2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] enter Started/ReplicaActive/RepNotRecovering
2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown NOTIFY mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] state<Started/ReplicaActive>: In ReplicaActive, about to call activate
2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=450'1138 lcod 446'1136 mlcod 0'0 unknown mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] activate - no missing, moving last_complete 452'1140 -> 452'1140
2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=452'1140 lcod 446'1136 mlcod 0'0 unknown mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] rollforward: entry=452'1139 (444'1132) clone    3:307edc38:::smithi01231941-9:15e by unknown.0.0:0 2022-09-29T23:31:15.093437+0000 0 snaps [15e,15c] ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, ne
2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=452'1140 lcod 446'1136 mlcod 0'0 unknown mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] rollforward: entry=452'1140 (444'1132) modify   3:307edc38:::smithi01231941-9:head by client.4590.0:4337 0.000000 0 ObjectCleanRegions clean_offsets: [6733824~18446744073702817791], clean_omap: 1, new_object: 0
2]p9(0) r=-1 lpr=476 pi=[394,476)/1 crt=452'1140 lcod 446'1136 mlcod 0'0 unknown mbc={} ps=[135~1,139~1,13c~1,142~1,149~1,14c~1,14e~1,150~3,158~1,15b~1]] state<Started/ReplicaActive>: Activate Finished

The other OSDs (in this case osd.1) asked for the full log ('sending info+missing+full log') and got: sending log((0'0,452'1140], crt=450'1138) missing(0 may_include_deletes = 1)
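
In other words, with the half-open (since, head] semantics, asking for the log since 452'1140 when the head is 452'1140 yields an empty log, presumably leaving osd.0 without the entries the other OSDs received. A small illustrative sketch of that range arithmetic (the log contents here are hypothetical, and this is not the MOSDPGLog code):

// Shows why "log since 452'1140" with head 452'1140 is empty while a
// full-log request returns everything.
#include <iostream>
#include <utility>
#include <vector>

using EVersion = std::pair<unsigned, unsigned>;  // (epoch, version)

// Entries strictly after 'since' up to the head: the (since, head] range.
std::vector<EVersion> log_entries_after(const std::vector<EVersion>& log,
                                        const EVersion& since) {
  std::vector<EVersion> out;
  for (const auto& e : log)
    if (since < e)
      out.push_back(e);
  return out;
}

int main() {
  // Hypothetical primary log with head 452'1140.
  const std::vector<EVersion> primary_log = {{450, 1138}, {452, 1139}, {452, 1140}};

  // osd.0 asked "since 452'1140" -> (452'1140,452'1140], i.e. nothing.
  std::cout << "osd.0 gets "
            << log_entries_after(primary_log, {452, 1140}).size() << " entries\n";

  // osd.1 asked for the full log -> (0'0,452'1140], i.e. everything.
  std::cout << "osd.1 gets "
            << log_entries_after(primary_log, {0, 0}).size() << " entries\n";
}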

Update Dec 1: With a lot of help from Sam, it appears that during interval 476 osd.5 went up and replaced osd.0, which was the auth_log_shard. That restarted the peering process, but since it is not a new interval (the change happened within the same interval) we do not clear the primary state, which causes osd.0 not to go through the divergent rollback, and we end up with osd.0 holding the wrong version.

PR ceph/ceph#49174 addresses that issue.

NitzanMordhai commented Nov 15, 2022

For some reason osd.0 got a peering request for 1140:

2022-09-29T23:32:08.143+0000 7f80d1a88700 20 osd.0 op_wq(4) _process 3.cs0 to_process <OpSchedulerItem(3.cs0 PGPeeringEvent(epoch_sent: 477 epoch_requested: 477 MInfoRec from 9(0) info: 3.cs0( v 452'1140 (0'0,452'1140] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476)) prio 255 cost 10 e477)> waiting <> waiting_peering {}

but the other OSDs got a peering request for 1138, and that is the reason for the rollforward:

2022-09-29T23:32:08.250+0000 7f38d4e39700 20 osd.6 op_wq(4) _process 3.cs4 to_process <OpSchedulerItem(3.cs4 PGPeeringEvent(epoch_sent: 477 epoch_requested: 477 MInfoRec from 9(0) info: 3.cs0( v 450'1138 (0'0,450'1138] local-lis/les=394/395 n=10 ec=25/25 lis/c=394/394 les/c/f=395/409/0 sis=476)) prio 255 cost 10 e477)> waiting <> waiting_peering {}
