3329: oid 103 version is 35 and expected 381

Testing (note the --balance-reads flag, which allows reads to be served by replica OSDs):

ceph_test_rados --balance-reads --max-ops 400000 --objects 1024 --max-in-flight 64 --size 4000000 --min-stride-size 400000 --max-stride-size 800000 --max-seconds 600 --op read 100 --op write 50 --op delete 50 --op snap_create 50 --op snap_remove 0 --op rollback 0 --op setattr 25 --op rmattr 25 --op copy_from 0 --op write_excl 50 --pool unique_pool_0

Failure traceback:

update_object_version oid 785 v 593 (ObjNum 2048 snap 305 seq_num 2048) dirty exists
3327:  left oid 785 (ObjNum 2048 snap 305 seq_num 2048)
3330:  finishing write tid 2 to folio031017823-831
3329: oid 103 version is 35 and expected 381
../src/test/osd/RadosModel.h: In function 'virtual void ReadOp::_finish(TestOp::CallbackInfo*)' thread 7efd96ad4700 time 2023-02-13T15:26:02.548504+0000                                     
../src/test/osd/RadosModel.h: 1544: FAILED ceph_assert(version == old_value.version)
 ceph version 18.0.0-2159-g0da3dc0d192 (0da3dc0d19225d3efce468dbb73110213e2bbad5) reef (dev)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x11c) [0x7efda4764deb]                                                                                            
 2: (ceph::register_assert_context(ceph::common::CephContext*)+0) [0x7efda4764ff2]
 3: (ReadOp::_finish(TestOp::CallbackInfo*)+0x426) [0x440f0e]
 4: (TestOp::finish(TestOp::CallbackInfo*)+0xa) [0x457052]
 5: (read_callback(void*, void*)+0x11) [0x457068]
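
The assertion that fires is the test harness's consistency check: ReadOp::_finish compares the version the OSD reported for the read against the version the model last recorded for the object. A minimal standalone sketch of that check (simplified types, not the actual RadosModel.h code):

```cpp
// Toy illustration of the check that fails at RadosModel.h:1544.  The real
// ReadOp::_finish also validates contents, xattrs and omap; only the version
// comparison is sketched here.
#include <cassert>
#include <cstdint>
#include <iostream>

struct ObjectDesc {
    uint64_t version = 0;   // version the test model expects
};

struct ReadResult {
    uint64_t version = 0;   // user_version reported by the OSD that served the read
};

void finish_read(const ReadResult& r, const ObjectDesc& old_value) {
    // Equivalent of: ceph_assert(version == old_value.version)
    if (r.version != old_value.version) {
        std::cerr << "oid version is " << r.version
                  << " and expected " << old_value.version << "\n";
    }
    assert(r.version == old_value.version);
}

int main() {
    // Values from the failure above: the read returned version 35,
    // the model expected 381.
    finish_read(ReadResult{35}, ObjectDesc{381});
}
```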

oid 103 ops:

103: write initial oid 103                                                      
103:  seq_num 102 ranges {514314=787987,1768928=405354,2768204=509510}                                                                                  
103:  writing folio031017823-103 from 514314 to 1302301 tid 1                       
103:  writing folio031017823-103 from 1768928 to 2174282 tid 2                              
103:  writing folio031017823-103 from 2768204 to 3277714 tid 3 
103:  finishing write tid 1 to folio031017823-103                     
103:  finishing write tid 2 to folio031017823-103                     
103:  finishing write tid 3 to folio031017823-103             
103:  finishing write tid 4 to folio031017823-103             
103:  finishing write tid 5 to folio031017823-103                                                                                                                                            
103:  oid 103 updating version 0 to 35                                
103:  oid 103 version 35 is already newer than 35            
103:  oid 103 version 35 is already newer than 35             
103:  oid 103 version 35 is already newer than 32                                                                                                                                            
      update_object_version oid 103 v 35 (ObjNum 102 snap 0 seq_num 102) dirty exists
103:  left oid 103 (ObjNum 102 snap 0 seq_num 102)

1525: read oid 103 snap 4
1525:  expect (ObjNum 102 snap 0 seq_num 102)

2532: setattr oid 103 current snap is 202
      update_object_version oid 103 v 381 (ObjNum 102 snap 0 seq_num 102) dirty exists

3329: read oid 103 snap -1 
3329:  expect (ObjNum 102 snap 0 seq_num 102)                          
3329: oid 103 version is 35 and expected 381 
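
The "already newer than" lines show the model's version bookkeeping: write completions can arrive out of order, so the recorded version is only ever raised, never lowered. A standalone sketch of that behaviour, assuming update_object_version boils down to a monotonic max (the real RadosModel code also tracks dirty/exists state):

```cpp
#include <cstdint>
#include <iostream>

struct TestObject {
    uint64_t version = 0;   // version the model expects the object to report
};

// Completions arriving out of order must never roll the recorded version
// backwards, hence the "already newer" messages in the log above.
void update_object_version(TestObject& obj, int oid, uint64_t new_version) {
    if (obj.version < new_version) {
        std::cout << "oid " << oid << " updating version " << obj.version
                  << " to " << new_version << "\n";
        obj.version = new_version;
    } else {
        std::cout << "oid " << oid << " version " << obj.version
                  << " is already newer than " << new_version << "\n";
    }
}

int main() {
    TestObject obj;
    update_object_version(obj, 103, 35);   // updating version 0 to 35
    update_object_version(obj, 103, 35);   // version 35 is already newer than 35
    update_object_version(obj, 103, 32);   // version 35 is already newer than 32
    update_object_version(obj, 103, 381);  // the setattr at op 2532 raises it to 381
}
```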

osd.1 logs

  • 2:abb5cdf3:::folio031017823-103:head is still referenced and not being removed.
DEBUG 2023-02-13 15:24:13,326 [shard 0] osd - handle_rep_op: osd_repop(client.4151.0:8507 2.5 e221/15 2:abb5cdf3:::folio031017823-103:head v 221'382, mlcod=221'382) v3
DEBUG 2023-02-13 15:24:13,326 [shard 0] osd - log_operation
DEBUG 2023-02-13 15:24:13,326 [shard 0] osd - replica_clear_repop_obc clearing 2 entries
DEBUG 2023-02-13 15:24:13,326 [shard 0] osd -  2:abb5cdf3:::folio031017823-103:head get_object_boundary(from): 2:abb5cdf3:::folio031017823-103:0  head version(to): 2:abb5cdf3:::folio031017823-103:head
DEBUG 2023-02-13 15:24:13,326 [shard 0] osd - removed: 0. skipped (aka refrenced): 1
  • Later on, we have a cache hit on 2:abb5cdf3:::folio031017823-103:head serving the (failing) sparse-read.
DEBUG 2023-02-13 15:26:02,414 [shard 0] osd - client_request(id=11366, detail=m=[osd_op(client.4151.0:10575 2.5 2:abb5cdf3:::folio031017823-103:head {sparse-read 0~3714314, omap-get-vals-by keys in=371b, omap-get-keys in=12b, omap-get-vals in=16b, omap-get-header, getxattrs} snapc 0={} ondisk+read+balance_reads+known_if_redirected+supports_pool_eio e324) v8]): in get_obc stage
TRACE 2023-02-13 15:26:02,414 [shard 0] osd -  pg_epoch 324 pg[2.5( v 318'455 (0'0,318'455] local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [3,0,1] r=2 lpr=15 luod=0'0 ua=0'0 crt=318'455 mlcod 318'455 active  ObjectContextLoader::with_head_obc: object 2:abb5cdf3:::folio031017823-103:head
TRACE 2023-02-13 15:26:02,414 [shard 0] osd -  pg_epoch 324 pg[2.5( v 318'455 (0'0,318'455] local-lis/les=15/16 n=0 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [3,0,1] r=2 lpr=15 luod=0'0 ua=0'0 crt=318'455 mlcod 318'455 active  ObjectContextLoader::get_or_load_obc: cache hit on 2:abb5cdf3:::folio031017823-103:head
DEBUG 2023-02-13 15:26:02,417 [shard 0] osd - do_process: serving replica read on oid 2:abb5cdf3:::folio031017823-103:head
DEBUG 2023-02-13 15:26:02,417 [shard 0] osd - do_osd_ops_execute: object 2:abb5cdf3:::folio031017823-103:head - handling op sparse-read
DEBUG 2023-02-13 15:26:02,417 [shard 0] osd - handling op sparse-read on object 2:abb5cdf3:::folio031017823-103:head
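
Putting the two osd.1 snippets together: replica_clear_repop_obc leaves any cached object context that is still referenced, and the later balanced read is then answered from that surviving, stale context. A standalone toy illustration of both steps, assuming a shared-pointer-backed obc cache (not the actual crimson ObjectContextRegistry/ObjectContextLoader code):

```cpp
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct ObjectContext {
    std::string oid;
    uint64_t version;   // object_info version cached with the context
};

// Toy obc cache keyed by object name.
std::map<std::string, std::shared_ptr<ObjectContext>> cache;

// Sketch of replica_clear_repop_obc: drop cached contexts for the object the
// repop just logged, but skip any entry something else still references.
void replica_clear_repop_obc(const std::string& oid) {
    int removed = 0, skipped = 0;
    auto it = cache.find(oid);
    if (it != cache.end()) {
        if (it->second.use_count() == 1) {   // only the cache holds it
            cache.erase(it);
            ++removed;
        } else {
            ++skipped;                       // still referenced: entry survives, stale
        }
    }
    std::cout << "removed: " << removed << ". skipped (aka referenced): "
              << skipped << "\n";
}

// Sketch of the read path: a cache hit returns the cached (possibly stale)
// context; only a miss loads fresh object_info from the store.
std::shared_ptr<ObjectContext> get_or_load_obc(const std::string& oid,
                                               uint64_t on_disk_version) {
    auto it = cache.find(oid);
    if (it != cache.end()) {
        std::cout << "cache hit on " << oid << "\n";
        return it->second;
    }
    std::cout << "cache miss on " << oid << ", loading from store\n";
    auto obc = std::make_shared<ObjectContext>(ObjectContext{oid, on_disk_version});
    cache[oid] = obc;
    return obc;
}

int main() {
    const std::string oid = "2:abb5cdf3:::folio031017823-103:head";
    cache[oid] = std::make_shared<ObjectContext>(ObjectContext{oid, 35});

    // Someone still holds the context when the repop is logged, so the clear
    // is skipped and version 35 stays cached.
    auto held = cache[oid];
    replica_clear_repop_obc(oid);            // removed: 0, skipped: 1

    // Later balanced read: the cache hit serves version 35, the client expects 381.
    auto obc = get_or_load_obc(oid, /*on_disk_version=*/381);
    std::cout << "sparse-read replies with version " << obc->version
              << " (expected 381)\n";
}
```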

osd.3 logs

When osd.1 should have removed the obc entry in replica_clear_repop_obc, it was still referenced in osd.3 (at 15:24:13,307):

TRACE 2023-02-13 15:24:13,307 [shard 0] osd -  pg_epoch 221 pg[2.5( v 217'380 (0'0,217'380] local-lis/les=15/16 n=106 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [3,0,1] r=0 lpr=15 lua=0'0 crt=217'380 mlcod 217'380 active+clean  ObjectContextLoader::with_head_obc: object 2:abb5cdf3:::folio031017823-103:head

It was only released at 15:24:13,356:

TRACE 2023-02-13 15:24:13,356 [shard 0] osd -  pg_epoch 221 pg[2.5( v 221'382 (0'0,221'382] local-lis/les=15/16 n=107 ec=15/15 lis/c=15/15 les/c/f=16/16/0 sis=15) [3,0,1] r=0 lpr=15 lua=0'0 crt=221'382 mlcod 221'382 active+clean  ObjectContextLoader::with_head_obc: released object 2:abb5cdf3:::folio031017823-103:head

osd.1's attempt to remove it was at 15:24:13,326 and was therefore skipped!

Matan-B commented Feb 14, 2023

(New run)
Both osd.1 and osd.3 are in the get_obc stage for the same PG (2.a)? (object 2:5954f7d4:::folio031142590-146:head)

osd.1:

DEBUG 2023-02-14 14:30:28,617 [shard 0] osd - client_request(id=5905, detail=m=[osd_op(client.4167.0:6409 2.a2:5954f7d4:::folio031142590-146:head {setxattr _header (58) in=65b, truncate 2302889} snapc 60={60, 5f, 5e, 5d, 5c, 5b, 5a, 59, 58, 57, 56, 55, 54, 53, 52, 51, 50, 4f, 4e, 4d, 4c, 4b, 4a, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 3f, 3e, 3d, 3c, 3b, 3a, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 2f, 2e, 2d, 2c, 2b, 2a, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 1f, 1e, 1d, 1c, 1b, 1a, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, f, e, d, c, b, a, 9, 8, 7, 6, 5, 4, 3, 2} ondisk+write+known_if_redirected+supports_pool_eio e114) v8]): in get_obc stage

osd.3:

DEBUG 2023-02-14 14:30:28,620 [shard 0] osd - replicated_request(id=5612, detail=RepRequest(from=1 req=osd_repop(client.4167.0:6409 2.a e114/15 2:5954f7d4:::folio031142590-146:head v 114'359, mlcod=114'359) v3)): in get_obc stage
