
@rzarzynski
Created March 14, 2023 21:57

Key points

  • A production cluster with a rich history:
    • deployed on Nautilus,
    • the problem reproduces on Pacific,
    • hit by multiple bugs (see the conversation summary below).
  • Data corruption occurs because clone objects disappear underneath live snapshots.
  • The current reproducer is RBD-based, with no VM involved.

Conversation summary

TODO: structure this better, apply markdown, re-read.

Something that might or might not be relevant to mention is the history of this cluster. IIRC it was installed with Nautilus. During the upgrade to Octopus we were bitten by two bugs (#51619 and #51682). The helpful people of Croit.io assisted us with the recovery, and at that time the pool that is now exhibiting these issues was created anew. One notable detail from back then was that we had high pool ID numbers; the affected pool still does, and so does the device_health_metrics pool (IDs 15213 and 15113).

I doubt that the pool IDs matter, but erroneously removed SharedBlob entries (https://tracker.ceph.com/issues/51619) certainly sound like something that could break snapshots.

Coming back to the nosnaptrim result: is it possible to increase debug logging in such a way that we can see what is being done during snapshot trimming (or by any of the processes that are disabled when running with nosnaptrim)? And can that logging be enabled without a major performance impact? Some impact is expected and acceptable, of course, but the system has to remain usable for our customers.

I'm thinking along the following lines (a command-level sketch follows the list):

  • Enable nosnaptrim on the cluster
  • Start up several reproducers
  • Leave them running for a few hours
  • Increase debug logging of the subsystems involved
  • Unset nosnaptrim
  • Wait for the issue to trigger (it happened within 10 minutes last time)
  • Disable debug logging
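
A rough sketch of how that sequence could look on the command line. The subsystems and debug levels here are assumptions on my part (debug_osd and debug_bluestore seem like the most likely candidates, with 1/5 being their usual defaults) and would need to be confirmed:

ceph osd set nosnaptrim                              # pause snapshot trimming cluster-wide
# ... start the reproducers and let them run for a few hours ...
ceph tell 'osd.*' config set debug_osd 20/20         # verbose OSD logging (snap trim runs here)
ceph tell 'osd.*' config set debug_bluestore 20/20   # verbose BlueStore logging (clone removal)
ceph osd unset nosnaptrim                            # let trimming resume and wait for the issue
ceph tell 'osd.*' config set debug_osd 1/5           # back to the assumed defaults once it triggers
ceph tell 'osd.*' config set debug_bluestore 1/5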

This way we'd only have to run with debug logging for a very short time. I can also make copies of the snapshots before and after the issue triggers, so that we have clear information about exactly which blocks were zeroed; that is probably more useful than just knowing that the snapshot changed.
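
For the before/after snapshot copies, something along these lines should work (a sketch; pool, image, and snapshot names are placeholders):

rbd export <pool>/<image>@<snap> snap-before.raw     # dump the snapshot before the issue triggers
rbd export <pool>/<image>@<snap> snap-after.raw      # ... and again after it has triggered
cmp -l snap-before.raw snap-after.raw | head         # byte offsets that differ (expect zeroed ranges)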

Hi Ilya,

I've managed to build a much simpler (and smaller) reproducer, without any VM involvement. If I create an image manually, fill it with data, make a snapshot, and then either overwrite or discard the original data, the snapshot changes. This was with a 200M image. I've also switched from checksumming the entire image at once to checksumming it in 4M increments (one checksum per block), so that I can see which blocks change and how many. The changed blocks all have a new checksum that matches a 4M block of zeroes.

So with this, I'm down to:

# Create a 200M test image in the ssd pool and map it with krbd.
rbd -p ssd create -s 200M roel-disk-1
rbd -p ssd map roel-disk-1
# Fill the mapped device (/dev/rbdX, as printed by "rbd map") with random data.
dd if=/dev/urandom of=/dev/rbdX
# Snapshot the image, then overwrite the data; reads from the snapshot should be unaffected.
rbd -p ssd snap create roel-disk-1@snapshot
dd if=/dev/urandom of=/dev/rbdX
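
For reference, a minimal sketch of the per-4M-block checksum check described above, assuming the 200M image (50 x 4M blocks) and /dev/rbdX as returned by rbd map; a snapshot can be checked the same way after mapping it read-only with rbd map ssd/roel-disk-1@snapshot:

# One md5 per 4M block of the mapped device.
for i in $(seq 0 49); do
    sum=$(dd if=/dev/rbdX bs=4M skip=$i count=1 status=none | md5sum | awk '{print $1}')
    echo "block $i: $sum"
done
# md5 of 4M of zeroes, to compare against the blocks that changed.
head -c 4M /dev/zero | md5sum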

Hopefully I have time to look at the rest of your advice today or tomorrow.

This post added the logs collected while running the reproducer mentioned above.

Here are the results of the per-block test with listsnaps output.

Note that I had to use a variation of the command you specified in note #58707-33, because rados -p ssd ls takes more than 15 minutes (I did not let it run to completion). The results should be similar, though.

In the results you'll find the output of the per-block check before and after the issue triggers, both for the image and for the snapshot, as well as the listsnaps output before and after. They match up: the second listsnaps output is missing the clones for exactly those blocks that are now returned as zeroes.
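
One way to avoid the slow pool-wide listing is to derive the object names from the image's block_name_prefix and run listsnaps on each backing object directly. This is a sketch and may not be the exact variation used here:

# Object name prefix of the image, e.g. rbd_data.<id>.
prefix=$(rbd -p ssd info roel-disk-1 | awk '/block_name_prefix/ {print $2}')
# Dump the clone state of every object backing the 200M image (50 x 4M objects).
for i in $(seq 0 49); do
    rados -p ssd listsnaps "$(printf '%s.%016x' "$prefix" "$i")"
done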

Kicking this over to you, as this appears to be either a BlueStore or an OSD issue. To summarize what Roel is seeing: an RBD snapshot gets corrupted some time after a part (or all) of the RBD image is overwritten. The nature of the corruption is that the snapshotted data simply disappears: zeroes are returned instead when reading from the snapshot. It seems that the object clones are deleted underneath the RBD image even though the snapshot is still live. As expected, setting nosnaptrim prevents the corruption.

This happens both with kernel RBD and librbd but only on one particular pool of one particular cluster. There is some concerning history there (#58707-5) but the pool in question was reportedly re-created from scratch (#58707-10).
