@rincebrain
Last active May 5, 2024 01:22

So you want to understand what's going on with all these reports of "OpenZFS might be eating my data".

Here's a simple explanation of what is and isn't true.

A bunch of software uses "where are the holes in the file" as an optimization to avoid having to read large swathes of zeroes. Many filesystems implement a call like FIEMAP, which gives you a list of "these regions of the file have data", in addition to, or instead of, SEEK_HOLE/SEEK_DATA, which are extensions to lseek that tell you where the next hole or data region is from a given position in the file.
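
To make that concrete, here is a minimal sketch (not taken from any affected program; the file name is a placeholder and error handling is trimmed) of how a program walks a file's data regions with lseek and SEEK_DATA/SEEK_HOLE:

```c
/* Minimal sketch: enumerate the data regions of a file using
 * lseek(SEEK_DATA) and lseek(SEEK_HOLE). "sparse.bin" is a placeholder path. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("sparse.bin", O_RDONLY);
    if (fd < 0)
        return 1;

    off_t end = lseek(fd, 0, SEEK_END);
    off_t pos = 0;

    while (pos < end) {
        /* Start of the next data region at or after pos (fails if none remain). */
        off_t data = lseek(fd, pos, SEEK_DATA);
        if (data < 0)
            break;          /* everything from pos to EOF is a hole */
        /* The hole (or end of file) that ends this data region. */
        off_t hole = lseek(fd, data, SEEK_HOLE);
        printf("data: %lld..%lld\n", (long long)data, (long long)hole);
        pos = hole;
    }
    close(fd);
    return 0;
}
```

Filesystems that don't track holes can simply report the whole file as one data region, so callers of this pattern fall back to reading everything.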

OpenZFS, back in 2013, implemented support for SEEK_HOLE/SEEK_DATA, and that went into release 0.6.2.

On ZFS, because features like compression might decide that your large swathe of zeroes was actually a hole in the file and store it accordingly, all the pending dirty state for a file has to be flushed out to disk before ZFS can know accurately whether the file has holes, and where they are.

So ZFS implemented a check along the lines of "if this thing is dirty, force a flush" before answering SEEK_HOLE/SEEK_DATA queries.

Unfortunately, it turns out the "is this thing dirty" check was incomplete, so sometimes it could decide a thing wasn't dirty when it was, skip the flush, and give you stale information about the file's contents if you hit a very narrow window between one piece of state being flushed and another not.

If you actually read the data, you would get the correct contents, but if you skipped reading parts of it entirely because you thought they were empty, then your program has incorrect ideas about what the contents were, and might write those ideas out somewhere, or modify the "empty" areas and write the result out.

So if you, say, reproduced this with cp on a file, you might end up with regions of zeroes on the destination where none existed in the source, no matter what the destination filesystem is, if the source was on ZFS and you hit this bug.
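
As a hedged illustration (this is not coreutils' actual implementation, just a sketch of the general pattern), a sparse-aware copy only reads the ranges the filesystem reports as data and leaves the rest as holes, so stale hole information silently turns real data into zeroes in the destination:

```c
/* Sketch of a sparse-aware copy: read only the reported data regions and
 * leave the rest as holes in the output. If SEEK_DATA/SEEK_HOLE return stale
 * answers, data sitting inside a "hole" is never read and the copy gets
 * zeroes there instead. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int sparse_copy(int in, int out)
{
    char buf[65536];
    off_t end = lseek(in, 0, SEEK_END);
    off_t pos = 0;

    while (pos < end) {
        off_t data = lseek(in, pos, SEEK_DATA);
        if (data < 0)
            break;                      /* trailing hole: nothing left to read */
        off_t hole = lseek(in, data, SEEK_HOLE);

        lseek(out, data, SEEK_SET);     /* skip the hole in the output too */
        for (off_t off = data; off < hole; ) {
            size_t want = (size_t)(hole - off) < sizeof(buf)
                              ? (size_t)(hole - off) : sizeof(buf);
            ssize_t n = pread(in, buf, want, off);
            if (n <= 0 || write(out, buf, (size_t)n) != n)
                return -1;
            off += n;
        }
        pos = hole;
    }
    return ftruncate(out, end);         /* preserve length if the file ends in a hole */
}

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s SRC DST\n", argv[0]);
        return 2;
    }
    int in = open(argv[1], O_RDONLY);
    int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (in < 0 || out < 0)
        return 1;
    return sparse_copy(in, out) == 0 ? 0 : 1;
}
```

Note that the copy never reads the regions it believes are holes, which is exactly why a stale answer from the filesystem quietly becomes zeroes rather than an error.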

But if you were, say, compiling something, and the linker got zeroes for some of the objects, the output might not contain obvious runs of zeroes where things went wrong, because the linker isn't just copying the things it reads in, it's doing things to them and outputting a result.

So you can't just say "if I have regions of zeros, this is mangled" because plenty of files legitimately have large regions of zeroes. You also can't just say "if I have no regions of zeroes, this isn't mangled", because again, programs often read files and do things with the contents, not just write them out.

This isn't just a problem with using cp, or any particular program.

The good news is that this is very hard to hit without contrived examples, which is why it was around for so long. We hit a number of cases where someone made the window of incorrectness much wider; those got noticed very fast and undone, and we couldn't reproduce the problem with those contrived examples afterward.

It's also not very common because GNU coreutils only started using things like this by default with 9.0+, and even then only for the trivial case of cp; things outside of coreutils that read files might be doing god knows what.

So your best bet, if you have a known-good source copy or a hash of one, is to compare your files against that. Anything else is going to be a heuristic with false positives and false negatives.

Yes, that sucks. But life is messy sometimes.

robn commented Dec 1, 2023

@alpha754293

> Some reports state that this bug was just amplified, and therefore exposed this underlying issue, and therefore isn't related to the block cloning feature that OpenZFS 2.2.0 introduced

It's not directly related. The "write" part of the file change can be anything - write, clone, fill, etc. What may make a difference is the relative speed of those operations, and obviously a clone is much faster than a write. There may also be a second bug in cloning that contributed to this that we haven't found yet. This is part of the reason that cloning is still disabled in 2.2.2.

> whilst others have theorised that this issue may (potentially) have been there since the very beginning, but that theory does not appear to have any data or evidence attached to it which demonstrates or confirms that hypothesis.

While it hasn't been experimentally verified, the conceptual error that led to the problem appears to have been introduced in 2006. That doesn't mean that version is affected, as the bug comes from a combination of events occurring at just the right moments, and those other pieces may not have been present. So I wouldn't say it has "no data nor evidence", but rather circumstantial evidence that means it can't be ruled out.

> The way that @Bronek wrote that originally -- it made it sound like that for this error state to occur, it would have to be in the middle of a write or preparing for a write, and then reading that prep on the assumption that it was the intended data to be written, and then by reading the wrong data, the system thinks that it's the correct data and proceeds with this wrong data.

Again: at the time, the bug was not fully understood. That said, @Bronek's description is still pretty close. Even so, sync=always doesn't have anything to do with it - that causes all writes to be written to a journal, so in the event of a crash before the current transaction is committed, the changes can be replayed. The transaction commit stage still happens as normal, and that's where this bug lives. It will change the timing, but not in a way that can be predicted, so it's not an effective workaround.

> It is also difficult to assess a) if this issue exists with Sun ZFS and b) if it does, the severity of the issue, since Oracle closed the ZFS source down after they acquired Sun Microsystems. (Sun ZFS versioning is different than OpenZFS versioning.)

Only direct testing or asking Oracle can answer that.

Bronek commented Dec 1, 2023

> The way that @Bronek wrote that originally -- it made it sound like that for this error state to occur, it would have to be in the middle of a write or preparing for a write, and then reading that prep on the assumption that it was the intended data to be written, and then by reading the wrong data, the system thinks that it's the correct data and proceeds with this wrong data.

> Again: at the time, the bug was not fully understood. That said, @Bronek's description is still pretty close. Even so, sync=always doesn't have anything to do with it - that causes all writes to be written to a journal, so in the event of a crash before the current transaction is committed, the changes can be replayed. The transaction commit stage still happens as normal, and that's where this bug lives. It will change the timing, but not in a way that can be predicted, so it's not an effective workaround.

Thank you @robn for the explanation. I obviously missed the important detail.

jsddsfoh commented Dec 1, 2023

Thanks for writing this write-up!
I don't fully understand all of this, so maybe it's a stupid question, but does this also affect zvols and sending snapshots?

rincebrain (Author) commented

send/recv doesn't care, it only operates on things where everything is flushed out.

I can't immediately think of a way you could make a zvol break with this unless you were like, trying to cp the raw contents of the device, but I'll reply if I come up with something else.

ollien commented Dec 2, 2023

@rincebrain thanks for putting this post together and for your response to the community. If you don't mind clarifying your above comment for a lay user, does this mean that a resilver would similarly be unaffected? Is the only way this could happen with actions that write individual blocks?

rincebrain (Author) commented

scrub/resilver don't care either.

alpha754293 commented

@robn
"the conceptual error that led to the problem appears to have been introduced in 2006."

Admittedly, I clicked on the link for this, but I have no idea what I am looking at nor what I am looking for.

"That doesn't mean this version is affected, as the bug comes from a combination of events occuring at just the right moments, and those other things may not be present. So I wouldn't say it has "no data nor evidence", but rather, circumstantial evidence that means it can't be ruled out."
Agreed - but this was, again, part of my motivation to test ZFS version 32 in a Solaris 10 1/13 install: to hammer it similarly to the way (Open)ZFS is being hammered on Illumos/*BSD/Linux distros, to test for the prevalence of the issue outside of the context of OpenZFS.

(i.e. if it was a part of the "original" ZFS code, then it had always been lurking around, but never surfaced.)

But I ran into issues running zhammer.sh in said Solaris 10 1/13 install, because some of the commands in zhammer.sh are still rather Linux-specific and Solaris doesn't recognise them.

"The transaction commit stage still happens as normal, and that's the place this bug lives. Again, it will change the timing, but it can't be predicted. Thus, its not an effective workaround."
I appreciate your insight in regards to this. Thank you.

"Only direct testing or asking Oracle can answer that."
Yeah, I tried that - no cigar.
