@rincebrain
Last active May 5, 2024 01:22

So you want to understand what's going on with all these reports of "OpenZFS might be eating my data".

Here's a simple explanation of what is and isn't true.

A bunch of software uses "where are the holes in the file" as an optimization to avoid having to read large swathes of zeroes. Many filesystems implement a call like FIEMAP, which gives you a list of "these regions of the file have data", in addition to or instead of SEEK_HOLE/SEEK_DATA, which are extensions to lseek that tell you where the next hole or data region is from a given position in the file.
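
(For concreteness, here's a minimal sketch in Python of what a hole-aware reader does, using os.SEEK_DATA/os.SEEK_HOLE where the platform and filesystem support them; the path at the bottom is just a placeholder.)

```python
import os

def data_regions(path):
    """Return (offset, length) extents the filesystem reports as data.

    Uses the os.SEEK_DATA/os.SEEK_HOLE lseek extensions; on filesystems
    without hole support the whole file is reported as one data extent.
    """
    regions = []
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            try:
                start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError:
                break  # ENXIO: nothing but holes from here to EOF
            hole = os.lseek(fd, start, os.SEEK_HOLE)
            regions.append((start, hole - start))
            offset = hole
        return regions
    finally:
        os.close(fd)

print(data_regions("some_file"))  # placeholder path
```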

OpenZFS, back in 2013, implemented support for SEEK_HOLE/SEEK_DATA, and that went into release 0.6.2.

On ZFS, because it has features like compression which might decide that your large swathe of zeroes was actually a hole in the file and store it accordingly, it has to flush all the pending dirty data for a file out to disk before it can accurately tell you whether the file has holes, and where.

ZFS implemented a check along the lines of "if this thing is dirty, force a flush" before answering SEEK_HOLE/SEEK_DATA queries.

Unfortunately, it turns out the "is this thing dirty" check was incomplete, so sometimes it could decide a thing wasn't dirty when it was, skip the flush, and give you stale information about the file's contents if you hit a very narrow window between one part of the file being flushed and another not.

If you actually read the file, you would get the correct data; but if you skipped reading parts of it because you thought they were empty, then your program has incorrect ideas about what the contents were, and might write those ideas out somewhere, or modify the "empty" areas and write the result out somewhere.
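
(To make the shape of that race concrete, here's a rough Python sketch of the dangerous access pattern -- dirty a file, then immediately ask the filesystem where its data is instead of reading it. This is only an illustration of the pattern, not a reliable reproducer, and the filename is made up.)

```python
import os

PATH = "dirty_file"  # hypothetical file on a ZFS dataset

# Writer: dirty the file with non-zero data, without syncing it out.
with open(PATH, "wb") as f:
    f.write(b"\xff" * 1_000_000)

# Reader: ask "where is the data?" the way a sparse-aware copy would,
# instead of just read()ing the bytes.
fd = os.open(PATH, os.O_RDONLY)
try:
    first_data = os.lseek(fd, 0, os.SEEK_DATA)
except OSError:
    first_data = None  # file reported as entirely hole
finally:
    os.close(fd)

# On an affected ZFS, in a very narrow window while the file is still
# dirty, this can report a hole where there is data, even though a plain
# read() of the same range would have returned the correct bytes.
print("first data offset:", first_data)
```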

So if you, say, reproduced this with cp on a file, you might get regions of zeroes in the destination where none existed in the source, no matter what the destination filesystem is, if the source was on ZFS and you hit this bug.

But if you were, say, compiling something, and the linker got zeroes for some of the objects, the output might not be just zeroes where it found something wrong, because it's not just copying the things it reads in, it's doing things to them and outputting a result.

So you can't just say "if I have regions of zeroes, this is mangled", because plenty of files legitimately have large regions of zeroes. You also can't just say "if I have no regions of zeroes, this isn't mangled", because again, programs often read files and do things with the contents rather than just writing them back out.

This isn't just a problem with using cp, or any particular program.

The good news is, this is very hard to hit without contrived examples, which is why it was around for so long. We hit a number of cases where a change made the window of incorrectness much wider; those got noticed very fast and undone, and afterward we couldn't reproduce the problem with those contrived examples.

It's also not very common because GNU coreutils only started using calls like this by default with 9.0+, though that's just the trivial case of cp; things outside coreutils that read files might be doing god knows what.

So your best bet, if you have a known-good source copy or a hash from one, is to compare your stuff against that. Anything else is going to be a heuristic with false positives and false negatives.
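
(If you do have a known-good copy, the comparison itself is simple enough; a sketch with made-up paths, reading every byte so the check itself can't be fooled by hole reporting.)

```python
import hashlib

def sha256_of(path, bufsize=1 << 20):
    """Hash a file by reading every byte; plain reads are not affected by this bug."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(bufsize):
            h.update(chunk)
    return h.hexdigest()

# Made-up paths: a copy that passed through a possibly-affected system
# versus a known-good original.
if sha256_of("suspect/libfoo.o") != sha256_of("known-good/libfoo.o"):
    print("mismatch: the copy does not match the known-good source")
```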

Yes, that sucks. But life is messy sometimes.

gwr commented Nov 29, 2023

In all the (many, many) comments on this problem, I've yet to see a clear statement of what change introduced it.
Does anyone know?

@rincebrain (author)

That's a complicated question.

The flaw, as written, was arguably in the original commit adding ZFS support in Solaris. (Maybe the expectations were different back then and it wasn't written down well? Dunno.)

But in practice, you couldn't hit this without something that cared about the distinction between the dnode being dirty and the buffers representing the modified contents being dirty, so you can't hit it at least until 802e7b5fe (committed 2013, in 0.6.2 in April 2015). And even then, the code syncs txgs so often that, at least in my quick experiments, it doesn't reproduce reliably until 905edb405d (committed 2015, in 0.6.5 in September 2015), which is probably the oldest example of "oops, the gap got wider".

And that's just a rather quick excavation; there are probably lots more details there, and I'm eliding a lot myself, like how there were 3 or 4 different iterations of the check before the commit I linked that don't matter directly here, because they were all before you had any way for something to care.

alpha754293 commented Dec 1, 2023

@rincebrain
So... in the article published by The Register, it cited @Bronek, who stated:

"...as Bronek Kozicki spells out on GitHub:

You need to understand the mechanism that causes corruption. It might have been there for decade and only caused issues in a very specific scenarios, which do not normally happen. Unless you can match your backup mechanism to the conditions described below, you are very unlikely to have been affected by it.

1. a file is being written to (typically it would be asynchronously – meaning the write is not completed at the time when writing process "thinks" it is)
2. at the same time when ZFS is still writing the data, the modified part of file is being read from. The same time means "hit a very specific time", measured in microseconds (that's millionth of a second), wide window. Admittedly, as a non-developer for ZFS project, I do not know if using HDD as opposed to SSD would extend that time frame.
3. if it is being read at this very specific moment, the reader will see zeros where the data being written is actually something else
4. if the reader then stores the incorrectly read zeroes somewhere else, that's where the data is being corrupted" (Source: https://www.theregister.com/2023/11/27/openzfs_2_2_0_data_corruption/)

-- is what he wrote not exactly true then?

My other thought about this was that because @Bronek mentioned asynchronous writes specifically -- if we set the zfs sync property to always, that should, in theory, either significantly reduce the chances of this happening or eliminate it entirely.

edit
I ask about Solaris ZFS (even though you have already written a one-liner comment about it) because I am confused -- if the flaw was part of the original Solaris ZFS code, then the original Solaris ZFS may not have had, or would not have had, the commit that you reference above which led to this issue, correct? So... I'm a little bit confused as to whether this issue manifests in the original Solaris ZFS code or if it was a result of the 2013 commit that you referenced above. Thank you.

Would this assumption of mine be false?

I wonder if the Internet Archive would have an archived copy of the original ZFS source code from the OpenSolaris project, so that it can be compared to see whether this issue existed back then or whether, as you stated, it started with the 2013 commit to the OpenZFS code that you mentioned (which might suggest that if that commit wasn't adopted by Oracle ZFS, then Oracle ZFS is potentially in the clear).

Are you able to comment on that? I'm just curious.

Thank you.

@rincebrain (author)

You can reproduce this on current illumos and FreeBSD, modulo the new OpenZFS release and whatever illumos decides to do. You could also reproduce this on FreeBSD's pre-OpenZFS ZFS support, except they hit a bug that looks very familiar now and disabled SEEK_HOLE support in 12.x.

robn commented Dec 1, 2023

> My other thought about this was that because @Bronek mentioned asynchronous writes specifically -- if we set the zfs sync property to always, that should, in theory, either significantly reduce the chances of this happening or eliminate it entirely.

I think this was partly that our understanding of the problem was still developing at the time, and partly that this is prose, not technical documentation. Maybe read "concurrent" rather than asynchronous. But no, sync=always is unrelated, and won't affect the timing in any predictable way. If you can't get an updated ZFS (2.2.2, 2.1.14, or a patch from your vendor) then dmu_offset_next_sync=0 is the next best thing.
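
(For anyone looking for the knob: on Linux, my understanding is that this tunable is exposed as the zfs module parameter zfs_dmu_offset_next_sync under /sys/module/zfs/parameters, and can be flipped at runtime as root; FreeBSD exposes an equivalent sysctl. A minimal sketch, assuming that Linux path:)

```python
# Assumed Linux sysfs path for the tunable; requires root.  A persistent
# setting would go in a modprobe.d options file instead.
PARAM = "/sys/module/zfs/parameters/zfs_dmu_offset_next_sync"

with open(PARAM, "w") as f:
    f.write("0\n")
```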

@alpha754293

@rincebrain
Thank you.

Yeah, I am still trying to see if I can test this on Oracle Solaris 10 1/13 rather than illumos as part of my risk assessment of the prevalence of the issue.

Some reports state that this bug was just amplified and therefore exposed this underlying issue, and therefore isn't related to the block cloning feature that OpenZFS 2.2.0 introduced, whilst others have theorised that this issue may (potentially) have been there since the very beginning, but that theory does not appear to have any data or evidence attached to it that demonstrates or confirms that hypothesis.

@robn
The way that @Bronek wrote that originally -- it made it sound like, for this error state to occur, it would have to be in the middle of a write or preparing for a write, and then reading that prep on the assumption that it was the intended data to be written, and then by reading the wrong data, the system thinks that it's the correct data and proceeds with this wrong data.

(Again -- I'm not a developer, so for a grossly underqualified sysadmin -- this was what I got from those remarks.)

On the surface -- that explanation seems plausible as to why this was happening. (I did read your posts about the fixes that are in the process of being merged and cascaded through to the vendors.)

It is also difficult to assess a) if this issue exists with Sun ZFS and b) if it does, the severity of the issue since Oracle closed the ZFS source down after they acquired Sun Microsystems. (Sun ZFS versioning is different than OpenZFS versioning.)

robn commented Dec 1, 2023

@alpha754293

> Some reports state that this bug was just amplified and therefore exposed this underlying issue, and therefore isn't related to the block cloning feature that OpenZFS 2.2.0 introduced

It's not directly related. The "write" part of the file change can be anything - write, clone, fill, etc. What may make a difference is the relative speed of those operations, and obviously a clone is much faster than a write. There may also be a second bug in cloning that contributed to this that we haven't found yet. This is part of the reason that cloning is still disabled in 2.2.2.

> whilst others have theorised that this issue may (potentially) have been there since the very beginning, but that theory does not appear to have any data or evidence attached to it that demonstrates or confirms that hypothesis.

While it hasn't been experimentally verified, the conceptual error that led to the problem appears to have been introduced in 2006. That doesn't mean this version is affected, as the bug comes from a combination of events occurring at just the right moments, and those other things may not be present. So I wouldn't say it has "no data nor evidence", but rather circumstantial evidence that means it can't be ruled out.

> The way that @Bronek wrote that originally -- it made it sound like, for this error state to occur, it would have to be in the middle of a write or preparing for a write, and then reading that prep on the assumption that it was the intended data to be written, and then by reading the wrong data, the system thinks that it's the correct data and proceeds with this wrong data.

Again: at the time, the bug was not fully understood. That said, @Bronek's description is still pretty close. Even so, sync=always doesn't have anything to do with it - that causes all writes to be written to a journal, so in the event of a crash before the current transaction is committed, the changes can be replayed. The transaction commit stage still happens as normal, and that's the place this bug lives. Again, it will change the timing, but it can't be predicted. Thus, it's not an effective workaround.

> It is also difficult to assess a) if this issue exists with Sun ZFS and b) if it does, the severity of the issue since Oracle closed the ZFS source down after they acquired Sun Microsystems. (Sun ZFS versioning is different than OpenZFS versioning.)

Only direct testing or asking Oracle can answer that.

Bronek commented Dec 1, 2023

> The way that @Bronek wrote that originally -- it made it sound like, for this error state to occur, it would have to be in the middle of a write or preparing for a write, and then reading that prep on the assumption that it was the intended data to be written, and then by reading the wrong data, the system thinks that it's the correct data and proceeds with this wrong data.

> Again: at the time, the bug was not fully understood. That said, @Bronek's description is still pretty close. Even so, sync=always doesn't have anything to do with it - that causes all writes to be written to a journal, so in the event of a crash before the current transaction is committed, the changes can be replayed. The transaction commit stage still happens as normal, and that's the place this bug lives. Again, it will change the timing, but it can't be predicted. Thus, it's not an effective workaround.

Thank you @robn for the explanation. I obviously missed the important detail.

jsddsfoh commented Dec 1, 2023

Thanks for writing this conclusion!
I don't fully understand all of this, so maybe it's a stupid question, but does this also affect zvols and sending snapshots?

@rincebrain (author)

send/recv doesn't care; it only operates on things where everything is flushed out.

I can't immediately think of a way you could make a zvol break with this unless you were like, trying to cp the raw contents of the device, but I'll reply if I come up with something else.

ollien commented Dec 2, 2023

@rincebrain thanks for putting this post together and for your response to the community. If you don't mind clarifying your above comment for a lay user, does this mean that a resilver would similarly be unaffected? The only way this could happen is with actions that would write individual blocks?

@rincebrain (author)

scrub/resilver don't care either.

@alpha754293

@robn
"the conceptual error that led to the problem appears to have been introduced in 2006."

Admittedly, I clicked on the link for this, but I have no idea what I am looking at nor what I am looking for.

"That doesn't mean this version is affected, as the bug comes from a combination of events occuring at just the right moments, and those other things may not be present. So I wouldn't say it has "no data nor evidence", but rather, circumstantial evidence that means it can't be ruled out."
Agreed - but this was, again, part of my motivation to test ZFS version 32 in a Solaris 10 1/13 install: to hammer it similarly to the way that (Open)ZFS is being hammered on Illumos/*BSD/Linux distros, to test for the prevalence of the issue outside of the context of OpenZFS.

(i.e. if it was part of the "original" ZFS code, then it had always been lurking around -- but never surfaced.)

But I ran into issues running zhammer.sh in said Solaris 10 1/13 install, because some of the commands in zhammer.sh are still rather Linux-specific and Solaris doesn't recognise them.

"The transaction commit stage still happens as normal, and that's the place this bug lives. Again, it will change the timing, but it can't be predicted. Thus, its not an effective workaround."
I appreciate your insight in regards to this. Thank you.

"Only direct testing or asking Oracle can answer that."
Yeah, I tried that - no cigar.
