Extremely WIP

subject to change, opinionated, etc

These are all very much subject to personal opinion and experience, so please, if you want to suggest I change something, don't think I am asserting I am the only authority on the subject. I just got tired of explaining the same things to people.

This is just a collection of thoughts that aren't concrete or robust enough to want to try and get into the OpenZFS wiki, for example, but that I keep needing to summarize for people. I expect that as I link this document more, people are going to start calling me out on anything I'm asserting that is factually incorrect (or that they dislike). I'll try to remember to use strikethrough when I change things instead of just deleting them outright, but you can always go see the document history if you like.

Key notion

  • ZFS really avoids data changing on disk once written. I often phrase this as "ZFS doesn't change the past."
    • This means if you change {checksum, compression, ...} on a dataset, old data keeps the old version unless you rewrite it somehow, either by modifying it or doing a cp-then-rm or send|recv with new properties or w/e.
    • No, there's not a "go rewrite it all for me" command, implementing that would involve extreme levels of suffering.

What is a vdev?

  • Conceptually, a pool is laid out with one or more "virtual devices" (vdevs) under the "root" of the tree representing the pool.
  • A vdev, currently, can consist of:
    • a single data disk or file
    • a mirror node in which the data ends up laid out identically on all the "disk" nodes below it (so address 1234 has the same data on all children of a "mirror" node, and any of the nodes, modulo falling out of sync from e.g. faulting out or being zpool offlined or missing at import or the like, can be used to service reads)
    • a raidz or draid node, in which the data is laid out in a more complicated data+parity-based configuration, and getting the data back requires reading multiple disks
    • a special node, which is otherwise an ordinary single disk/mirror vdev (no parity for you), but is used exclusively for metadata/small records. ("dedup" nodes are "special" nodes used solely for the DDT.)
    • (The above are all that's effectively included if someone talks about "ordinary/data vdevs" - e.g. "anything that comes below this in the list is not included". If someone talks about "leaf vdevs", they mean the single disk/files at the bottom-most level of the "tree" this conceptually forms.)
    • a cache node which cannot contain anything but a single disk/file each
    • a log node which can contain a single disk/file or a mirror
    • a hole node is secretly what's left if you do "zpool remove" on a cache/log device, it doesn't show up in zpool status directly, but you can tell it's there with zdb or from noticing a gap in the numbering of the vdevs.
    • an indirect node is similarly what's left when you zpool remove a data vdev.
    • Finally, spares are notably strange in that you can share them across multiple pools (though they can only be actually filling in for a disk in a single pool at once). They are single disk/files only.
  • Got all that? Well, it gets messier, because any time you're doing a zpool replace, you effectively turn a single disk node in your vdev tree into a mirror node with the old and new disks as children until the old one gets detached at replace completion...and if something faults again, this can end up nesting.
  • You cannot, say, make a mirror vdev of raidz vdevs or vice-versa, in general, with the CLI utilities.
    • No comment on whether the underlying code supports it.
    • ztest, I believe, used to exercise many "impossible" configurations to test the underlying code worked with them, but this changed around 2.1 with draid merging.
  • vdevs have a number of individual properties, which you can't really see outside of zdb until version $NEXT.
    • Most memorably, ashift is a per-vdev property.
    • ashift, as a quick refresher, is the smallest unit size the vdev can work with at once - analogous to the sector size of the underlying media.
    • Mixing different ashifts in a pool can have strange outcomes, since your storage efficiency becomes different depending on which vdev data is written to, and it precludes you being able to ever use zpool remove on data vdevs, so always be careful about what the ashift of a new vdev is going to be when adding it.

ashift and you

  • ashift is a per-vdev property, set when the vdev is created and not changeable afterward, which is the smallest whole allocation you can make on that vdev, expressed as an integer power of 2.
  • You cannot store a record on a vdev as any size but integer multiples of that unit.
  • This means that, for example, if you have something that was 11.5KB after compression:
    • On an ashift=9 (512B) vdev, you would take up 11.5KB (23 512B allocations) (before any fun with raidz/draid overheads or gang blocks).
    • On an ashift=12 (4KB) vdev, you would take up 12KB (3 4KB allocations).
    • On an ashift=13 (8KB) vdev, you would take up 16KB (2 8KB allocations).
  • And since a lot of ZFS metadata is small objects, you can imagine how this inflates outcomes if you use a larger ashift and lots of tiny records.
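
A minimal sketch of that rounding in Python, just to make the arithmetic concrete (the sizes and ashift values are the examples from this section):

import math

def allocated_size(payload_bytes, ashift):
    # The smallest allocation unit on the vdev is 2^ashift bytes;
    # every allocation is rounded up to a whole number of those units.
    unit = 1 << ashift
    return math.ceil(payload_bytes / unit) * unit

payload = int(11.5 * 1024)  # 11.5KB after compression
for ashift in (9, 12, 13):
    print(f"ashift={ashift}: {allocated_size(payload, ashift)} bytes")
# ashift=9:  11776 bytes (11.5KB, 23 x 512B)
# ashift=12: 12288 bytes (12KB, 3 x 4KB)
# ashift=13: 16384 bytes (16KB, 2 x 8KB)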

vdev layouts

  • Broadly (and avoiding a few complications like special vdevs), IOPs scale with the number of "normal" (e.g. not log/cache/etc) vdevs, not disks in them
  • So an 8-disk pool consisting of one raidz is going to have the IOPs of one disk, approximately, while a pool of 4 2-way mirror vdevs will, broadly, have the IOPs of 4
    • (Well, I believe you could get up to the read IOPs of 8 for that mirror case, depending on how well the scheduler picks disks to dispatch IOs to, but the write IOPs would top out at 4 disks.)
  • For that among other reasons, stripes of mirrors often perform better than raidz or draid
    • ...but of course, most people can't afford the overhead required for 3+-way mirrors to avoid having the wrong two disks fail and break everything
  • raidzP vdevs are effectively dead if you lose more than P disks to total failure.
  • raidz (and, even moreso, draid) vdevs introduce bloat in how much space it takes to store something, which can lead to very surprising space outcomes. (See here for more detailed explanations and spreadsheets of overhead.)
    • For example, let's say you wanted to write 28k to a 4k sector raidz (a rough sketch reproducing these numbers in code follows this list).

(RAIDZ enforces that you must allocate a multiple of P+1 so that you can't end up with holes that are impossible for the raidz to ever allocate)

(Px are parity blocks, Dx are the actual data you wanted to store, Xx are filler rounding up to meet constraints of P+1-sized allocations)

  • So if it were a 6-disk raidz1, we'd be paying 40k - we'd write across the 6 disks (not necessarily in this order, there's some complicated code in there):
P0 D0 D1 D2 D3 D4
P1 D5 D6 X0
  • If it were a 7-disk raidz2, we'd be paying 48k:
P0 P1 D0 D1 D2 D3 D4
P2 P3 D5 D6 X0
  • Or an 8-disk raidz3, a whopping 64k:
P0 P1 P2 D0 D1 D2 D3 D4
P3 P4 P5 D5 D6 X0 X1 X2
  • I'm not confident enough that I understand how dRAID IO works to make this anything but a vague warning, but random and/or small writes on dRAID are even worse than on RAIDZ. Plan accordingly.
  • dRAID vdevs of layout draidP:N:C:S can only survive up to P disks failing across the entire vdev, because of the games it plays with shuffling which N+P disks make up the raidz-ish stripes that compose the vdev.
    • Yes, really. A draid2:6d:49c:1s can only survive any 2 of those 49 disks failing; if a third fails before a resilver finishes, it's game over.
  • (IMO) dRAID is only useful if:
    • (Always) You actually use the distributed hot spare feature - rebuild times that scale with the number of disks are a drastic difference, and not making use of that is tragic.
    • (Probably always) You're very much doing just highly sequential large IO, unless you're mitigating the IOPs and small IO penalties with a special vdev appropriately configured.
    • (Maybe) You want to make much wider vdevs than is practical with raidz, and have nonzero redundancy, but not the same as you'd get from independent raidz vdevs of the same N+P ratio
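
Here's a rough Python sketch that reproduces the 28k examples above. It just counts 4k sectors and applies the two rules described (one parity sector per stripe row, plus padding up to a multiple of P+1); the real allocator rotates columns and is considerably more involved:

import math

def raidz_allocated(data_bytes, nparity, ndisks, ashift=12):
    # Count sectors of 2^ashift bytes for one RAIDZ write:
    # data sectors, plus nparity parity sectors per stripe row,
    # rounded up to a multiple of (nparity + 1) so no unallocatable
    # hole is left behind.
    sector = 1 << ashift
    data_sectors = math.ceil(data_bytes / sector)
    data_columns = ndisks - nparity
    rows = math.ceil(data_sectors / data_columns)
    total = data_sectors + rows * nparity
    total = math.ceil(total / (nparity + 1)) * (nparity + 1)
    return total * sector

for nparity, ndisks in ((1, 6), (2, 7), (3, 8)):
    kib = raidz_allocated(28 * 1024, nparity, ndisks) // 1024
    print(f"raidz{nparity}, {ndisks} disks: {kib}k allocated for 28k of data")
# raidz1, 6 disks: 40k; raidz2, 7 disks: 48k; raidz3, 8 disks: 64k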

Special vdevs

  • Special vdevs are vdevs dedicated solely to being used for metadata, small records, or both.
  • "dedup" vdevs are special vdevs that are solely used for storing dedup metadata.
  • These are not the same implementation as Nexenta's "special" vdevs, so if you find documentation about those, do not remotely think it applies.
  • Special vdevs should, more or less always, be on fast flash of some kind, since one of the intended goals is to mitigate some of the worst caveats about RAIDZ/DRAID (inefficient and slow small and random IO).
  • Special vdevs, if they lose all their redundancy, will result in a dead pool just as much as normal data vdevs would, so you should mirror them as much as you feel is necessary to be comfortable with the failure chance.
  • Special vdevs can be very useful if your workload involves a lot of metadata or small record updates, especially if you would otherwise be writing them to draid or raidz vdevs.
  • The usual way to adjust behavior of special vdevs per dataset is the property special_small_blocks, which tells the pool that, in addition to metadata, any record which is (after compression etc.) <= that value should go on the special vdev.
  • Note that if the special vdev has less than zfs_special_class_metadata_reserve_pct percent of its space free, it will stop storing small data blocks there and only put metadata on it, until such time as this is no longer true. (A rough sketch of this placement decision follows this list.)
  • You can adjust which data gets put on special vdevs, and when, with the global tunables zfs_ddt_data_is_special, zfs_user_indirect_is_special, and zfs_special_class_metadata_reserve_pct. (These should probably become vdev properties in the post-vdev-properties era, but that's not in a stable release yet...)
  • zvol data blocks are, at this time, always ineligible for being stored on a special vdev no matter what you set special_small_blocks to - their metadata will get put on there like usual, but the actual data will not.
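
A rough Python sketch of the placement rules described above. This is an illustration of the behavior as summarized here, not the actual allocator code, and it assumes the reserve defaults to 25% (zfs_special_class_metadata_reserve_pct), which I believe is the current default:

def allocation_class(is_metadata, is_zvol_data, block_size,
                     special_small_blocks, special_free_pct,
                     metadata_reserve_pct=25):
    # Metadata always prefers the special vdev.
    if is_metadata:
        return "special"
    # zvol data blocks are currently never eligible for the special vdev.
    if is_zvol_data:
        return "normal"
    # Once the special vdev's free space drops below the reserve
    # percentage, small data blocks stop going there.
    if special_free_pct < metadata_reserve_pct:
        return "normal"
    # Otherwise, data records at or below special_small_blocks go to special.
    return "special" if block_size <= special_small_blocks else "normal"

print(allocation_class(False, False, 8 * 1024, 16 * 1024, 60))    # special
print(allocation_class(False, False, 128 * 1024, 16 * 1024, 60))  # normal
print(allocation_class(False, False, 8 * 1024, 16 * 1024, 10))    # normal (reserve hit)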

Misc Dataset Property Tradeoffs

  • With volblocksize or recordsize, you're making similar tradeoffs. The larger the size you use, the better compression will work (since compression operates on single whole records at a time), and the less overhead you'll have in the various places that account for things as records. But since ZFS uses copy-on-write mechanics for Everything(tm), a larger size also means more data to copy out, modify, and write someplace new each time you make a modification - so if you, say, modify 8k spanning the boundary between two records of a 1M recordsize file, you'd need to read 2 MB (if you haven't already), make the modifications, then allocate and write 2 MB to the pool (assuming no raidz etc). (A small sketch of this amplification follows this list.)
    • Repeat after me - recordsize=4k (or volblocksize=4k) is almost always a terrible idea - on ashift=12 or above, you can get 0 compression from that, and a lot of overheads scale by number of records. So don't do it. zfs create will even tell you it's a bad idea if you try it. (The default volblocksize used to be 8K, then was raised because people decided the cost/benefit leaned more toward 16K+ for almost every use case.)
  • atime is almost never useful. Use relatime if you think you need it sometimes and atime=off otherwise.
    • relatime only does anything if atime is still set to on, and is ignored otherwise. So you would need both atime=on and relatime=on to get relatime's benefit.
  • Quotas can be useful, but the way they are implemented, they globally throttle write IO to a pool when anyone is writing and near enough to their quota that ZFS estimates they might exceed it in the next transaction group, by forcing txg commit Now(tm). There's a PR to improve the rounding estimate it uses for how close you can get before it does that, but as it stands, it can still be a bad time.
    • I've heard one or two people think aloud about the notion of a quota enforcement type where it doesn't do this, in exchange for allowing people to possibly exceed their quota by some calculated epsilon, but AFAIK nobody's tried implementing it. (Update: apparently some people are actually implementing this In The Future(tm) after discovering that zvols are effectively calculated as though they're going to ram into the quota a lot, and it bottlenecks you.)
  • zvols, by default, come with a refreservation equal to their size, in order to ensure that if they're snapshotted, there's always enough space available to overwrite the whole zvol once, because if that isn't true, you might get ENOSPC on trying to write to the zvol, and that's just going to be an IO error - and nothing that wants a block device will be happy about that.
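
A small Python sketch of the copy-on-write amplification described in the first bullet above, just counting bytes (it assumes the touched records aren't already cached and ignores raidz/compression):

def cow_rmw_bytes(record_size, write_offset, write_len):
    # Copy-on-write: every record touched by the write has to be read
    # (if not cached), modified, and written out in full to a new location.
    first = write_offset // record_size
    last = (write_offset + write_len - 1) // record_size
    return (last - first + 1) * record_size

MiB = 1024 * 1024
# 8k write straddling the boundary between two 1M records: 2 MiB rewritten.
print(cow_rmw_bytes(1 * MiB, 1 * MiB - 4096, 8192))   # 2097152
# The same 8k write landing inside a single 128k record: 128 KiB rewritten.
print(cow_rmw_bytes(128 * 1024, 0, 8192))             # 131072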

Checksums

  • If you want to use dedup, and not have to do a verification pass, you need a dedup-safe checksum, which is currently:
    • sha256, sha512/256, skein
  • If you want to use nopwrite, you get all the above options plus edonr
  • VERY GENERALLY, absent hardware acceleration, things go:
  • HOWEVER, once one of the several PRs which takes advantage of hardware SHA2 support lands, SHA256/SHA512 may become the fastest option by far for people with that hardware (An incomplete list: AMD Zen+, Intel Ice Lake+, ARMv8-A + sha2 extension (sorry Raspberry Pi 3/4)).

Compression

  • Having any compression enabled enables certain optimizations like zero detection, so that's an upside independent of whichever algorithm.
  • LZ4 is sufficiently close to free on almost any hardware that you should just use it by default, even if your data doesn't compress much. (Which is why the default after OpenZFS 2.1 was changed to compression=on.)
  • Broadly, compression works markedly better the larger the record you give it, and since ZFS records are compressed independently, having a larger recordsize will usually yield better compression ratio.
  • IN GENERAL, performance is going to go LZ4 > zstd-fast > zstd > gzip.
    • Some people have reported better performance/compression savings with zstd-fast levels than LZ4, some have reported the opposite. I am not trying to pick a fight, but in general, my experience is that LZ4 is better on the compression ratio/performance tradeoff than zstd-fast, mostly because zstd-fast often fails to compress some things of mine. YMMV.
  • zstd is effectively always better performing than gzip
    • zstd is only supported with OpenZFS 2.0+
    • Early versions of OpenZFS zstd had an interoperability bug across endiannesses, so run OpenZFS 2.0.7+ or 2.1.1+ if you need that interoperability.
    • zstd's compression ratio improves only marginally past the first few levels (1-3), but the costs scale drastically. I would strongly discourage using levels above 6 or so unless you really don't need write speed, and anything above 9 is almost never a good idea unless you care not at all about write speed.
    • ...until openzfs/zfs#13244 is included in the release you're running, but even then, on actually highly compressible data, kiss your write speed goodbye.
    • zstd decompression speed is approximately independent of compression level, so zstd-19 is going to decompress similarly fast to zstd-3 (particularly on ZFS, where we're just decompressing very finite blocks, not possibly long streams where more complicated things come into play...)
  • gzip output varies between Linux and non-Linux platforms, so if you're relying on dedup or nopwrite and gzip, you should probably change one of those. (Don't use dedup.)
  • IF someone ever updates the compressors in a way that changes the output, it would mean nopwrite/dedup don't necessarily do what you might hope.
    • At least my implementations of this involve a property for changing the compressor version if you need to keep the old behavior, but I'm not currently proposing them for merge, as the gains are marginal to negative in the compressors.

Dedup

  • Why must you hurt me like this
  • Please don't.
  • Dedup on ZFS is inline, meaning it happens as you're doing the write initially, and unless/until BRT lands, that's your only dedup option on ZFS.
  • Dedup tables on ZFS are global to a pool, so if you write the same data to pool/data1 and pool/data2, and they both have dedup enabled and the same checksum/compression/encryption settings, they will dedup against each other.
  • Dedup will only dedup across things in the same dataset for encrypted datasets, for several reasons.
    • Notably, receiving a zfs send -w stream will not dedup the contents, (I believe) even if you have the key loaded.
  • Dedup requires, in no particular order:
    • An understanding that once you turn it on, any data written with it on will require being rewritten via send|recv or cp or w/e after turning it off for the costs associated to depart.
    • A bunch of RAM to keep the dedup hash tables in RAM to consult for every new write and free on a deduplicated dataset. (Ballpark it around 320 bytes of RAM per record, which is going to be around 2.5 GiB of RAM for a TiB of 128k records - the arithmetic is sketched after this list.)
    • (You can keep the DDT on special vdevs, either generic special vdevs or specially allocated "dedup" special vdevs that only store the DDT. This helps performance a bunch, but you still need some RAM for it.)
    • A checksum stronger (and thus more expensive) than fletcher4
    • The settings for "dedup" are of the form "on" or "[checksum name]" or "[checksum name],verify"
    • They override the "checksum" property when dedup is not set to "off"
    • The ",verify" options do an actual memcmp between the two blocks if the hash matches before treating it as a dupe.
    • A lot more computation to do said checksum (unless you're reading this after more of the hardware accelerated checksum work lands and you're on applicable hardware)
    • An order of magnitude or more increased time to write your data depending on how fast your storage is
    • Yes, really.
  • As mentioned, once dedup is on, even if you turn it off, the penalties of having to check the dedup tables for frees of blocks in it still remain.
  • If/when BRT lands, it will have a very different set of tradeoffs, which I'll elaborate on once the implementation is finalized and benchmarked to hell and back.
  • It would be possible, in theory, to improve the performance of the existing dedup implementation in many cases, but so far nobody with a vested interest in using dedup has done the work (see mahrens' "Dedup doesn't have to suck" talk.)
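
To make the RAM ballpark above concrete, a quick estimate in Python (the ~320 bytes per entry is the rough figure from this section, not an exact on-disk or in-core size):

def ddt_ram_estimate(stored_bytes, avg_record_bytes, bytes_per_entry=320):
    # One DDT entry per unique record written with dedup on,
    # at roughly 320 bytes of RAM each.
    return (stored_bytes / avg_record_bytes) * bytes_per_entry

TiB = 1024 ** 4
GiB = 1024 ** 3
print(ddt_ram_estimate(1 * TiB, 128 * 1024) / GiB)   # ~2.5 GiB for 1 TiB of 128k records
print(ddt_ram_estimate(10 * TiB, 16 * 1024) / GiB)   # ~200 GiB for 10 TiB of 16k records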

Encryption

  • My current advice remains just use LUKS or GELI.
  • If I knew when it was initially merged what I know now, I'd have aggressively pushed for it to be reverted until it was fixed.
    • Unfortunately three major releases shipping it means that's not so much an option any more.
  • It's very unpolished.
    • zdb can't peer into anything encrypted, it just says "it's encrypted" and gives up.
    • Git tip has support for zdb doing this. I haven't tested it much, and I think it just unlocks things, not lets you examine the encrypted data very much, but still, better than nothing.
    • (This isn't a fundamental limitation, it's just nobody wrote it.)
    • Key management is cumbersome and buggy.
    • Nobody's reported any flaws in the actual encryption.
  • It has bugs dating back to the initial merge. Some of them are even known.
    • Most (but not all) of them require send|recv.
    • Some people never hit the bugs, because a lot of them are races or highly data-dependent, and some people very readily do.
    • Nobody's actively working on fixing them as of this writing because the original contributors are no longer maintaining the code, and nobody else has volunteered.
  • One class of bugs involves incorrectly updating the "key" an encryptionroot thinks its key is encrypted with, so that next time you need to unlock the datasets (say, on reboot), it will never succeed.
    • Special unmerged code is required to recover the encrypted datasets in this case.
    • Most of them involve mixing zfs send -i and having done zfs change-key on one or both sides.
  • As alluded above, most of them require using send/recv and receiving a dataset encrypted.
    • Notably, at least some of them can still trigger even if you did zfs send unencrypted and just do a zfs recv -o encryption=on.
  • Very few of them involve actual theoretically unrecoverable data loss (versus "just" setting the incorrect key and requiring as-yet-unmerged error recovery code).
  • Currently the most common one involves triggering a panic in the ARC.
  • Again, right now, my advice remains just don't.

Volumes/ZVOLs

  • zvols are block devices whose backing storage is the zpool you create them on
  • Think "lvcreate" on an LVM volgroup, for example
  • Have a fixed blocksize at creation time - volblocksize
  • All blocks on a zvol are the same logical size - the volblocksize. Compression et al can apply.
  • As "fixed" suggests, volblocksize cannot be changed after zvol creation.
  • volblocksize defaults to 8k/16k depending on version of OpenZFS
  • As with any small recordsize/volblocksize, using this with {high ashift vdevs, raidz/draid vdevs} can have extreme overhead properties. See the section on raidz above for more explanation.
  • A small difference between the vdev's minimum allocation size (ashift) and the volblocksize can mean compression almost never (or actually never) saves enough to be worth storing compressed - e.g. if your smallest allocation is 8k and your volblocksize is 16k, you'd need to save at least 50% for the compressed version to take any less space on disk (see the sketch below).
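
The sketch mentioned above: whether compression actually saves anything comes down to whether the compressed block needs fewer minimum-size allocations than the uncompressed one (min_alloc standing in for 2^ashift on the vdev):

import math

def saves_space(volblocksize, compressed_size, min_alloc):
    # Compression only helps if the compressed block occupies fewer
    # min_alloc-sized allocations than the uncompressed block would.
    return math.ceil(compressed_size / min_alloc) < math.ceil(volblocksize / min_alloc)

# volblocksize=16k on a vdev whose smallest allocation is 8k:
print(saves_space(16 * 1024, 9000, 8 * 1024))   # False - 45% smaller, but still 2 x 8k
print(saves_space(16 * 1024, 8000, 8 * 1024))   # True  - >=50% smaller, fits in 1 x 8k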

Suspend to RAM and suspend to disk

  • I'm tired of people arguing with me about this when I have no personal use case for it, so it's gone now.

Snapshots and bookmarks, or a handwavey explanation of why zfs send works

  • First, a note - a snapshot of dataset foo/bar would be named something like foo/bar@snap1, and a bookmark would likewise be something like foo/bar#snap1.
    • (Bookmarks do not have to have the same name as the snapshot they're based on, but I'm doing so here for clarity of what they were made from.)
  • Approximately everything that gets written on ZFS has transaction group entries for the block, noting when it was created and, eventually, deleted.
  • Transaction groups are poolwide and numbered monotonically increasing - if you have something written at txg 12, it is guaranteed that txg 13 was committed "later", and so on.
  • Snapshots are datasets which are children of existing datasets (filesystems or volumes), which represent a point in time copy of the dataset at the time the snapshot was taken, in its entirety.
  • Conceptually, you can think of this as containing a transaction group number that represents the point in time on the dataset, and an implicit "hold" that stops ZFS from deleting any of the metadata/data on disk that was referenced at that txg on the dataset even if you have since deleted it on the live dataset.
  • The "USED" column for snapshots represents "if I deleted this snapshot, and only this snapshot, how much space would I free up"
    • Deleting snapshots can therefore affect the USED of other snapshots around it, and the sum of all USED entries for multiple snapshots does not necessarily indicate the total amount of space one would see freed if you deleted them all.
  • As suggested by the above, you can know when it's safe to delete something in ZFS "for real" from the pool when there are no longer any snapshots which contain the window between its creation and deletion.
  • zfs send serializes this point in time copy of the dataset to replicate it elsewhere, with flags to control things like whether it also copies the pool's custom properties, or leaves already-compressed records compressed over the wire, and so on.
  • zfs send -i foo@snap1 foo@snap2 works by, broadly:
    • Looking at snap1's txg and snap2's txg
    • Walking all the metadata for snap2 and, if it was modified after snap1's txg, including it in the incremental send. (A handwavey sketch of this follows the list.)
  • Since txg numbers are per-pool, the GUID of the snapshot is used to make sure that both sides of an incremental send/recv are starting from the same snap1 to transform into snap2.
  • Bookmarks come from the realization that, if we know just the txg of snap1, we don't need to have the original snapshot's data around if we want to just do incremental sends relative to it. So bookmarks, originally, contained effectively just the txg of a snapshot they were made based on, and no "hold" on the old data from that snapshot.
  • This means you can use bookmarks to, say, take a bookmark on src of 6/1's daily snapshot of foo (which you had already replicated from src to dst), then delete src@snap_2022_6_1, and still be able to do zfs send -i src#snap_2022_6_1 src@snap_2022_6_2.
  • They are only useful as an incremental send source - you cannot do full sends with them, nor an incremental receive relative to one if all you have is the bookmark.
  • Theoretically, one could extend zfs send -I or zfs send -R with -i or -I to allow using bookmarks, but it was not originally implemented when the feature was added, and the -I and -R codepaths do rather a lot. So for now, you have to do something like zfs send -i foo#snap_2022_6_1 foo@snap_2022_6_2 | zfs recv ...; zfs send -I foo@snap_2022_6_2 foo@snap_2022_6_15 | zfs recv ... if you wanted to do something like -I replication.
  • Bookmarks have since become slightly more complicated - it turns out it's important to include certain additional metadata when doing send/recv on encrypted datasets, so pools with the bookmark_v2 feature will create bookmarks with this additional metadata when needed, and complain to you if you have older ones around on encrypted datasets until they're gone.
  • Redacted bookmarks are a pretty different beast altogether from other bookmarks - they are used for the redacted_send feature, which allows one to tell ZFS, say, "if you do a redacted send from this redaction bookmark, do not include etc/shadow or home/rich/.my_secret_keys's blocks in it". This works, in practice, by keeping a list of the things that it's not supposed to send, so unlike "ordinary" bookmarks, redaction bookmarks do take up a nominal amount of space based on how much you've told them not to include.
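
A very handwavey Python sketch of the incremental send logic a few bullets up - walk the target snapshot and keep anything born after the source's txg, after checking the GUIDs agree. This is a conceptual illustration only (the names and structure here are mine), not how the real traversal code is organized:

from dataclasses import dataclass

@dataclass
class Block:
    birth_txg: int   # txg in which this block version was written
    data: bytes

def incremental_send(snap2_blocks, from_snap_guid, from_snap_txg, receiver_has_guid):
    # Txg numbers are only meaningful within one pool, so the two sides
    # agree on the starting snapshot/bookmark by its GUID, not by txg.
    if from_snap_guid != receiver_has_guid:
        raise ValueError("receiver does not have the incremental source")
    # Include only blocks modified after the source snapshot/bookmark's txg.
    return [b for b in snap2_blocks if b.birth_txg > from_snap_txg]

blocks = [Block(10, b"unchanged"), Block(15, b"modified"), Block(17, b"new")]
print(incremental_send(blocks, from_snap_guid=42, from_snap_txg=12, receiver_has_guid=42))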

SIMD and woes therein

  • SIMD is, for purposes of this discussion, a general term for using specialized instructions on your CPU which are very fast at very specific data-crunching tasks to implement certain things much faster than might be done otherwise. If you want to go dive into the full meanings involved, feel free
  • FPU manipulation, despite just being a specialized set of instructions and not necessarily SIMD per se, often involves the same kinds of locking and constraints, and so they can often end up bundled together for purposes of discussion and handling.
  • Often, using SIMD instructions requires specialized locking, resources, or modes on your processor to do, and switching to and from these modes can be expensive, so traditionally, many kernels (e.g. Linux) avoid using floating point or SIMD instructions for anything except very specific cases where it's an enormous win.
  • In userland, this is more or less taken care of for you, so you can just go to town and the worst that happens is you crash if you do something explicitly illegal (e.g. some instructions require only aligned memory addresses be given to them, and if you hand them otherwise, it will fault your program, but the world at large won't care).
  • If you attempt to use SIMD instructions from the kernel without the correct handling in place (if e.g. you overrode CFLAGS to remove the bunches of "no, only use the normal least common denominator instructions and don't insert special magic" that Linux specifies), you can end up with mysterious inconsistent crashes in basically any part of userland, since it turns out lots of things (e.g. string or memory read/copy functions, math functions, hash functions...) can use SIMD acceleration in userland, and if you're mangling state in them sometimes, then sometimes things will break in truly astonishing ways. Beware.
  • Linux has increasingly restricted the ability of non-GPL software to invoke its SIMD save/restore functions on e.g. x86, to the point where as of OpenZFS 2.1, we just implemented our own complete save/restore code to use if nothing else is available because Linux had removed our access entirely, which has some performance implications because, not being the kernel, we can't make certain optimizations based on being able to "know" other threads haven't touched the SIMD state and don't need to actually save/restore.
  • It's common for people to include updates which break this in trees that are nominally for "stable bugfixes" only, seemingly out of spite, so you can end up with situations like when Ubuntu shipped a kernel for 2+ months with a ZFS version that could not use SIMD acceleration because the stable tree they updated from had broken it again.
  • Currently, SIMD acceleration gets used for (mostly Linux-specific, TODO FreeBSD has its own list of support):
    • fletcher4 hashing on x86 and ARM
    • raidz parity computations on x86, ARM, and POWER
    • AES and SHA2 calculations on x86 (currently SSE/AVX and AES-NI only, no SHA-NI for you)
    • BLAKE3 hashing on x86, ARM, and POWER
    • technically, the zstd codebase will use BMI2 on x86 if available, for both compression and decompression, but since that doesn't require complex management around it, I'm only mentioning it to preclude someone complaining later that I didn't
  • The easiest way to notice, other than noticing it complaining in config.log, would be to check in /sys/module/zcommon/parameters/zfs_fletcher_4_impl - if you know you have a platform-specific optimized version and only see [fastest] scalar superscalar superscalar4, then you're in for a slow time. (A tiny script for this check follows the list.)
    • (I've been meaning to include a check that complains in zpool status about this, but haven't had time since I thought about it to implement and get it merged.)
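
A tiny Python version of that check (the sysfs path is the one mentioned above and is Linux-specific; the set of "generic" implementation names is my assumption of what counts as non-accelerated):

# Warn if only generic fletcher4 implementations appear to be available.
PATH = "/sys/module/zcommon/parameters/zfs_fletcher_4_impl"
GENERIC = {"cycle", "fastest", "scalar", "superscalar", "superscalar4"}

with open(PATH) as f:
    # The currently selected implementation is shown in [brackets].
    impls = {name.strip("[]") for name in f.read().split()}

accelerated = impls - GENERIC
if accelerated:
    print("accelerated fletcher4 implementations available:", sorted(accelerated))
else:
    print("only generic implementations available - expect slow checksums:", sorted(impls))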

L2ARC devices

  • ZFS supports a "cache" vdev type, which is intended to store data that was likely to be expelled from the ARC soon on faster storage than your main pool (e.g. SSDs).
    • Unlike what some people might expect, this is not a guaranteed process - the L2ARC thread scans the "gonna get booted out soon" lists and adds things from them to the L2ARC at a limited rate. It can't work synchronously (writing to the L2ARC at the moment something is evicted from ARC), or freeing RAM would block on writing to your SSDs, and that's sad for everyone.
  • With OpenZFS 2.0.x and newer, L2ARC is now persistent across reboots - that is, previously, on reboot, it treated the L2ARC device as empty again, and now it does not.
  • You cannot, as of OpenZFS 2.1.x, use zpool trim on an L2ARC device. You can enable a tunable to make it TRIM the next N entries from the L2ARC device as it writes, but since it's a very simple on-disk structure, there's not currently anything implemented to go through and TRIM the whole device at once. (I think, but am not certain, that it could be done; it just hasn't been implemented yet.)
  • L2ARC entries take up a tiny amount of ARC memory for their headers each
    • This amount, unless I'm bad at reading code, varies by pointer width - that is, 32-bit platforms take up one amount of RAM per entry, 64-bit platforms another.
    • (In particular, l2arc_dev_t * is a pointer, and list_node_t is, on Linux, a Linux struct list_head, which is composed of...two pointers. So we get a size difference of 12 bytes per, for 70 bytes per header on 32-bit systems, or 82 per on 64-bit, assuming nobody forced alignment when I wasn't looking.)
  • The above implies that you can't add unlimited amounts of L2ARC, because you'd run out of memory for actual ARC data. (A rough estimate of the overhead follows this list.)
  • I am of the opinion that L2ARC is a reliable improvement only in very specific circumstances - namely, if you have a workload where your data can't all fit in ARC at once, but can fit in ARC+L2ARC at once. Other people disagree, loudly. You may draw your own conclusions.
  • You can add and remove L2ARC devices to your heart's content without restriction.
    • There was a bug in older versions of OpenZFS which could cause a panic if you removed an L2ARC device at just the wrong time, but that's been fixed for a while now.
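
A rough estimate of the header overhead mentioned above, using the per-header sizes quoted there (treat these as ballpark numbers - they vary by platform and version):

def l2arc_header_ram(l2arc_bytes, avg_record_bytes, header_bytes=82):
    # Every record cached in L2ARC keeps a small header in ARC (RAM);
    # 82 bytes is the 64-bit figure above, 70 bytes on 32-bit platforms.
    return (l2arc_bytes / avg_record_bytes) * header_bytes

TiB = 1024 ** 4
GiB = 1024 ** 3
print(l2arc_header_ram(1 * TiB, 128 * 1024) / GiB)  # ~0.64 GiB of ARC for 1 TiB of 128k records
print(l2arc_header_ram(1 * TiB, 8 * 1024) / GiB)    # ~10 GiB of ARC for 1 TiB of 8k records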

wtf zpool scrub

  • ZFS scrub just walks all the metadata in the pool and reads all of it, and reads all the data referenced by the metadata, and implicitly in doing so, checksums all of it, and triggers any auto-repair.
  • This doesn't notice native encryption decryption errors, since the checksums are computed on the encrypted data (otherwise you'd need the keys loaded to scrub the pool).
  • It also doesn't notice, or fix, anything else, because that's literally all it does.
  • Originally, it just tried reading the data in whatever order it ran across it, and then saved a note of the last thing it found every so often for scrub resume.
    • It turns out, spinning disks do really badly with essentially random disk IOs.
    • So now, scrub has been reworked to read the metadata into memory, then group it up into large sequential blobs of IO as much as possible and issue those, so the drives see mostly big sequential IOs, which disks tend to do much better at. (A toy sketch of the idea follows this list.)
    • In the new output, "scanned" means the metadata referencing this much data has been read, and "issued" means "we actually read the data for this much".
  • What does this mean for resuming it later, though?
    • Every (by default) 2 hours, it flushes the entire pending "scanned" list, then saves the old-style marker for how far it got, to avoid needing a feature flag for scrub. Unfortunately, on some pools, this leads to significantly reduced scrub performance for the duration of the "flush", which is often much longer than 2 hours...
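
A toy Python illustration of the reordering idea - collect (offset, size) extents while scanning metadata, sort by offset, and merge adjacent runs so the disks get a few big sequential reads instead of many small random ones. The real code is far more involved; this is just the concept:

def coalesce_extents(extents):
    # extents: iterable of (offset, size) pairs found while scanning metadata.
    merged = []
    for off, size in sorted(extents):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            # This extent touches or overlaps the previous one - extend it.
            prev_off, prev_size = merged[-1]
            merged[-1] = (prev_off, max(prev_size, off + size - prev_off))
        else:
            merged.append((off, size))
    return merged

scanned = [(4096, 4096), (0, 4096), (1048576, 131072), (8192, 4096)]
print(coalesce_extents(scanned))
# [(0, 12288), (1048576, 131072)] - three small reads become one, plus one large read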

No, LZ4 is not magic

  • LZ4 is very fast, but people really like saying it's the only compression algorithm with "early abort" in ZFS, and that's just false.
  • There's two different things at play here.
    • ZFS hands every compression function a buffer 12.5% smaller than the input (the compressor is expected to error out rather than run off the end), and throws out the compressed result if it's not at least that much smaller. (In practice, it gives a same-sized buffer and lies about its size to the compression function.) The resulting rule is sketched after this list.
    • LZ4 (and zstd) both will skip over incompressible blocks in a similar way, but neither of them works the way that people keep describing it.
  • Both LZ4 and zstd do something like "break this input into X-sized chunks; if this chunk has nothing compressible at all within the first Y%, just skip trying to compress it and go to the next chunk", with different sizes and thresholds depending, in zstd's case, on the compression level.
    • The reason the "zstd early abort" feature is very useful is that, in the above description, you're still running the compression step for each chunk, and that is still very expensive on incompressible data, even if it gives up, say, halfway into each 32k chunk or something.
  • You may find Yann Collet's thoughts on the early abort feature interesting reading as well.
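
A sketch of the 12.5% rule as described above (just the rule from the first sub-bullet, nothing more):

def keep_compressed(input_size, compressed_size):
    # The compressor is given a buffer 12.5% smaller than the input;
    # if the output doesn't fit (i.e. didn't save at least 1/8th),
    # the block is stored uncompressed instead.
    max_allowed = input_size - input_size // 8
    return compressed_size <= max_allowed

print(keep_compressed(128 * 1024, 120 * 1024))  # False - only ~6% smaller
print(keep_compressed(128 * 1024, 100 * 1024))  # True  - ~22% smaller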

RAIDZ, dRAID, and checksum errors

  • Oftentimes, people will see checksum errors across most or all of a raidz or draid vdev, and assume something has broken horribly in their disk controllers if the errors are across multiple disks.
  • A key thing here, however, is that if ZFS cannot figure out which disk was wrong (e.g. it didn't get an IO error from a disk, the checksum is wrong, and parity didn't correct it), it will mark a checksum error on every disk the record was on.
  • So if you see read or write errors on everything, then yes, something is probably very wrong with more than just the disks, but checksum errors across parity-based raid might just mean enough disks misbehaved that you can't know who was wrong, so we're blaming everybody.
  • Consider this real example.
    • Once, I had built a system with raidz3 vdevs to test some hardware before building a large system out of it.
    • I wrote a bunch of data from some of our existing datasets to it to see how it compressed on ZFS, how fast it ran, and so on, and then started scrubbing with nothing else going on to see how fast that went.
    • I got a lot of (corrected, since it was raidz3) checksum errors on specific disks, with no IO errors from the disks.
    • Surprised, I cleared and ran a scrub again, and the corrected checksum errors went up again.
    • The root cause turned out to be a firmware bug in the individual disks: if you issued SMART requests to them, they wouldn't actually write out data they had already acknowledged as written but still held in cache, leaving essentially random contents in some parts of records on individual disks.
    • Since whether or not the data was in cache varied across disks, that much redundancy was enough to save me, but since we were polling SMART on essentially every disk at once, any less redundancy would have shown up very differently, as all the disks in the whole record having a checksum error at once, and then it would have looked like something larger was broken, even though it was the individual disks misbehaving.

If you have a request for something to be included on here, or would like to loudly complain that I got something wrong, please feel free to ping me or leave a comment or w/e, the worst that happens is I ignore it or forget it.

WIP WIP WIP
