@jimklimov
Created August 12, 2020 13:42
zfs recv lags discussion
(2020-08-10 02:40:30) jimklimov: cheers all who won't sleep :) doing some ZFS replications and hit that annoyance again about how slow it is to `zfs recv` many small snapshots. Is there some tunable to speed it up a bit? Looking at the speed numbers in the console, it seems that for every incremental snapshot it spends over 60s regardless of size. So empty 312-byte snapshots take a minute at an average of 5 B/s, larger multi-megabyte snaps take 70s, and huge ones average relatively decent-looking speeds in the MB/s range. What can I do to avoid waiting that minute for every snap?
(02:47:40) jimklimov: looking at the consoles, I see it doing zero I/O on source and destination pools (zpool iostat 1), then the send|recv pipe says "receiving incremental stream of X into Y" and there is a burst of I/O for a second, and the pipe logs how it received a few kilobytes in a lot of seconds...
(02:58:43) jimklimov: using Adam's scripts from http://dtrace.org/blogs/ahl/2014/08/31/openzfs-tuning/ I see that the destination pool txg-syncing sits at `0MB of 817MB used` except for every 11th or so line where it has the burst to flush out...
(02:59:59) jimklimov: each line seems to come at a 5-second interval, so I suppose they are the default syncs... and they have nothing to write - zfs recv just sits there and doesn't give the kernel anything to write, it seems.
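For reference, this is roughly the txg-syncing monitor from the linked post, reconstructed from memory, so treat it as a sketch; it prints how much dirty data each transaction group carries for the named pool:

```sh
# Sketch of the txg-syncing DTrace monitor from the linked post (from
# memory); the trailing pool name ("backup") is a hypothetical example.
dtrace -n '
  txg-syncing { this->dp = (dsl_pool_t *)arg0; }
  txg-syncing /this->dp->dp_spa->spa_name == $$1/ {
    printf("%4dMB of %4dMB used",
        this->dp->dp_dirty_total / 1024 / 1024,
        `zfs_dirty_data_max / 1024 / 1024);
  }' backup
```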
(05:04:41) LeftWing: jimklimov: Sounds pretty plausible that it would wait for, but not force out, a txg sync
(05:05:08) LeftWing: You ought to be able to look at the code to see where it is though
(05:05:26) LeftWing: When it is hung, look at the user and kernel stacks for the "zfs recv" process to see where in the code it is waiting
(05:06:32) LeftWing: User stack is pretty easy with pstack. For the kernel stack you'll want to take the pid of the process (e.g., let's say it is 6022) and do something like: mdb -ke '0t6022::pid2proc | ::walk thread | ::findstack -v'
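Put together as commands (pid 6022 is the hypothetical example from the message above):

```sh
# User-mode stack of the stuck `zfs recv` process:
pstack 6022
# Kernel stacks of all its threads:
mdb -ke '0t6022::pid2proc | ::walk thread | ::findstack -v'
```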
(05:28:46) jimklimov: FWIW, I launched a series of other send|recv pipes for different datasets, and all those streams seem to have different "standard overheads", like writing stuff every 10-11 sec for the zoneroot incremental backups, so now the backup disk at least has work to do nearly every second. So this also disproves the wait for TXG sync...
(05:33:07) jimklimov: which leads me to the hypothesis supported by pstack https://pastebin.com/ctU1kLse that it processes something about the destination's snapshots (the first pipe I've sent was for my rootfs with split roots, so there are several nested datasets and hordes of snapshots for each)
(05:34:21) jimklimov: while the data itself should be in the ZFS cache, so those investigations should not hit the disk, the processing still seems quite long... especially for something it is going to do over and over for the hundreds of incremental snapshots it will eventually receive in this operation (if it all survives)
(05:35:14) jimklimov: the later pipes are for zoneroots so there's less to process in terms of ZFS datasets and snapshots
(05:39:26) LeftWing: jimklimov: How many snapshots and datasets are there already in this target pool?
(06:00:45) jimklimov: rough estimate? on the order of 100k snaps over 1k datasets, give or take
(06:00:53) jimklimov: zfs is still counting
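One way to get such counts (the pool name is hypothetical); note that this itself walks every dataset and snapshot, so it can take a while and cause a lot of I/O:

```sh
zfs list -H -r -t filesystem,volume backup | wc -l   # dataset count
zfs list -H -r -t snapshot backup | wc -l            # snapshot count
```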
(06:02:26) LeftWing: Good lord, that is a lot
(06:02:44) LeftWing: The stack you're seeing there suggests it is busily looking through all of them to find a dataset by guid
(06:03:03) LeftWing: Which, if you have a hundred thousand, is probably going to take some time and involve I/O to disk
(06:06:43) jimklimov: or did you mean just children of the replication target? that should be a lot less, about a 100-200 datasets across all rootfs iterations (and their child datasets), but still a lot of snaps
(06:08:05) jimklimov: IIRC there were some optimizations in the past to find certain attributes of datasets quicker for a typical "zfs list" etc.? are guids among those? maybe they should be, if so needed for send/recv :)
(06:09:24) jimklimov: this is just a little laptop doing its dataset per project/workspace thingie, and some znapzend (earlier zfs-auto-snapshots) on top of that to avoid unpleasant typos...
(06:10:17) jimklimov: but a roughly similar thing bites servers doing technologically similar backups, just across somewhat fewer datasets, maybe a few hundred
(06:11:04) jimklimov: well, listing all snapshots to count them does involve a lot of I/O I guess, it's been busy for a while
(06:12:49) jimklimov: for the subset of all rootfs'es (the target is `zfs recv -vue .../ROOT`, to make out the name from what was sent), it has 263 datasets and 27k snaps; that count came back in just 8 minutes
(06:12:53) jimklimov: overall still counting
(06:13:19) jimklimov: or rather, did get 1081 datasets, and I assume about 100 snaps per dataset
(06:13:45) jimklimov: "zfs snapshots are cheap" right? so says the marketing since the beginning of time ;)
(06:27:47) LeftWing: They are almost free to create
(06:27:59) LeftWing: Everything after that is more expensive, the more you have, basically
(06:28:58) jimklimov: makes sense, and matches practice
(06:30:16) jimklimov: but I wonder if I'm the only one to hit such cases and/or moan about them? at the very least, many zoneroots or homedirs, a few `beadm create` leftovers, and some automatic snapshot service are things many people have, I gather...
(06:39:15) LeftWing: I think any serious/mature automated snapshot service absolutely must manage retention, not merely creation
(06:40:04) jimklimov: yes, it does, the matter is how many it keeps - and that is somewhat dictated by my policy
(06:41:00) jimklimov: if I say I want to roll back up to two days in half-hour steps to fix recent errors and fat-fingering, then a month of dailies and 10 years of weeklies, that adds up to a lot of reasonably founded snapshots
(06:41:03) jimklimov: per dataset
(06:41:35) LeftWing: On my file server I have snapshots dating back to 2008, but I have a program that progressively thins them out as they age, so I have around nine thousand in all
(06:42:47) jimklimov: well, again it is a matter of policy, so in the semihypothetical example above, 1 year weekly and 10 years monthly for example would reduce the snapshot load by a lot :-D
(06:43:21) LeftWing: Right, picking a sensible policy is important
(06:44:04) jimklimov: in my case it adds up to about 100 snaps per dataset, which seemed fairly feasible (a couple of days of half-hourlies already does that much)
(06:44:31) jimklimov: and then some less frequent for darker ages
(06:44:55) LeftWing: For what it's worth, I imagine it could be made a lot faster, even if some sort of index had to be created as part of the pool data structure
(06:44:57) jimklimov: a lot of datasets for quotas, independent rollback, sharing, etc. also had their reasons
(06:45:27) jimklimov: and then it multiplied, one by another...
(06:46:28) jimklimov: I guess it could, if we knew well what makes sense to cache and/or persist for the speedup :-)
(06:46:56) jimklimov: in my case, note that there is no disk i/o to read stuff as `zfs recv` waits to write another increment
(06:47:03) jimklimov: all it needs is in ARC
(06:49:22) jimklimov: I suppose for this use-case, a `zfs recv` might keep track of whatever it needs to know about the destination after parsing it once? It needs guids? Get them all, keep them, use them for the next datasets processed from the stream... (well, maybe also ask the kernel or subscribe somehow to know if someone else created another dataset/snapshot in its area of interest)
(06:49:52) jimklimov: it feels like now it collects the data, receives an increment, forgets the data, collects it again...
(06:50:16) jimklimov: while I see how it might expire and become invalid, there gotta be faster ways than to forget and re-collect
(06:50:24) jimklimov: pun intended ;p
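A hedged sketch of the "get them all, keep them" idea at the command level: one recursive pass can dump every snapshot guid under the receive target, instead of re-scanning per increment (the target name is hypothetical):

```sh
# Dump every snapshot guid under the receive target in one pass;
# "backup/ROOT" is a hypothetical target name.
zfs get -r -H -t snapshot -o name,value guid backup/ROOT
```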
(06:50:44) LeftWing: I expect there is a lot about it that could be made faster, yes
(06:50:46) jimklimov: also in many cases it might just not expire
(06:51:23) jimklimov: e.g. if this zfs recv is the only actor changing a branch in zfs tree of datasets/snaps
(06:51:34) LeftWing: Well I think the penalty for caching an incorrect value is: you get to the kernel and realise it is out-of-date, and have to scan again
(06:51:39) LeftWing: (But that is just a guess)
(06:51:51) LeftWing: I don't imagine that would happen a lot in practice
(06:52:08) LeftWing: Because most of the time I imagine the guids that it is looking for were created long ago, and on the remote system
(06:52:37) sensille: jimklimov: is it also slow to send a specific snapshot diff?
(06:52:56) LeftWing: I don't expect it has to do this lookup at all on the sending side
(06:53:03) sensille: libzfs (user mode) is notoriously full of quadratic algorithms
(06:53:04) LeftWing: There, you are likely looking the dataset up by name, which is relatively fast
(06:53:07) sensille: or was, at least
(06:53:11) LeftWing: And the guid is just a property you read
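For example, a single lookup by name is just a property read (the names are hypothetical):

```sh
# Read one snapshot's guid by name; fast, no enumeration needed.
zfs get -H -o value guid tank/ds@snap1
```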
(06:54:17) jimklimov: does the receiver need to find a particular guid (to check if we have the same snap on both sides, so it can skip it, or whether it actually has work to do)?
(06:55:00) LeftWing: This is only a vague recollection, as I have not looked at this in a long time, but: yes, an incremental snapshot includes the guid of the parent snapshot it is a diff from
(06:55:24) LeftWing: So that even if you rename a dataset on either the target or the source system, zfs recv can find the snapshot anyway
(06:55:31) jimklimov: so reading the destination dataset and snaps, you get a list of guids, the one being received is not there, you receive it, for the next one it is still not there... and won't be, neither in the cache nor on disk :\
(06:56:11) jimklimov: so I guess some way of reading them in and then quickly knowing that nothing got added (and/or knowing what was added) is needed to speed this up
(06:56:24) LeftWing: Whether an in-process cache would help seems a lot like it will depend on how many streams are batched together for a single "zfs recv" process
(06:56:57) LeftWing: If you are recursively sending a lot of snapshots in the same stream, an in-process cache seems like it could help
(06:57:27) LeftWing: If not, one imagines a cache in the kernel or in a file would be needed -- or, there may well be a more efficient way to implement the lookup by guid
(06:57:28) jimklimov: sensille: did you mean as in `zfs send -i ds@snap1 ds@snap2`? no, that is fast
(06:58:30) jimklimov: when caching with `zfs send -R -I ... | mbuffer | zfs recv`, the buffer stays full and holds, I suppose, many snapshots' worth of data read from the source and waiting to be written by the receiver
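The pipeline under discussion looks roughly like this (names and buffer size hypothetical); mbuffer decouples the send-side reads from the recv-side writes:

```sh
# Recursive incremental replication with a buffer in the middle;
# dataset names and the 1 GB buffer size are hypothetical examples.
zfs send -R -I tank/ds@old tank/ds@new | mbuffer -s 128k -m 1G | zfs recv -vue backup
```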
(06:59:11) LeftWing: Yes, if you're sending many snapshots through that recursive replication package, I expect keeping a user mode cache would help
(06:59:32) LeftWing: If you want to knock together a patch for that, we can certainly look at it
(06:59:45) LeftWing: Should be pretty easy to see and measure the improvement
(07:00:28) jimklimov: as I learned, C and coding in general are far from my forte :( analysis of what can go wrong is a stronger suit ;)
(07:01:06) jimklimov: and also I have a huge backlog of things to fix in areas I am more able to :\
(07:02:02) sensille: so did you establish the problem is on the user mode side?
(07:02:10) LeftWing: Well, these things only tend to get fixed when someone takes a swing. Analysis is certainly a good first step.
(07:03:29) jimklimov: also, generally an in-kernel cache would be useful for someone doing a series of independent zfs recv's (znapzend tends to; its replication-package mode is not there yet for some reasons), probably with the expiration properties of a cache... though to be useful it would have to be sized so that the guids from the start of a pool don't get pushed out as we read the end of it :)
(07:04:05) LeftWing: Well a cache is one option -- but a more efficient lookup structure is another
(07:04:33) LeftWing: If the guid->snapshot lookup is critical for performance, and it seems like it is, we should probably have a specific mapping table for that translation
(07:04:59) LeftWing: I imagine right now it is, instead, doing the regular dataset/snapshot enumeration which can be somewhat expensive and lead to a lot of random I/O
(07:05:16) jimklimov: yes, just wondering if someone finds it prohibitive to indefinitely hold in memory some 200k guids (approx, assuming my source and backup pools are similar)
(07:05:17) arekinath: pretty sure the walking around looking for the guid is done under the tree you specify as the recv destination first, so most backup programs for big pools just deeply nest and target their dataset structure to avoid it ever walking up the tree looking for one
(07:05:35) sensille: ... and if the guid->snapshot lookup is even necessary for operation. it might just be a sanity check
(07:05:40) arekinath: it sucks most when you have wide levels
(07:06:11) arekinath: guid->snap lookup is definitely needed, at least for incrementals, not just a sanity check :)
(07:06:14) LeftWing: sensille: Also a good question, but I am pretty sure it is needed
(07:06:17) LeftWing: right
(07:06:33) jimklimov: well, even in my subtree for rpool/ROOT I found above that the blast radius is about 30k snapshots on the backup disk
(07:07:11) arekinath: yeah but if you e.g. break off the older snapshots into their own dataset and make the new one you're recv'ing into a clone off one of them
(07:07:12) LeftWing: arekinath: By "wide levels" you mean a lot of snapshots for a particular dataset?
(07:07:14) arekinath: break the snapshots up
(07:07:15) sensille: i never noticed that problem, and we're doing tons of replication on pools with millions of snapshots
(07:07:21) jimklimov: which after the first zfs recv runs is in memory, so all the looking for guids happens there anyway, but iterating takes a while
(07:07:45) arekinath: avoid ever having more than a few thousand "things" in any level of the dataset hierarchy
(07:07:52) arekinath: keeps the recv snapshot finding fast
(07:08:05) LeftWing: sensille: It is possible to structure the pool so you don't hit this
(07:08:33) arekinath: you do like a rename of the current name to thing-old then clone it back from the latest snap into its new name
(07:08:44) arekinath: and now the thing it searches at the first stage of recv is only 1 thing again
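A hedged sketch of the rename-and-clone trick described above (all names hypothetical): park the accumulated snapshots under a renamed dataset and receive into a fresh clone of the newest snapshot, so the guid search at the start of each recv only has a snapshot-light dataset to walk:

```sh
# Park the old snapshot history under a new name...
zfs rename backup/zones/foo backup/zones/foo-old
# ...and clone the latest snapshot back under the original name;
# subsequent incremental recvs target the nearly-snapshot-free clone.
zfs clone backup/zones/foo-old@latest backup/zones/foo
```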
(07:08:56) sensille: also we don't use the zfs tool, only interact with libzfs directly
(07:10:55) jimklimov: arekinath: nice trick and certainly useful for backups, but for "live" datasets the running programs would probably be upset about the unmount/mount involved
(07:11:24) arekinath: ah. yeah, we've only ever had this issue on machines stuffed with backups
(07:11:32) arekinath: try to keep the "running" machines down to only a few snaps
(07:11:53) LeftWing: Well, at any rate, it's not going to get better if nobody takes a swing
(07:12:18) LeftWing: It isn't a problem I'm personally having, but if someone wants to work on it we will always try to help with advice and review and such as we can
(07:12:37) LeftWing: I will note that the only way to get better at C is to write some C and get some review
(07:12:56) arekinath: especially rm's review
(07:13:08) arekinath: ;)
(07:13:52) jimklimov: sure thing, just that with dayjobs and other communities, I'm overwhelmed and thin on time already :( and getting too old to learn new tricks :(
(07:15:50) jimklimov: my best review comment in this regard was from trying to fix some Java code: while the PR worked, it was "too shell-script-like", so eventually some kind soul took the PoC algorithm and rewrote it with modern classes and constructs... :D