OpenZFS #15526 deep-dive

OpenZFS #15526 was fixed by improving dnode_is_dirty check in #15571. This note deep-dives into the root cause, triggers, history, and remaining problems with that fix.

Recap

As described in @rincebrain's explanation, tools like cp use llseek(..., SEEK_DATA or SEEK_HOLE) to preserve sparseness in files. Due to a race checking if a file is dirty, llseek could return incorrect offsets that skip over data in newly-written files. When this happens, cp records holes where there should be data and the resulting file is incorrect. #15571 updates the dirty-object check dnode_is_dirty to fix this problem.

Update: #15615 is a new PR by @robn that uses dn_dirty_txg to detect dirty dnodes.

Background

Copying tools that preserve sparseness can use llseek in a loop alternating between SEEK_HOLE and SEEK_DATA to find each hole or data block in the file. When llseek returns an ENXIO error for SEEK_DATA, this signals that there is no further data in the file after the given offset. If SEEK_DATA returns an incorrect offset or an ENXIO error before the end of data is reached, the copying tool will treat the skipped part of the file as a hole. (The opposite case also exists, but treating a hole as data has fewer consequences.)
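
For concreteness, here is a minimal sketch of that loop as a standalone userspace C program (an illustration, not the coreutils implementation):

#define _GNU_SOURCE     /* SEEK_DATA / SEEK_HOLE */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return (1);
        }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) {
                perror("open");
                return (1);
        }
        off_t off = 0;
        for (;;) {
                /* Next data extent at or after off; ENXIO means no more data. */
                off_t data = lseek(fd, off, SEEK_DATA);
                if (data < 0) {
                        if (errno != ENXIO)
                                perror("SEEK_DATA");
                        break;
                }
                /* End of that data extent (the next hole or end of file). */
                off_t hole = lseek(fd, data, SEEK_HOLE);
                if (hole < 0) {
                        perror("SEEK_HOLE");
                        break;
                }
                printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
                off = hole;
        }
        close(fd);
        return (0);
}

If SEEK_DATA skips past real data or reports ENXIO too early, the loop above silently records that span as a hole, which is exactly the corruption seen in #15526.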

zpl_llseek implements llseek by calling dmu_offset_next via zfs_holey and zfs_holey_common. dmu_offset_next is itself a wrapper around dnode_next_offset which locates the next data block (or hole) in an object.

dnode_next_offset works by inspecting the block pointer tree to find the next pointer at each level of the tree that points to data or a hole (possibly indirectly). It calls dnode_next_offset_level to scan block pointers at each level, stopping at the first non-hole pointer (if looking for data) or the first pointer with blk_fill less than the maximum number of leaf blocks for that level (if looking for holes). The search starts at the L1 pointer of the leaf block containing the starting offset and proceeds as an in-order traversal over the tree. The search visits each level of the tree at most twice since block pointers at each level are sufficient to tell if there are L0 leaf blocks (or holes) reachable via those pointers.

Block pointers are only updated from sync context while writing out in-memory data. Thus dnode_next_offset only returns correct results if the on-disk tree matches the file structure including number of levels, recordsize, etc. The results will be incorrect if there are dirty buffers that add (or free) blocks that have not been synced out.

dmu_offset_next avoids calling dnode_next_offset while dirty data exists by forcing a pool sync whenever dnode_is_dirty detects the object is dirty and zfs_dmu_offset_next_sync=1. Otherwise dmu_offset_next returns EBUSY to give up on hole detection, in which case llseek returns the given offset for SEEK_DATA or the file size for SEEK_HOLE, indicating that the remaining file is entirely data. Additionally, dmu_offset_next cannot tolerate concurrent writes to the object that would re-dirty it, so zfs_holey_common also locks the entire file with a range lock.
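
That fallback can be sketched as a toy userspace model (not the actual zfs_holey_common code; offset_next_model and holey_model are made-up names) showing how an EBUSY result collapses into "the rest of the file is data":

#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for dmu_offset_next(): pretend the object is dirty and
 * zfs_dmu_offset_next_sync=0, so hole detection is refused. */
static int
offset_next_model(bool hole, uint64_t *off)
{
        (void) hole;
        (void) off;
        return (EBUSY);
}

/* Simplified version of the EBUSY handling described above: SEEK_DATA
 * keeps the caller's offset, SEEK_HOLE reports the end of the file. */
static int
holey_model(uint64_t file_size, bool hole, uint64_t *off)
{
        int error = offset_next_model(hole, off);
        if (error == EBUSY) {
                if (hole)
                        *off = file_size;       /* "no holes before EOF" */
                return (0);                     /* SEEK_DATA: *off unchanged */
        }
        return (error);
}

int
main(void)
{
        uint64_t off = 4096;
        (void) holey_model(1 << 20, true, &off);
        printf("SEEK_HOLE on a dirty object reports %llu (the file size)\n",
            (unsigned long long)off);
        return (0);
}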

Before #15571, dnode_is_dirty would in rare cases return FALSE even when the object was not fully synced out. This leads to dnode_next_offset computing offsets using on-disk block pointers before (or while) they are updated during sync. Depending on the state of the on-disk pointers, the results can skip over data.

Failure modes

If dnode_next_offset is called on an object before it is synced, the results depend on whether or not the object has been synced before, whether any blocks have been added or freed, and if so whether or not sync has updated the block pointers.

  • If the object is new and never synced, then the object will appear to contain no data.
  • If called after an object has been synced once, then dnode_next_offset will reflect the on-disk blocks ignoring in-memory dirty writes or frees.
  • If no blocks have been added (or freed) vs. on-disk state, then the results will be correct even if some blocks are dirty.
  • If blocks have been added (or freed) but the block pointers have not been updated, dnode_next_offset will skip over those blocks (or holes).
  • If blocks have been added (or freed) and the sync has reached the point of updating all block pointers, then the results will be correct even if sync has not completed.

If an object is dirtied in multiple txgs, the above applies for each txg. If any txg adds or frees blocks, the results of llseek will be incorrect until the most-recently dirtied txg reaches sync and updates the block pointers.

Block pointers get updated in the write pipeline before vdev writes, so the window of time that results are incorrect is not dependent on storage performance.

Note that dnode_next_offset's reads of block pointers in indirect blocks are atomic, as they are guarded by db_rwlock. For block pointers in the dnode itself, dnode_next_offset does not hold any lock over dn_phys->dn_blkptr, and these are racy reads. On weakly ordered machines, it is theoretically possible that dnode_next_offset_level observes partially written block pointers even if sync has already completed.

Syncing details

In open context, dirtied dnodes in each object set are added to os_dirty_dnodes. Dirtied data and indirect blocks are added to trees of dirty records in each dnode rooted by dn_dirty_records. These trees follow the same structure as the to-be-written block trees with the dnode pointing at dirty records for top-level blocks and dirty indirect blocks pointing to their dirty children recursively in the corresponding indirect buffer dr_children list.

In sync context, dirty dnodes are written by dmu_objset_sync_dnodes. This first removes each dnode from os_dirty_dnodes and adds it to os_synced_dnodes using the same link (dn_dirty_link). Then it calls dnode_sync to write the object including updating dn_phys, growing indirection, freeing leaf ranges, and writing dirty buffers.

Dirty buffers are issued by dnode_sync calling dbuf_sync_list for each dn_dirty_records recursively issuing writes for all blocks of the tree. dbuf_sync_list removes each dirty record from the parent dnode or indirect child list then issues the leaf write or recursive indirect block writes. Sync context does not hold any locks for the dirty record tree traversal because all updates are completed in quiescing context before sync starts.

Note that the dnode is removed from os_dirty_dnodes before syncing the dnode itself and dirty buffers are removed from dn_dirty_records before starting writes for blocks pointed to by the dnode. Block pointers in the dnode or indirect blocks are not updated in memory until the ZIO pipeline asynchronously compresses the block, allocates space, and invokes the dbuf_write_ready callback.

All write ZIOs created for a dnode are children of the dnode's buffer's ZIO, previously created when dmu_objset_sync calls dnode_sync on the meta-dnode. These ZIOs are in turn children of the logical root ZIO created in dsl_pool_sync. After issuing writes for all datasets, dsl_pool_sync waits on the root ZIO, which in turn waits for all child writes including all dnode, data, and indirect block writes. After this wait, all dataset block pointers are guaranteed to be updated, and dsl_pool_sync calls dmu_objset_sync_done for each dataset, which updates userquota (if enabled) and removes each synced dnode from the os_synced_dnodes list.

Root cause

The root cause of #15526 is a race in dnode_is_dirty between checking if a dnode is dirty and scanning its block tree in dnode_next_offset.

Before #15571, dnode_is_dirty checks only that dn_dirty_link is active to determine if the dnode has dirty buffers. This link is used in both the os_dirty_dnodes and os_synced_dnodes lists.

Checking dn_dirty_link is almost correct. This link is set when the dnode is dirtied in dnode_setdirty by adding the node to os_dirty_dnodes. During dmu_objset_sync_dnodes the link is reused for os_synced_dnodes. It is cleared by dmu_objset_sync_done removing the dnode from os_synced_dnodes after waiting on the root ZIO. This wait guarantees the block pointers are updated.

However, there is a small window of time during dmu_objset_sync_dnodes after the dnode is removed from os_dirty_dnodes but before it is added to os_synced_dnodes. This is the root cause of the issue.

In this window of time dnode_is_dirty may observe dn_dirty_link as inactive and thus return that the object is clean when it is not. The move between lists is not protected by any mutex held by dnode_is_dirty or dmu_offset_next: dmu_objset_sync_dnodes only holds the sublist mutex for os_dirty_dnodes at this point and dnode_is_dirty holds dn_struct_rwlock and dn_mtx.

The time period where dn_dirty_link is inactive while moving between lists appears to be only a few instruction cycles long, but os_synced_dnodes is a multilist with its own child list mutexes that must be acquired for multilist_insert. List insertion is never contended, as each multilist child list of os_dirty_dnodes is synced in a separate task, which in turn writes to the same child list index of os_synced_dnodes since both use dnode_hash. However, on Linux kernels compiled with preemption support, the mutex acquisition for the child list mutex in multilist_insert is a preemption point even when uncontended and may sleep. Under sufficient load the kernel may preempt the sync thread for one or more scheduling quanta. This leaves a much longer window in which dn_dirty_link is not on either list, making the race significantly easier to hit.

The mutex sleep is a confounding cause and explains why reproducing the race is non-deterministic. The kernel's scheduler plays a role in preempting dmu_objset_sync_dnodes unpredictably, and this preemption is triggered by other load on the system. This explains why multiple files or threads are necessary to reproduce the bug: one thread triggers sync and another calls llseek while the sync thread handling the corresponding file's object is preempted.
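
The race is easiest to see in stripped-down form. The following toy userspace model is not ZFS code: the two flags stand in for dn_dirty_link membership in os_dirty_dnodes and os_synced_dnodes, the usleep stands in for the sync thread being preempted inside multilist_insert, and the checker plays the role of the pre-fix dnode_is_dirty:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

static atomic_int on_dirty_list = 1;    /* "dn_dirty_link is on os_dirty_dnodes" */
static atomic_int on_synced_list = 0;   /* "dn_dirty_link is on os_synced_dnodes" */
static atomic_int done = 0;

/* Plays dmu_objset_sync_dnodes: move the item from one list to the other. */
static void *
mover(void *arg)
{
        (void) arg;
        atomic_store(&on_dirty_list, 0);        /* removed from os_dirty_dnodes */
        usleep(100 * 1000);                     /* preempted acquiring the sublist mutex */
        atomic_store(&on_synced_list, 1);       /* multilist_insert into os_synced_dnodes */
        atomic_store(&done, 1);
        return (NULL);
}

/* Plays the pre-#15571 dnode_is_dirty: dirty iff the link is on either list. */
static void *
checker(void *arg)
{
        (void) arg;
        while (!atomic_load(&done)) {
                if (!atomic_load(&on_dirty_list) && !atomic_load(&on_synced_list))
                        printf("looks clean, but sync has not updated block pointers yet\n");
                usleep(10 * 1000);
        }
        return (NULL);
}

int
main(void)
{
        pthread_t m, c;
        pthread_create(&c, NULL, checker, NULL);
        pthread_create(&m, NULL, mover, NULL);
        pthread_join(m, NULL);
        pthread_join(c, NULL);
        return (0);
}

The checker reliably observes the "clean" state during the simulated preemption; in ZFS that observation is the moment dmu_offset_next trusts stale on-disk block pointers.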

Beyond this, the plain reads of dn_dirty_link in dnode_is_dirty are also racy. dnode_is_dirty holds dn_mtx while dnode_setdirty holds the os_dirty_dnodes sublist lock for insertion, dmu_objset_sync_dnodes holds os_dirty_dnodes for removal before holding os_synced_dnodes sublist lock for reinsertion, and dmu_objset_sync_done holds the os_synced_dnodes sublist lock for final removal. The file-level rangelock synchronizes the call to dnode_is_dirty vs. dnode_setdirty, but no locks guarantee the ordering of observing an inactive dn_dirty_link vs. writes to dnode or indirect block pointers.

The fix

PR #15571 mitigates the problem by checking both dn_dirty_link and dn_dirty_records in dnode_is_dirty. This is effective because, whenever leaf blocks are being added, dn_dirty_records is non-empty during the window in dmu_objset_sync_dnodes where dn_dirty_link is inactive. Thus the case where dn_dirty_link is temporarily inactive while the sync thread is preempted is covered by the dirty-record check.
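
Paraphrasing the change (see #15571 / 30d581121b for the exact diff), the check now has roughly this shape:

/* Approximate shape of dnode_is_dirty after the fix (paraphrased). */
boolean_t
dnode_is_dirty(dnode_t *dn)
{
        mutex_enter(&dn->dn_mtx);

        for (int i = 0; i < TXG_SIZE; i++) {
                /* Dirty if the dnode is on a dirty/synced list OR has dirty records. */
                if (multilist_link_active(&dn->dn_dirty_link[i]) ||
                    !list_is_empty(&dn->dn_dirty_records[i])) {
                        mutex_exit(&dn->dn_mtx);
                        return (B_TRUE);
                }
        }

        mutex_exit(&dn->dn_mtx);
        return (B_FALSE);
}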

Note that dn_dirty_records still becomes empty before the sync updates all indirect pointers as the ZIO pipeline updates them asynchronously in callbacks. The fix is effective only because dn_dirty_records is non-empty in the window where dn_dirty_link is temporarily inactive, and dn_dirty_link is providing the main synchronization with zio_wait in dsl_pool_sync.

Also note that while dn_mtx protects dn_dirty_records and is held by both dbuf_dirty and dnode_is_dirty, dnode_sync does not hold dn_mtx while dbuf_sync_list removes records from dn_dirty_records. So the read in dnode_is_dirty is still racy with respect to sync context clearing it despite holding dn_mtx.

This means that even after the fix, dnode_is_dirty still contains data races. While the file-level rangelock and dn_mtx guarantee that dnode_is_dirty will observe dbuf_dirty's writes to both dn_dirty_link and dn_dirty_records, there is no guarantee that the thread calling dnode_is_dirty will not observe the write clearing dn_dirty_records after observing the first write clearing dn_dirty_link but before the second write setting dn_dirty_link again, even on strongly-ordered architectures.

However, while the data races remain, the large window of opportunity caused by mutex preemption is fixed. The remaining data races are theoretically problematic but much less likely to occur in practice, especially given the number of stores (and locks) that occur between setting dn_dirty_link for os_synced_dnodes and clearing dn_dirty_records, even in the trivial case of a 1-block file.

History

The details surrounding dnode_is_dirty have gone through many iterations in OpenZFS history:

  • 34dc7c2f25 c. 2008: Initial Linux ZFS GIT Repo

    • dmu_offset_next checks dn_dirty_link without holding any locks.
    • dmu_objset_sync_dnodes clears dn_dirty_link before calling dnode_sync then releases the sync txg hold on the dnode.
    • In this version, the window for the race is quite wide, and dmu_offset_next will return incorrect results anytime after sync starts until all block pointers are updated.
    • But in this version llseek isn't implemented - only _FIO_SEEK_HOLE.
  • 9babb37438 c. 2009: Rebase to b117

    • Adds userquota support.
    • With userquota enabled, dmu_objset_sync_dnodes moves the dnode to a new list reusing dn_dirty_link until after zio_wait. Otherwise dmu_objset_sync_dnodes clears dn_dirty_link before dnode_sync.
    • Reduces the race for userquota datasets to a few instructions wide. (There is no mutex preemption point yet.)
  • 3558fd73b5 c. 2011: Prototype/structure update for Linux

    • Removes _FIO_SEEK_HOLE and related zfs_holey code.
  • 802e7b5feb c. 2013: Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()

    • Adds llseek support for seeking data/holes.
    • Prior commit cf91b2b6b2 brings back zfs_holey.
  • 64fc776208 2017-03-20: OpenZFS 7968 - multi-threaded spa_sync()

    • Introduces multilists and parallelizes dnode_sync during sync.
    • After this os_dirty_dnodes is protected by multilist lock instead of os_lock.
    • os_synced_dnodes becomes a multilist exposing the mutex preemption opportunity.
    • Widens the race window for userquota datasets on kernels with preemptible mutex_enter.
    • This version also forgets to update dn_dirty_link checks to multilist_link_active (but list_link_active happens to be equivalent).
  • 66aca24730 2017-04-13: SEEK_HOLE should not block on txg_wait_synced()

    • Changes the dmu_offset_next dirty check to dn_dirtyctx and dn_dirty_records.
    • Makes txg_wait_synced conditional on zfs_dmu_offset_next_sync.
    • Both reads are racy (they should hold dn_mtx), but sync does not hold dn_mtx anyway.
    • dn_dirtyctx is not a reliable signal of dirtiness as it stays set until the in-memory dnode object is destroyed. (Its purpose is for assertion checks that objects are dirtied in exactly one of open or sync context.)
    • Greatly widens the race for userquota datasets: dn_dirty_records is cleared immediately after issuing writes, whereas dn_dirty_link is cleared only after userquota updates (after the root zio_wait).
    • The race exists even with zfs_dmu_offset_next_sync=0, as that setting only controls whether a sync is forced when the object appears dirty; the faulty dirty test still gates hole reporting.
  • 454365bbaa 2017-11-15: Fix dirty check in dmu_offset_next()

    • First attempt at fixing reports of files filled with \0s.
    • Removes the dn_dirtyctx check and moves back to dn_dirty_link test.
    • Narrows the race for userquota datasets back to the mutex preemption window present in 64fc776208.
  • edc1e713c2 2018-04-10: Fix race in dnode_check_slots_free()

    • Part of a chain of fixes for a similar sync-vs.-open-context race when reallocating freed dnodes.
    • Interestingly, while dnode reallocation is fundamentally the same (open context needs to synchronize with sync context), none of the fixes were applied to dmu_offset_next. These fixes ultimately introduced dn_dirty_txg and matching DNODE_IS_DIRTY macro comparing dn_dirty_txg to spa_synced_txg in edc1e713c2. (It should have been using dn_holds == 0 which was added back in 58769a4ebd, which is unfortunately not suitable for dnode_is_dirty.)
    • Incidentally fixes the misuse of list_link_active for multilist link dn_dirty_link in dmu_offset_next (but multilist_link_active is equivalent).
  • ec4f9b8f30 2019-03-21: Report holes when there are only metadata changes

    • Attempts to implement hole reporting when only the dnode metadata (atime) changes.
    • Switches to checking dn_dirty_link and that dn_dirty_records contains a record other than spill and bonus buffers.
    • Adds dn_mtx lock to dmu_offset_next for the first time.
    • Adds check in dmu_offset_next for non-empty dn_free_ranges (only for L0 buffers referenced by the dnode - indirect blocks are dirtied in dnode_free_range).
    • This makes the race trigger if either dn_dirty_records is cleared or if the mutex preemption case happens for dn_dirty_link.
    • Widens the race window again for userquota datasets to happen anytime after sync starts (like for non-userquota datasets).
  • 2531ce3720 2019-05-30: Revert "Report holes when there are only metadata changes"

    • Reverted ec4f9b8f30 after #8816 reports more \0 filled files.
  • ba67d82142 2020-12-11: Improve zfs receive performance with lightweight write

    • Lightweight writes are not related but this changes dmu_objset_sync_dnodes to unconditionally use os_synced_dnodes.
    • This makes the race the same for userquota and non-userquota datasets (narrow mutex preemption window).
  • de198f2d9 2021-11-07: Fix lseek(SEEK_DATA/SEEK_HOLE) mmap consistency

    • This primarily fixes missing mmap support in zfs_holey_common by flushing before dmu_offset_next.
    • Introduces dnode_is_dirty factored out of dmu_offset_next, but changes the implementation back to using dn_dirty_records.
    • Updates dmu_offset_next to re-check if the object is dirty again after txg_wait_synced. (Rangelocks should make this unnecessary?)
    • Re-widens the race for all datasets to any time after dnode_sync clears dn_dirty_records.
  • c23803be84 2021-11-10: Restore dirty dnode detection logic

    • Updates dnode_is_dirty to go back to dn_dirty_link.
    • Narrows the race back to the mutex preemption window.
  • 05b3eb6d23 2021-11-30: Default to zfs_dmu_offset_next_sync=1

    • Changes to prefer blocking on sync over reporting no holes.
    • Motivated by confusion over the expected behavior.
    • No effect on the race except perhaps forcing more syncing in the presence of SEEK_HOLE-heavy workloads.
  • 519851122b 2023-03-14: ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced

    • Adds back special-casing for the bonus and spill blocks like ec4f9b8f30 attempted.
    • Switches to dn_dirty_records check, re-widening the race.
  • accfdeb948 2023-04-18: Revert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced"

    • #14753 reports data corruption.
    • Reverts back to dn_dirty_link check restoring the narrower race condition.
  • 30d581121b 2023-11-28: dnode_is_dirty: check dnode and its data for dirtiness

    • The fix for #15526.

And prior to the OpenZFS fork, see also c. 2006 illumos c543ec060d which introduces dn_dirty_link. Previously dmu_offset_next checked dn_dirtyblksz without locks suggesting that the race exists in that version as well.

Reproducing

Adding a delay to dmu_objset_sync_dnodes between removing the dnode from os_dirty_dnodes and inserting it into os_synced_dnodes will reliably recreate the race using @tonyhutter's reproducer.sh modified to write 10 files instead of 1000 (because the delay makes syncing very slow). This proves that mutex preemption is the trigger for the race.

diff --git a/module/zfs/dmu_objset.c b/module/zfs/dmu_objset.c
index f098e1daa..8f560bb13 100644
--- a/module/zfs/dmu_objset.c
+++ b/module/zfs/dmu_objset.c
@@ -1582,6 +1582,8 @@ dmu_objset_sync_dnodes(multilist_sublist_t *list, dmu_tx_t *tx)
                 */
                multilist_t *newlist = &dn->dn_objset->os_synced_dnodes;
                (void) dnode_add_ref(dn, newlist);
+
+               delay(MSEC_TO_TICK(100));
                multilist_insert(newlist, dn);
 
                dnode_sync(dn, tx);

Adding a delay during the async issue ZIO pipeline stage will reliably reproduce the race if dnode_is_dirty or dmu_offset_next checks only dn_dirty_records. This simulates the race in older versions and proves that the race is dependent on the ZIO pipeline computing block pointers.

diff --git a/module/zfs/dnode.c b/module/zfs/dnode.c
index 029d9df8a..f2efe9c2c 100644
--- a/module/zfs/dnode.c
+++ b/module/zfs/dnode.c
@@ -1786,7 +1786,7 @@ dnode_is_dirty(dnode_t *dn)
        mutex_enter(&dn->dn_mtx);
 
        for (int i = 0; i < TXG_SIZE; i++) {
-               if (multilist_link_active(&dn->dn_dirty_link[i])) {
+               if (!list_is_empty(&dn->dn_dirty_records[i])) {
                        mutex_exit(&dn->dn_mtx);
                        return (B_TRUE);
                }
diff --git a/module/zfs/spa.c b/module/zfs/spa.c
index 5161326c0..9c7d02e3d 100644
--- a/module/zfs/spa.c
+++ b/module/zfs/spa.c
@@ -1241,12 +1241,41 @@ spa_taskq_dispatch_select(spa_t *spa, zio_type_t t, zio_taskq_type_t q,
        return (tq);
 }
 
+typedef struct issue_delay {
+  taskq_t *tq;
+  task_func_t *func;
+  void *arg;
+  uint_t flags;
+  taskq_ent_t *ent;
+} issue_delay_t;
+
+static void
+spa_taskq_dispatch_delay(void *arg)
+{
+  issue_delay_t *d = arg;
+  taskq_dispatch_ent(d->tq, d->func, d->arg, d->flags, d->ent);
+  kmem_free(d, sizeof(*d));
+}
+
 void
 spa_taskq_dispatch_ent(spa_t *spa, zio_type_t t, zio_taskq_type_t q,
     task_func_t *func, void *arg, uint_t flags, taskq_ent_t *ent,
     zio_t *zio)
 {
        taskq_t *tq = spa_taskq_dispatch_select(spa, t, q, zio);
+       if (q == ZIO_TASKQ_ISSUE) {
+         issue_delay_t *d = kmem_alloc(sizeof (issue_delay_t), KM_NOSLEEP);
+         if (d != NULL) {
+           d->tq = tq;
+           d->func = func;
+           d->arg = arg;
+           d->flags = flags;
+           d->ent = ent;
+           clock_t delay = ddi_get_lbolt() + MSEC_TO_TICK(100);
+           taskq_dispatch_delay(tq, spa_taskq_dispatch_delay, d, flags, delay);
+           return;
+         }
+       }
        taskq_dispatch_ent(tq, func, arg, flags, ent);
 }

Other issues

dnode_next_offset is used in other callsites, and it's unclear if those have issues or not.

  • dmu_object_alloc_impl could miss already-allocated objects leading to allocation collisions
  • dmu_objset_space_upgrade could miss dirty objects in the meta-dnode
  • dmu_tx_count_free will not count blocks dirtied in previous txgs
  • dmu_free_long_range_impl will size chunks without considering blocks dirtied in previous txgs

dmu_offset_next forcing txg syncs (so that dnode_next_offset returns valid results) remains a surprising performance issue. coreutils-9.2 defaults to using sparse file detection, which exacerbates this. Furthermore, ZFS reports only the on-disk (allocated) size in stat's block count, so all newly-written and compressed files will appear sparse even if they contain no holes.
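
The apparent-sparseness effect is easy to observe with a small standalone check comparing the logical size to the allocated size reported by stat (an illustration of the heuristic sparse-aware tools use, not the coreutils code):

#include <stdio.h>
#include <sys/stat.h>

int
main(int argc, char **argv)
{
        for (int i = 1; i < argc; i++) {
                struct stat st;
                if (stat(argv[i], &st) != 0) {
                        perror(argv[i]);
                        continue;
                }
                long long size = (long long)st.st_size;
                long long alloc = (long long)st.st_blocks * 512;  /* st_blocks is in 512-byte units */
                printf("%s: size=%lld allocated=%lld%s\n", argv[i], size, alloc,
                    alloc < size ? " (looks sparse)" : "");
        }
        return (0);
}

On ZFS, a freshly written (not yet synced) or well-compressed file reports fewer allocated bytes than its logical size and therefore "looks sparse", prompting tools to walk it with SEEK_DATA/SEEK_HOLE.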

Summary

  • The underlying race has existed in some form for the entire history of OpenZFS.
  • The window of opportunity for the race has varied, but was approximately one of:
    • a very small number of instruction clock cycles
    • possible mutex preemption during dmu_objset_sync_dnodes
    • anytime after dnode_sync until all block pointers are updated
  • Before ba67d82142, the window was smaller for userquota datasets vs. non-userquota.
  • The race is dependent on scheduling delays and compute performance (compression).
  • The race is not dependent on I/O performance (block pointers update before vdev issue).
  • dnode_next_offset does not return correct results until all block pointers are updated.
  • dnode_is_dirty and dnode_next_offset_level still have data races (in theory).