Degraded Lvols

Logical Volumes with Missing External Snapshots

The original design of external snapshots allows esnap clone lvols to successfully open before the external snapshot is present. A logical volume with a missing external snapshot is referred to as degraded. It was done this way because:

  1. It is hard for the SPDK administrator to control the order in which examine_config callbacks will be called.
  2. The consumer of an esnap clone may not actually need to perform reads from the external snapshot because the clusters containing the requested blocks may have already been allocated in the esnap clone's blob.
  3. Esnap clones may be at the root of a deep tree of snapshots and clones. It would be complex to delay bringing all of these online until the external snapshot is available.
  4. Immediate registration of bdevs simplifies management of degraded lvols.
    1. bdev_get_bdevs allows degraded lvols to be seen. The module-specific section indicates when an lvol is degraded.
    2. bdev_lvol_delete allows degraded esnap clones and their clones to be deleted. Note there is no "snapshot of an esnap clone" because when an esnap clone vol1 is snapshotted with vol2, vol2 becomes a read-only esnap clone and vol1 becomes a clone of vol2. This is important to be able to clean up an lvolstore when an external snapshot is missing, especially if the external snapshot has been destroyed.
    3. bdev_lvol_rename allows degraded esnap clones to be renamed. This may be important to make the name available for a replacement while the original is being debugged.
  5. Immediate registration reserves the lvol name and UUID so that no one else can squat on it.

Problem statement

The initial solution was to have reads that depend on a missing external snapshot immediately call the completion callback with -EIO. During review, concern was expressed that consumers will not handle EIO well, perhaps causing the consumer to offline sectors or even the entire device.

Potential Solutions

Queue IO

It is believed that external snapshot devices will usually be added shortly after the lvol is registered. Rather than failing the IOs, it would be better to queue them. If the IOs cannot be completed within a timeout period (e.g. 30 seconds), they will be completed with an error.
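
A queued IO might be tracked per channel with something like the following. This is just a sketch; all of these names are illustrative and not part of an existing SPDK API.

#include "spdk/blob.h"
#include "spdk/queue.h"
#include "spdk/thread.h"

/* One queued read waiting for the external snapshot to arrive. The saved
 * arguments are replayed against the real bs_dev on hotplug, or the IO is
 * completed with -EIO once timeout_ticks passes.
 */
struct bs_dev_queue_io {
    TAILQ_ENTRY(bs_dev_queue_io) link;
    uint64_t timeout_ticks;              /* deadline in spdk_get_ticks() units */
    struct iovec *iov;
    int iovcnt;
    uint64_t lba;
    uint32_t lba_count;
    struct spdk_bs_dev_cb_args *cb_args;
};

/* Per-channel context holding the queue, ordered by deadline. */
struct bs_dev_queue_channel {
    TAILQ_HEAD(, bs_dev_queue_io) queued_io;
    struct spdk_poller *timeout_poller;
};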

Replace EIO device with Queue device

The blob_eio blobstore device is no longer needed and will be removed.

A new blob_queue blobstore device will be added. It will interact with the lvol load, missing-esnap tracking, and related parts of vbdev_lvol in much the same way as blob_eio does. The key difference is that the read, readv, and readv_ext callbacks will queue the IO. The replacement of a blob_queue with a blob_bdev is handled in lvs_esnap_missing_hotplug() as follows (a sketch appears after the list):

  1. The new bs_dev is created with lvs->esnap_bs_dev_create().
  2. The blob is frozen to prevent any new IOs from being submitted to the blob.
  3. The queued IOs are submitted to the bs_dev.
  4. The blob's back_bs_dev is replaced with spdk_blob_set_esnap_bs_dev().
  5. The blob is unfrozen.
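
A sketch of lvs_esnap_missing_hotplug() under this design follows. spdk_blob_set_esnap_bs_dev() is the API named above; the freeze, drain, and callback helpers are hypothetical stand-ins for steps 2, 3, and 5.

static void
lvs_esnap_missing_hotplug(struct spdk_lvol *lvol, struct spdk_bs_dev *new_bs_dev)
{
        /* Step 1 happened in the caller: new_bs_dev came from
         * lvs->esnap_bs_dev_create().
         */

        /* Step 2: prevent new IOs from being submitted to the blob. */
        blob_queue_freeze(lvol->blob);

        /* Step 3: replay every queued IO directly against the new bs_dev. */
        blob_queue_drain(lvol->blob, new_bs_dev);

        /* Step 4: swap in the new back_bs_dev. Step 5 (unfreeze) runs in the
         * completion callback.
         */
        spdk_blob_set_esnap_bs_dev(lvol->blob, new_bs_dev, hotplug_done, lvol);
}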

Each channel that has queued IOs will have a poller that is set to fire when the next queued IO times out. This poller's callback will complete all expired IOs in the queue with -EIO.
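
The poller callback might look roughly like this, building on the illustrative structures above; spdk_get_ticks() and the SPDK_POLLER_* return values are real SPDK APIs, everything else is hypothetical.

static int
bs_dev_queue_timeout_poll(void *arg)
{
        struct bs_dev_queue_channel *qch = arg;
        struct bs_dev_queue_io *qio, *tmp;
        uint64_t now = spdk_get_ticks();

        TAILQ_FOREACH_SAFE(qio, &qch->queued_io, link, tmp) {
                if (qio->timeout_ticks > now) {
                        /* Queue is ordered by deadline; nothing further has expired. */
                        break;
                }
                TAILQ_REMOVE(&qch->queued_io, qio, link);
                qio->cb_args->cb_fn(qio->cb_args->channel, qio->cb_args->cb_arg, -EIO);
                free(qio);
        }

        /* Re-arming the poller for the next deadline is left out of this sketch. */
        return SPDK_POLLER_BUSY;
}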

Avoiding examine disk delays

In some cases, queueing would be detrimental. The most apparent case is that of examine_disk() callbacks on a newly registered degraded lvol. Today, examine_disk() is called asynchronously, with all bdev modules that have registered examine_disk() callbacks racing one another. This is subject to change: see spdk#2855.

The fix for spdk#2855 discussed during a bug scrub or community call (or shortly thereafter) was to still call examine_disk() asynchronously, but switch to calling them sequentially. That is, one module's call to spdk_bdev_module_examine_done() would trigger the next to begin its examine_disk(). If an lvol's external snapshot is not available for a long period of time, this could mean (e.g.) 30 seconds of delay for each bdev module's examine_disk(). This would mean that spdk_bdev_wait_for_examine() could be waiting for several minutes, which is likely to be undesirable.

To avoid this, after opening a bdev each module's examine_disk() callback should call:

        spdk_bdev_desc_set_flag(bdev_desc, SPDK_BDEV_IO_FAILFAST);

The following will be added to lvol_read() and lvol_write():

        lvol_io->ext_io_opts.io_flags = spdk_bdev_io_get_desc_flags(bdev_io);

Note that this is needed in lvol_write() as well, because a write can trigger a read when copy-on-write is required.

bs_dev_queue_readv_ext() will contain:

        if (io_opts->io_flags & SPDK_BDEV_IO_FAILFAST) {
                cb_args->cb_fn(cb_args->channel, cb_args->cb_arg, -EIO);
                return;
        }
        /* queue IO */
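
The queueing step itself might continue the body of bs_dev_queue_readv_ext() along these lines, using the per-channel structures sketched earlier (illustrative names; spdk_io_channel_get_ctx(), spdk_get_ticks(), and spdk_get_ticks_hz() are real SPDK calls):

        struct bs_dev_queue_channel *qch = spdk_io_channel_get_ctx(cb_args->channel);
        struct bs_dev_queue_io *qio = calloc(1, sizeof(*qio));

        if (qio == NULL) {
                cb_args->cb_fn(cb_args->channel, cb_args->cb_arg, -ENOMEM);
                return;
        }
        /* Save the read arguments for later replay and arm the deadline. */
        qio->iov = iov;
        qio->iovcnt = iovcnt;
        qio->lba = lba;
        qio->lba_count = lba_count;
        qio->cb_args = cb_args;
        qio->timeout_ticks = spdk_get_ticks() + 30 * spdk_get_ticks_hz();
        TAILQ_INSERT_TAIL(&qch->queued_io, qio, link);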

Open questions

This design strives to keep lvol's handling of missing esnaps out of the blobstore code. This means that queueing could happen both in blobstore.c (queued in an spdk_bs_channel) and in a channel allocated by bs_dev_queue_create_channel(). I don't think this is a problem, but it calls for multiple queue implementations.

Delay registration

With this option, lvol bdevs are not created until the external snapshot is present.

Loading an lvolstore

As an lvolstore is being loaded, two passes are taken. The first iterates all blobs, creating per-blob lvol structures that are added to the lvs->lvols list. During the second pass, the lvols are opened and the corresponding vbdev_lvols are created.

In the first pass, the opens happen in blob ID order using spdk_bs_iter_first() and spdk_bs_iter_next(). When a clone blob is opened, the open of its parent is triggered such that the snapshot (or external snapshot) finishes opening before the clone opens. After a blob is opened, blobstore calls the callback specified with the spdk_bs_iter_* call, which is load_next_lvol(). Before load_next_lvol() loads the next lvol by continuing the iteration, it allocates an spdk_lvol and inserts it at the tail of lvs->lvols. This depth-first algorithm causes a snapshot to appear on lvs->lvols before its clones.
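
A simplified sketch of the first pass; spdk_bs_iter_next() and its callback signature are real blobstore APIs, while the lvol allocation helper is hypothetical.

static void
load_next_lvol(void *cb_arg, struct spdk_blob *blob, int lvolerrno)
{
        struct spdk_lvol_store *lvs = cb_arg;
        struct spdk_lvol *lvol;

        if (lvolerrno == -ENOENT) {
                /* Iteration is complete; the second pass can begin. */
                return;
        }

        /* Appending preserves the depth-first order: a snapshot always
         * lands on lvs->lvols before any of its clones.
         */
        lvol = lvol_alloc_from_blob(lvs, blob);  /* hypothetical helper */
        TAILQ_INSERT_TAIL(&lvs->lvols, lvol, link);

        spdk_bs_iter_next(lvs->blobstore, blob, load_next_lvol, lvs);
}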

In the second pass, _vbdev_lvs_examine_cb() iterates through lvs->lvols and opens the lvols in the order they appear. Because of the order imposed by the first pass, we know that an esnap clone that is itself a snapshot will appear in lvs->lvols ahead of any of its clones. Thus, an esnap clone that fails to open its external snapshot can mark itself as degraded. If a regular clone can find its parent snapshot, it can then determine whether the parent is degraded. Blobstore already has a means for recursively getting data from parent devices, and that can be extended to support this function.

spdk_bs_dev gets a new callback:

struct spdk_bs_dev {
    /* ... */

    bool (*is_healthy)(struct spdk_bs_dev *dev);
};

Existing bs_dev modules get an is_healthy() implementation that returns true.

When an external snapshot is missing, the blob's back_bs_dev will be a minimal blobstore device that returns false to the is_healthy() callback.
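
For example, the two trivial implementations might look like this sketch:

/* Existing bs_dev modules: always healthy. */
static bool
bs_dev_is_healthy(struct spdk_bs_dev *dev)
{
        return true;
}

/* Minimal placeholder device installed while the external snapshot is
 * missing.
 */
static bool
bs_dev_missing_is_healthy(struct spdk_bs_dev *dev)
{
        return false;
}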

Blobstore gets a new function:

bool
spdk_blob_is_healthy(struct spdk_blob *blob)
{
        if (blob->back_bs_dev == NULL) {
                return true;
        }
        return blob->back_bs_dev->is_healthy(blob->back_bs_dev);
}

Missing devices are then largely handled as they are in the code currently under review. The following changes are needed:

  1. Rather than loading an EIO bs_dev, an unhealthy bs_dev is loaded.
  2. Do not register a bdev when spdk_blob_is_healthy() returns false. This catches esnap clones and dependents.
  3. When a missing bdev appears, the examine_config() callback will trigger recursive iteration of the lvol's clones to cause bdevs to be created for the now healthy lvols, as sketched below.
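
That recursive walk might look roughly like the following sketch. spdk_blob_get_clones() and SPDK_COUNTOF() are real SPDK APIs; the lvol lookup and bdev creation helpers are hypothetical.

#include "spdk/util.h"

static void
lvs_create_healthy_bdevs(struct spdk_lvol_store *lvs, struct spdk_lvol *lvol)
{
        spdk_blob_id ids[64];   /* illustrative fixed bound */
        size_t count = SPDK_COUNTOF(ids);
        size_t i;

        if (!spdk_blob_is_healthy(lvol->blob)) {
                /* Still degraded; try again on the next hotplug event. */
                return;
        }
        lvol_create_bdev(lvol); /* hypothetical: registers the vbdev */

        if (spdk_blob_get_clones(lvs->blobstore, lvol->blob_id, ids, &count) != 0) {
                return;
        }
        for (i = 0; i < count; i++) {
                lvs_create_healthy_bdevs(lvs, lvs_lvol_from_blob_id(lvs, ids[i]));
        }
}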

Bdev namespace reservation

Currently, the lvolstore prevents a new lvol from being created when its name collides with a name on the lvs->lvols list. No additional work is needed. In theory, someone could create a non-lvol bdev with a name that collides with <lvs>/<vol>, but that seems pretty unlikely.

The UUID will not be reserved, but that is of minimal risk.

Listing degraded lvols

A new API will be added: bdev_lvol_get_lvols. Despite its name starting with bdev, it will iterate lvs->lvols and dump the driver_specific data for each lvol. This will include the result of spdk_blob_is_healthy() in a new boolean healthy key. The other changes added so far for external snapshots will remain in this JSON output. See vbdev_lvol_dump_info_json().
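
The output for a degraded esnap clone might look something like this hypothetical example; the field names other than healthy are meant to mirror the existing driver_specific lvol output:

[
  {
    "name": "lvs1/clone1",
    "uuid": "...",
    "lvol_store_uuid": "...",
    "esnap_clone": true,
    "healthy": false
  }
]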

Degraded lvol delete and rename

Delete and rename can likely proceed as they currently do, but will skip the spdk_bdev_*() calls.

Other lvol operations

An lvol that is not healthy (per spdk_blob_is_healthy(lvol->blob)) will not allow new snapshots or clones and cannot be resized, inflated, decoupled, or made read-only.
