CephFS-CSI Integration: Supported features and gaps

CephFS integration with CSI

This document aims to list, at a high level, CSI [1] requirements and how these are met by existing CephFS features, or are features that need to be developed or improved in CephFS. It also covers features that are needed in ceph-csi, to close the gap between RBD and CephFS integrations.

NOTE: This is intended to serve as a running document, and will be modified as the CSI specification and its implementation in ceph-csi [2] evolve over time, or as more insight is gleaned into the existing requirements.

Summary

This section captures the gaps and potential future requirements (and associated trackers) based on the analysis in the subsequent sections. For non-CSI implementors, reading through this section should be sufficient to understand the features and gaps.

The subsequent sections, starting from Feature classification, should be of interest to CSI implementors, and provide more detailed background for the requirements summarized here.

  1. Snapshots and clones

    • MANDATORY CephFS snapshot should be independent of the subvolume being snapshotted, and support parent volume deletes
      • The desired independence is only from the perspective of volume deletes when a snapshot of the volume exists, and the ability to clone a snapshot when the parent volume is deleted.
      • [TODO] Investigate if this is already the case, and if the parent volume is moved to "trash" and garbage collected once dependent snapshots are deleted.
        • There is a caveat noted in the subvolume operations when deleting (subvolume rm) a volume as follows, "The removal of a subvolume fails if it has snapshots, or is non-existent."
      • RBD provides the said solution by always creating a clone for a snapshot request, rather than using RBD snapshots as the representation for the clone
      • Tracker:
        • None
    • MANDATORY CephFS should support an interface that can fetch metadata regarding created snapshots
      • Snapshot requests may be retried/replayed; to respond with the right metadata about snapshots that are already created, an interface like ceph fs subvolume info for snapshots is required
      • Support for the same already exists and is integrated into ceph-csi for RBD
      • Tracker:
        • None
    • DESIRABLE CephFS snapshot protection prior to cloning should be handled by Ceph
      • CephFS snapshot requires protection prior to cloning. This workflow has been revised with RBD, where the snapshot being cloned is protected internally and further can be deleted (ends up in "trash") and its subsequent garbage collection is deferred.
      • The requirement is to make it simpler for ceph-csi to use a similar workflow to RBD for the purpose of cloning.
      • NOTE: This requirement is not a prerequisite for clone operation to be implemented in ceph-csi. Ceph-csi can leverage the existing workflow, and further a CSI snapshot undergoing cloning may not be deleted as per the CSI protocol, as it is in use.
      • Tracker:
    • REJECTED CephFS should provide an interface to clone a volume from a volume, rather than ceph-csi having to snapshot and clone the volume as a 2 step operation.
      • This is a requirement that simplifies ceph-csi implementation
      • Initial doubts were about whether the source volume should be snapshotted by CephFS prior to cloning, or whether this is not required. Further investigation and discussion on enforcing that the source volume is not in use has been clarified by the k8s community: it is the plugin's and user's responsibility to ensure data consistency of the source volume.
      • The workflow hence is to snapshot and then clone, in line with the workflow for an explicit snapshot followed by a clone of the snapshot
      • RBD also had a similar requirement and it stands rejected currently
      • Tracker:
    • FUTURE CephFS clones are full copies; hence, to back up a volume, any backup operation would copy the volume content as part of a clone operation and subsequently copy the created volume contents to a backup store. This makes it a double copy operation, and would be inefficient.
      • CSI protocol also does not support any means to use a snapshot of a volume more directly, say as a read only mount for such purposes
      • [TODO] Investigate options where a clone request can carry annotations that enable ceph-csi to create a lightweight clone of the snapshot, rather than a full clone
      • Tracker:
        • None
    • FUTURE CephFS should provide a mechanism to freeze/unfreeze a mounted volume, like fsfreeze
      • CSI protocol may be enhanced to support better data consistency while taking snapshots. A newer request being discussed is the Freeze/Unfreeze request, that needs support by CephFS
      • Tracker:
        • None
    • FUTURE CephFS should have the ability to generate a snapshot delta between 2 given snapshots, to enable backup vendors or data transfer agents to optimize local filesystem inspection for changed data and to lower data transfer across networks
      • Backup and data protection vendors desire the ability to take incremental or differential backups
      • The typical means to achieve this with a filesystem would be to inspect the last modified time stamps of all inodes and take appropriate actions
      • Instead of the typical mechanism of crawling the file system and backing up the full contents of every file, it would be desirable to have a delta list between 2 snapshots that can help optimize this operation
      • NOTE: This requirement should be evaluated based on interest and also based on how this would be exposed in the future
      • Tracker:
        • None
  2. Choice of mounter

    • FUTURE If CephFS subvolumes are to be mounted via FUSE, rather than using the kernel mounter, ceph-csi needs to solve the problem of maintaining existing mounts across node service restarts
      • NOTE: This is a ceph-csi requirement and not a CephFS requirement, but it may be driven by a CephFS need to use the FUSE mounter
      • NOTE: RBD also requires similar ceph-csi handling if the direction is to use the rbd-nbd driver in the future
      • Tracker:
        • None
  3. Topology based provisioning support

    • MANDATORY ceph-csi should integrate with ceph fs subvolume info to support topology based provisioning
      • This is a ceph-csi gap, but noted here as it was dependent on the CephFS tracker which is now complete
      • Topology based provisioning for RBD is already supported in ceph-csi
      • Tracker:
    • FUTURE Enhance ceph-csi topology support leveraging multi-mds and subtree pinning features of CephFS
      • This is a ceph-csi gap, but noted here to clarify if this is feasible before creating the required trackers
      • Currently, topology based support chooses a datapool for a subvolume closer to where the workload would be scheduled. This makes all MDS operations potentially cross topology boundaries. To be strictly topology constrained, if multi-MDS and subtree pinning could be leveraged, an MDS closer to the node where the workload is scheduled can be pinned to the subvolume.
      • RBD does not have a separate MDS as such, and hence is fully topology constrained as it stands
      • Tracker:
        • None
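    • A minimal sketch of what topology constrained provisioning could look like at the CLI level, assuming a filesystem named cephfs, a subvolume group csi and a zone local data pool cephfs-data-zone-a (all names hypothetical):

        # create the subvolume on the data pool local to the chosen topology segment
        ceph fs subvolume create cephfs csi-vol-0001 --size 1073741824 \
            --group_name csi --pool_layout cephfs-data-zone-a

        # the info output reports, among other fields, the data_pool backing the
        # subvolume, which ceph-csi can map back to a topology segment
        ceph fs subvolume info cephfs csi-vol-0001 --group_name csi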
  4. Mirroring for DR or related use cases

    • FUTURE CephFS should provide a mirroring solution at a subvolume granularity, such that a standby DR site can take over operations in case of disaster
      • This requirement is outside the scope of ceph-csi, although CSI should be able to respond back with the same CephFS mirrored volume on the DR site when provisioning and life cycle requests are made
      • RBD has a mirroring feature available, but it is yet to be integrated into ceph-csi
      • Tracker:
        • None
  5. Encryption

    • FUTURE CephFS should provide for a client side per subvolume encryption feature
      • This feature can then be leveraged by ceph-csi and any KMS on the client side, to provide per volume encryption
      • RBD has LUKS based encryption support that is already leveraged by ceph-csi
      • Tracker:
        • None
  6. Compression

    • FUTURE CephFS should support IO hints to the OSD on data compression
      • IO hints are supported in Ceph that help avoid or force compression of specific IO requests. These hints seem absent in CephFS clients, and may need consideration
      • This does not impact ceph-csi directly, but is a requirement that is satisfied (but not integrated into ceph-csi) by RBD and hence may need consideration
      • Tracker:
        • None
    • FUTURE CephFS should support in flight compression to reduce network bandwidth usage
      • This is useful in cloud environments where cross topology traffic is charged, hence any optimization at the cost of relative performance is desired
      • The assumption is that the messenger protocol would be enhanced to compress data, hence this is not strictly a CephFS concern
      • Tracker:
  7. go-ceph integration for all CephFS CLIs

    • MANDATORY All CephFS CLIs and interfaces used by ceph-csi should have an equivalent go-ceph API binding
      • go-ceph based API invocations are resource friendly, and are the forward-looking path for ceph-csi
      • This requirement is more of a go-ceph repository requirement and is being handled by its maintainers, but is noted here for completeness
      • Tracker:
        • None
  8. Miscellaneous

    • FUTURE CephFS should be able to return a list of nodes where a volume is mounted
      • Certain CSI requests, like ListVolumes and ControllerGetVolume (alpha), require an OPTIONAL response regarding where the volume is mounted
      • Currently both requests are not supported by ceph-csi, owing to the latter being in alpha state, and the former (ListVolumes) not providing required secrets for operating against a ceph cluster
      • Tracker:
        • None
    • FUTURE CephFS subvolume list should provide info for listed subvolumes as an optimization, as otherwise a list needs to be followed up with an info call for each listed item
      • Ideally both ListSnapshots and ListVolumes also need extra information regarding each volume or snapshot that is covered in ceph fs subvolume info; as a future optimization it may help to have an ls -l equivalent that returns the extra metadata for each item listed.
      • Tracker:
        • None
    • FUTURE CephFS client mounts need a mechanism to detect if a mount is healthy
      • Certain CSI requests, ListVolumes, NodeGetVolumeStats and ControllerGetVolume (alpha), have an OPTIONAL response field indicating the health of the volume
      • Currently only NodeGetVolumeStats is supported by ceph-csi, and it does not return this field as it is OPTIONAL
      • NOTE: Determining the health of the mount may be a ceph-csi concern and not a CephFS concern, but it is noted here in case additional CephFS support is required
      • Tracker:
        • None
    • FUTURE Multi-tenant noisy neighbor prevention [TODO]
      • There could be additional requirements in this regard, dealing with fairness across different workloads (tenants?) using the same storage provider.
    • UNKNOWN CephFS should have the ability to fence stale clients
      • CephFS would need the ability to fence or disregard stale mounts and possibly blacklist them, to prevent inadvertent modifications from the stale client
      • Stale clients are typically a concern when a volume is meant to be used read-write by a single node only
      • This is more of a kubernetes and CSI environment requirement, and needs clarification in the specifications, but is noted here for completeness and for any CephFS constructs that may need to be provided
      • Tracker:

Feature classification

Features are classified into,

  • Eco-system considerations for Container Orchestrator (CO, typically kubernetes) and CSI deployments
  • CSI requests (gRPCs)
  • Storage backend features that can be exposed to CO environments

The first set covers features needed due to various environmental factors of CSI and COs. The second set covers the various CSI calls and their resulting requirements, and the last set covers storage backend features that can be exposed to the CO (e.g. data compression, encryption).

Eco-system considerations for CO and CSI deployments

  1. Ability for a CO to perform storage volume life cycle management on CephFS

    • Storage lifecycle would include create/mount/snapshot/clone/delete/resize and related operations
    • This is primarily supported by the ceph fs subvolume interface provided by CephFS [3]
    • All current lifecycle operations are covered in the CSI requests section
  2. Ability for multiple CO instances to use the same instance of CephFS

    • This is provided by CephFS using the ceph fs subvolumegroup interface, and helps isolate the various instances based on the subvolumegroup name
    • Fine grained cephx IDs and keys that restrict access to the created subvolumegroup provide the required authentication/access isolation
    • GAP: This is already supported in ceph-csi for CephFS; there is a reverse gap in RBD which is being addressed here
  3. Ability for a single CO instance to achieve logical isolation of volumes created using the same CephFS instance

    • This can be looked at as a sub-part of the previous requirement, except this is from the same CO instance
    • The solution remains the same, i.e to use the ceph fs subvolumegroup interface
    • The caveat is to ensure that the volume ID/name per volume is unique across the entire CO instance. This is not a CephFS concern, but something to be aware of, that the CSI plugin will ensure
    • FUTURE: There could be additional requirements in this regard, dealing with fairness across different workloads (tenants?) using the same storage provider.
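    • A minimal sketch of the subvolumegroup based isolation described above, assuming a filesystem named cephfs and a per CO instance (or tenant) group and cephx identity (names hypothetical):

        # one subvolume group per CO instance or tenant
        ceph fs subvolumegroup create cephfs csi-cluster-a

        # a cephx identity restricted to that group's directory tree
        # (subvolumes are created under /volumes/<group_name> by default)
        ceph fs authorize cephfs client.csi-cluster-a /volumes/csi-cluster-a rw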
  4. CSI nodeplugin restarts

    • CSI nodeplugin operates on every node that requires a mount of a volume
    • For upgrade or related reasons, the nodeplugin service may be restarted, and existing mounts can become stale if FUSE is used as the mounter
    • Hence the default or suggested mounter to use is the kernel CephFS mounter
    • NOTE: If there are strong reasons to use the FUSE mounter, the stated use case of node plugin restarts would need to be handled
  5. Fencing

    • Mostly to deal with volumes that are expected to be mounted on a single node for read/write cases
    • GAP: CephFS would need the ability to fence or disregard stale mounts and possibly blacklist them, to prevent inadvertent modifications from the stale client
    • This is currently being debated in this issue at the CO/CSI layers
  6. Scale [TODO]

    • To note down typical scaling parameters, number of volumes/snapshots, sizes, active/passive IO rates
    • Most of this may just end up as guesses that will change over time and across users
  7. CSI secrets

    • CSI requests are made using secrets that pertain to the operation context. These secrets, for example, are mentioned in the StorageClass, a k8s construct that specifies parameters for the request and also which driver would handle requests made using the StorageClass. Dynamic volume requests are made referencing a StorageClass, which is then used to pass the secrets and parameters on to the respective storage plugin CSI request
    • In the case of ceph, secrets contain cephx ID and Key
    • Not all CSI requests contain the secrets field; as a result, a subset of requests that require secrets to operate against the Ceph cluster are not supported by ceph-csi
    • FUTURE: There is an open proposal to move secrets from the CSI requests to the CSI controller and node services instead. This helps serve requests that do not pass any secrets, and is the accepted pattern by the CSI, k8s community and other storage vendors.
      • Some requests that stand unsupported due to the CSI secrets constraint hence may need to be supported if the secrets are moved into the CSI plugin instance
  8. Minimum Ceph versions across the client and storage servers

    • [TODO] Based on the subvolume group of interfaces, and newly added interfaces for CSI support, it would help to list out what minimum versions of Ceph and clients are required to support the stack.

CSI requests

CSI requests (or, gRPC calls) are separated into 2 categories,

  • Controller services (controller plugin)
  • Node services (node plugin)

The controller service is responsible for most of the CSI volume lifecycle management activities. The node service is responsible for mounting and managing "use" of the created volumes. The node service does not participate in the IO path, and is only responsible for setting up the volume for access by the workload.

Controller services

  1. CreateVolume

    • Creates a volume of a given size and capability (capabilities include access and type, as detailed further below)
    • Primarily relies on CephFS subvolume series of commands to create the required volume
    • There are 2 other variants for create volume, which are based on a VolumeContentSource field in the create request as follows,
      • Create a volume from a snapshot of another volume (VolumeContentSource is a CephFS snapshot)
      • Create a volume from another volume (clone) (VolumeContentSource is another CephFS volume)
        • The suggested workflow for ceph-csi is to create a snapshot and clone the volume from the same, thereby reusing existing ceph fs subvolume snapshot and clone operations (a sketch follows at the end of this item)
        • From a user perspective, it may be prudent to instead create a non-ephemeral snapshot of the volume to clone from, and reuse the clone-from-snapshot version of CreateVolume instead, for efficiency reasons
        • NOTE: There is a ticket open to support clone inherently via the Ceph CLI, but this is not a blocker for the feature implementation in ceph-csi
    • NOTE: It is feasible that other forms of VolumeContentSource requests may be standardized in the future
    • Volumes have an attribute of access mode that defines how the volume is intended to be accessed, these being,
      • Single node reader
      • Single node reader/writer
      • Multi node reader
      • Multi node reader/writer
      • Multi node reader, single node writer
      • The one unknown is how "multi node reader, single node writer" is required to be implemented, and who owns the restriction that there is only ever a single writer
        • kubernetes does not support this mode as of this writing, and as a result for now this is not a requirement that needs addressing in this environment, nor can further details be elaborated on whose responsibility this would be eventually
      • The rest of the access modes can be supported by CephFS, as actually ensuring that the volume is read only or writeable is left to the CO and CSI, based on how it needs to be mounted (e.g ro/rw)
    • Volumes are also classified as Block or Mount volume types, and with CephFS the type is always going to be "Mount". This is again controlled and filtered by CSI and not a CephFS concern
    • [TODO] Topology based provisioning constraints
      • Short note, CephFS has everything in place to use a single non-topology constrained MDS and different topology constrained data pools to provision topology constrained volumes
      • FUTURE: With the MDS subtree pinning feature, and multi-MDS support on the horizon(?), the topology constraints can be extended to the MDS as well
    • GAP: There is no way to use created CSI snapshots of a volume directly as per CSI protocol. IOW, a snapshot cannot be mounted read only for activities such as backing up the snapshot contents or replicating the same across storage clusters
      • The usage hence is to create (clone) a volume from a snapshot as its VolumeContentSource before use
      • This makes the current CephFS clone operation, used to gain access to data in the snapshot, more heavyweight for use cases such as backup
    • GAP: Currently a clone from a snapshot for CephFS is a full copy, hence the time to create such volumes is indeterminate (depends on the amount of data and metadata to copy)
      • This can hence make the CreateVolume CSI call unresponsive; alternatives to avoid this hang are discussed and noted here
    • GAP: RBD snapshots no longer need to be protected during cloning, and further are automatically placed in the trash when such snapshots that are being cloned are deleted. This makes it a desirable feature in CephFS as well, to avoid the explicit protection requirements. The gap is not a MUST address gap, but more to align the workflow across RBD and CephFS
    • GAP: To support request retries, if a volume being created already exists, the CSI plugin would need to read its attributes and respond with the corresponding size, time and relevant metadata for the request. This is satisfied using the ceph fs subvolume info interface that is provided. ceph-csi is yet to integrate with this to provide the required correctness in such scenarios.
    • FUTURE Ability to restore/clone a snapshot or volume to a pool different from the source pool. This already stands supported with CephFS subvolume clones, as the full copy of the filesystem can be cloned to a different data pool layout as desired.
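    • A rough sketch of the snapshot-then-clone workflow suggested above, using the existing CLI and assuming a filesystem named cephfs, a group csi and hypothetical volume/snapshot names:

        # snapshot the source subvolume and protect it prior to cloning
        ceph fs subvolume snapshot create cephfs csi-vol-src snap-0001 --group_name csi
        ceph fs subvolume snapshot protect cephfs csi-vol-src snap-0001 --group_name csi

        # start the (full copy) clone into a new subvolume
        ceph fs subvolume snapshot clone cephfs csi-vol-src snap-0001 csi-vol-clone \
            --group_name csi --target_group_name csi

        # the clone is asynchronous; poll until the reported state is "complete"
        ceph fs clone status cephfs csi-vol-clone --group_name csi

        # once cloned, an ephemeral snapshot can be unprotected and removed
        ceph fs subvolume snapshot unprotect cephfs csi-vol-src snap-0001 --group_name csi
        ceph fs subvolume snapshot rm cephfs csi-vol-src snap-0001 --group_name csi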
  2. DeleteVolume

    • Deletes a volume that was created using the CreateVolume request
    • Deletion should be independent of existing snapshots for the volume, IOW it should be possible to delete a volume that has existing snapshots, and further these snapshots could still be used, in the future, to create a new volume by cloning the same
    • GAP: There is a caveat noted in the subvolume operations when deleting (subvolume rm) a volume as follows, "The removal of a subvolume fails if it has snapshots, or is non-existent.".
      • Need to understand if the subvolume would remain in trash, and hence not a concern from the CSI request perspective (i.e DeleteVolume will be a success, but volume is not deleted in CephFS and remains in trash), or the CSI deletion request itself would fail till all snapshots are deleted.
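    • The caveat above is visible directly at the CLI level; with the current behaviour the snapshots have to be removed before the subvolume itself (hypothetical names):

        # fails while snapshots of the subvolume exist
        ceph fs subvolume rm cephfs csi-vol-0001 --group_name csi

        # snapshots need to be listed and removed first
        ceph fs subvolume snapshot ls cephfs csi-vol-0001 --group_name csi
        ceph fs subvolume snapshot rm cephfs csi-vol-0001 snap-0001 --group_name csi
        ceph fs subvolume rm cephfs csi-vol-0001 --group_name csi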
  3. Controller[Publish/Unpublish]Volume

    • These are not implemented by ceph-csi
    • These are to control which node a volume can be published to (i.e mounted and used), and serves as a check with the controller service before requesting the node service to mount the same
    • NOTE: There were inclinations to use this where fencing is required, but the way forward is not yet designed
  4. ValidateVolumeCapabilities

    • Given a CSI volume, returns various volume access modes and access types
    • Involves ability to inspect CephFS subvolume for attributes of interest, typically size (quota), time stamps, data pool parameter
    • Currently ceph-csi does not call into CephFS to validate any fields, and only checks if the newly requested capabilities do not include "Block"
    • FUTURE: If in the future ceph-csi needs to call into CephFS for the subvolume info, the ceph fs subvolume info interface would address the requirement
  5. ListVolumes

    • It has been deliberated and decided not to support this RPC via ceph-csi
    • This request also has the issue of not carrying the CSI secrets in its request
    • Further this request does not carry the provisioning parameters of interest, that help narrow down where to list volumes from (e.g pool, subvolume groupname)
    • In the future if this is required, CephFS subvolume group of commands have the subvolume ls CLI that ceph-csi can leverage to provide the required data
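    • If support is added in the future, a listing scoped to the group used by a ceph-csi instance could look as follows (hypothetical names):

        # list subvolumes in the group used by this ceph-csi instance
        ceph fs subvolume ls cephfs --group_name csi

        # each listed item currently needs a follow-up info call for its metadata
        ceph fs subvolume info cephfs csi-vol-0001 --group_name csi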
  6. ControllerGetVolume (Alpha)

    • This is a CSI alpha feature, meaning it is completely experimental and may not be made part of the specification
    • Intention is to return if a volume is healthy (VolumeCondition field) and also list of nodes the volume is published on (VolumeStatus field)
    • This is currently not implemented in ceph-csi
    • This request also does not carry the CSI secrets
    • [TODO] Need to list out how this can be achieved with CephFS, when it requires support in ceph-csi
    • NOTE: VolumeStatus is also returned in the ListVolumes request, which is unsupported in ceph-csi
      • For RBD, a to-be-investigated option could be to leverage the watcher output to return a list of nodes currently using the volume
      • Unsure if CephFS also maintains watchers, and if this is a fairly reliable source of truth
    • NOTE: VolumeCondition is also returned in ListVolumes (unsupported), and NodeGetVolumeStats requests. The latter, in ceph-csi, currently does not return a VolumeCondition as the field is optional in the response
  7. GetCapacity

    • The intention of this request is to return the available capacity of the storage backend, given the parameters of provisioning (e.g subvolumegroup, data pool)
    • The provisioning parameters are the same as those sent in a CreateVolume request, which in turn identify the pool, data pool, ceph cluster of choice etc. IOW this enables us to zero in on a subvolume group and return available bytes left within the group (if such restrictions are possible)
    • This is currently unsupported by ceph-csi as this request does not come with the required CSI secrets. As a result, querying Ceph/CephFS for available storage capacity is not feasible.
  8. ControllerGetCapabilities

    • Purely a CSI instance to CO communication on various features flags that the CSI controller plugin supports
  9. CreateSnapshot

    • Create a snapshot of a volume
    • GAP: Currently this is not implemented in ceph-csi in any form for CephFS, an initial alpha version of the same exists for RBD but is being revamped to not have any dependency between a CSI snapshot and its source volume
    • ceph fs subvolume snapshot group of commands would satisfy the integration point requirements for ceph-csi
    • GAP: Like create requests that may be retried, snapshot requests may also be replayed. To respond back with the right metadata about snapshots that are already created, an interface like ceph fs subvolume info for snapshots is desired.
    • CSI snapshots will have a separate lifecycle independent of the originating volume. For example, it is possible that the original volume needs to be restored from a snapshot, in which case it would be deleted and recreated from the CSI snapshot as the VolumeContentSource. As a result the storage backend also needs to be able to support such isolation, even if it is synthetic.
      • NOTE: It is feasible that a clone from a snapshot can be created as a newly named volume, and then the older volume deleted or garbage collected as a workflow by the users, but having the above independence eases the workflow substantially
      • QUESTION: Can CephFS volumes containing snapshots be deleted (even if they still live on in trash) and then subsequently the snapshots be accessed for clone operations?
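    • Until an info-like interface exists for snapshots, a replayed CreateSnapshot can only verify existence by name against the snapshot listing; a sketch with hypothetical names:

        # create the snapshot for the subvolume
        ceph fs subvolume snapshot create cephfs csi-vol-0001 snap-0001 --group_name csi

        # a retried request has to fall back to checking the listing for the name,
        # as no per-snapshot info (size, creation time) is returned today
        ceph fs subvolume snapshot ls cephfs csi-vol-0001 --group_name csi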
  10. DeleteSnapshot

    • Deletes a snapshot
    • No relevant discussion required here, covered by the ceph fs subvolume snapshot interface
  11. ListSnapshots

    • Currently not implemented or planned to be implemented for ceph-csi. Primarily owing to the CSI secrets issue, and also the request not carrying the CreateVolume parameters to zero in on which pool/group the listing should be from
    • CephFS already has the interface ceph fs subvolume snapshot ls that returns said data
      • FUTURE: Ideally both ListSnapshots and ListVolumes also need extra information regarding each volume or snapshot that is covered in ceph fs subvolume info; as a future optimization it may help to have an ls -l equivalent that returns the extra metadata for each item listed.
  12. ControllerExpandVolume

    • Expand an existing volume, supported via ceph fs subvolume resize in CephFS
    • Expansion can be online/offline, which is fine w.r.t CephFS as it changes the quota which is supported when the volume is in use
    • This is an expansion request and not a shrink, hence only expansion requires support (although with quotas this is immaterial)
    • GAP: ceph-csi should ideally inspect the current size and not resize the volume to the new size if the new size is smaller, as per the CSI specification. The specification states that the volume should be at least as large as the request, and if already bigger the plugin can just respond back with success. For inspection of the current size the ceph fs subvolume info interface is available, and hence it is feasible for ceph-csi to update its checks.
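    • A sketch of the underlying resize (hypothetical names); the no_shrink guard matches the CSI expectation that the volume only ever grows:

        # expand the subvolume quota to 2 GiB, refusing to shrink below the current size
        ceph fs subvolume resize cephfs csi-vol-0001 2147483648 --group_name csi --no_shrink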

Node services

  1. Node[Stage/Unstage]Volume

    • Request for the initial global mount of a CSI volume on a node
    • For CephFS this would mount the subvolume path using the mounter of choice (FUSE/kernel), and further needs an interface to convert the subvolume to a path in the CephFS instance for the mount, which is provided by the ceph fs subvolume getpath interface
    • NOTE: As noted in CSI nodeplugin restarts section, if using FUSE as the mounter, and the nodeplugin is restarted, all mounts would become stale.
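    • A sketch of the staging step, assuming hypothetical names, monitor addresses and a cephx user csi-node:

        # resolve the subvolume to its path within the CephFS instance
        ceph fs subvolume getpath cephfs csi-vol-0001 --group_name csi
        # e.g. /volumes/csi/csi-vol-0001/<uuid>

        # kernel mounter (default/suggested)
        mount -t ceph mon1:6789:/volumes/csi/csi-vol-0001/<uuid> /staging/path \
            -o name=csi-node,secretfile=/etc/ceph/csi-node.secret

        # FUSE mounter alternative (subject to the nodeplugin restart caveat noted earlier)
        ceph-fuse -n client.csi-node -r /volumes/csi/csi-vol-0001/<uuid> /staging/path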
  2. Node[Publish/Unpublish]Volume

    • Request for a subsequent bind mount of the global mount to a specific workload path in the node
    • This is a bind mount that is executed on the node and hence there is no interaction with CephFS required at this stage
    • NOTE: A bind mount may add the read only flag to the bind mount, when the global mount is a rw mount, to support various volume access types as detailed in CreateVolume
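    • For example (hypothetical paths), the publish step is a plain bind mount, with a read-only remount of the bind when a read-only publish is requested:

        # bind the staged mount into the workload's target path
        mount --bind /staging/path /publish/path

        # for a read-only publish, remount the bind read-only
        mount -o remount,bind,ro /publish/path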
  3. NodeGetVolumeStats

    • Used to get statfs information about mounted volume on a node
    • statfs output for a CephFS subvolume should reflect maximum and free inodes and block information at the subvolume granularity. This is already the case and hence supported.
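    • For example, a statfs against the mounted subvolume path reports block and inode usage bounded by the subvolume quota (hypothetical path):

        # values reported here reflect the subvolume quota, not the whole filesystem
        stat -f /staging/path
        df -h /staging/path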
  4. NodeGetCapabilities

    • Purely a CSI instance to CO communication on various features flags that the CSI node plugin supports
  5. NodeGetInfo

    • Mostly CSI node specific data exchange between CSI and the CO
    • There is one option in the response that states how many volumes this node can support. If in the future it is required to control the number of volumes per node, given node characteristics, this may be leveraged to control the maximum mounted instances per node.
  6. NodeExpandVolume

    • Post expansion of the volume on the controller, a live mount may require a resize on published nodes. This request is to achieve the same
    • This request is a NOP for CephFS as once the quota is reset on the subvolume, the mount would get refreshed with the updated values subsequently
  7. Node[Freeze/Unfreeze] FUTURE

    • Upcoming proposal to add a node level volume Freeze/Unfreeze operation
    • Intention of freeze is to pause changes to the volume, till it is unfrozen
    • FUTURE: Explore (or elaborate) ways in which to freeze CephFS subvolume mounts. Would fsfreeze be an alternative here?
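    • If fsfreeze turns out to be the answer here, the node level operation could reduce to the following, assuming the mounted filesystem supports freezing (hypothetical path):

        # pause writes to the mounted subvolume, e.g. while a snapshot is taken
        fsfreeze --freeze /staging/path

        # resume IO once the snapshot completes
        fsfreeze --unfreeze /staging/path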

Additional storage features

  1. Encryption

    • GAP: RBD client side per-volume encryption is supported in ceph-csi (using LUKS and integrated to VAULT as the KMS). There is currently no solution available for CephFS
  2. Compression

    • Pool level compression settings are configurable for all pools that back either RBD or CephFS. This is done via Rook and addresses at-rest compression
    • GAP: IO hints are supported in Ceph that help avoid or force compression of specific IO blocks. These hints seem absent in CephFS clients, and may need consideration
    • FUTURE: In flight compression is possibly at the transport layer (ceph messenger v2), to keep it generic across all protocols using the same
  3. DR and Backup/Restore

    • The bulk of Backup/Restore is to take snapshots periodically and back them up to a backup vendor controlled data store
    • The ability to clone snapshots is required to access the snapshot data as per the CSI protocol
    • CONCERN: As current CephFS clones are full logical filesystem copies, when used for backup purposes this would result in a double copy of the data
    • Mirroring is the other sought after solution, more in the disaster recovery space than for long term data retention and backup
      • RBD supports mirroring, and a prototype was created with ceph-csi to demonstrate its DR capabilities
      • GAP: CephFS does not have a mirroring solution yet, and a proposal is in the works for the same
    • FUTURE: Ability to restore a snapshot to a different pool than the originating volume is a desirable feature. This already stands supported with CephFS subvolume clones, as the full copy of the filesystem can be cloned to a different data pool layout as desired.
    • FUTURE: Ability to generate a snapshot delta between 2 given snapshots; this enables backup vendors or data transfer agents to optimize local filesystem inspection for changed data and to lower data transfer across networks
  4. go-ceph [4] API bindings for all interfaces

    • GAP: Not all interfaces that are used by ceph-csi have an equivalent mapping in go-ceph, especially some of the extended manager commands. This is required to ensure performance, scale and resource consumption optimization of the controller service and node service, as using multiple CLI invocations is both costly in terms of time to completion of the request, and also resource intensive when multiple CLIs are executed in parallel (for parallel requests).
  5. UID/GID mapping [TODO]

References

[1] Container Storage Interface (CSI) specification

[2] Ceph-csi integration

[3] CephFS subvolumes

[4] Go bindings for Ceph

Eco-system projects and groups of interest

  1. CSI specification
  2. WG notes/meetings:
    • Storage
    • Data Protection
    • NOTE: It is probably best to keep track of the meeting minutes and notes from the WG meetings
  3. k8s sidecar repositories:
  4. KEPs (Kubernetes Enhancement Proposals)
    • sig-storage
    • NOTE: There are other SIGs within the KEPs that may have related enhancements to storage, and are possibly best kept track of using the label sig/storage