yallop/osxfs-caching.md Secret

Optimizing Generic File System Sharing from Bind Mounts

Summary

This document proposes an extension to bind mount semantics that unlocks
significant performance improvements for osxfs.  In preliminary
measurements, go list running time drops from 25.7 seconds to 7.6
seconds.

Background

With Docker distributed on an increasing number of platforms,
including macOS and Windows, generalizing mount semantics during
container run is necessary to enable workload optimizations.

The current implementations of mounts on Linux and macOS provide a
consistent view of a host directory tree inside a container: reads and
writes performed either on the host or in the container are immediately
reflected in the other environment, and file system events (inotify,
FSEvents) are consistently propagated in both directions.

On Linux these guarantees carry no overhead, since the underlying VFS is
shared directly between host and container.  However, on macOS (and
other non-Linux platforms) there are significant overheads to
guaranteeing perfect consistency, since messages describing file system
actions must be passed synchronously between container and host.  The
current implementation is sufficiently efficient for most tasks, but
with certain types of workload the overhead of maintaining perfect
consistency can result in performance that is significantly worse than a
native (non-Docker) environment.  For example,


running go list ./... in a bind-mounted golang source tree takes
around 26 seconds


writing 100MB in 1k blocks into a bind-mounted directory takes
around 23 seconds


running ember build on a freshly created (i.e. empty) application
involves around 70000 sequential syscalls, each of which translates
into a request and response passed between container and host.


Optimizations to reduce latency throughout the stack have brought
significant improvements to these workloads over the last few months,
and a few further optimization opportunities remain.  However, even when
latency is minimized, the constraints of maintaining consistency mean
that these workloads are likely to remain unacceptably slow for some
users.

Fortunately, in many cases where the performance degradation is most
severe, perfect consistency between container and host is unnecessary.
In particular, in many cases there is no need for writes performed in a
container to be immediately reflected on the host.  For example, while
interactive development requires that writes to a bind-mounted directory
on the host immediately generate file system events within a container,
there is no need for writes to build artefacts within the container to
be immediately reflected on the host file system.  At present these two
cases are treated identically, but distinguishing between them will
allow us to significantly improve performance.

Several users are already using third-party solutions (e.g. rsync or
asynchronous NFS) that offer improved performance for particular use
cases at the cost of both consistency and other benefits such as proper
handling of permissions and file system events.  Allowing these users to
select between perfect consistency and improved performance on a
per-mount basis will greatly improve their Docker experience.  We also
anticipate being able to offer performance that surpasses NFS through
aggressive caching.

There are three broad scenarios to consider.  In each case the container
has an internally-consistent view of bind-mounted directories, but in
two cases temporary discrepancies are allowed between container and host.


consistent: perfect consistency

(host and container have an identical view of the mount at all times)


cached: the host's view is authoritative

(permit delays before updates on the host appear in the container)


delegated: the container's view is authoritative

(permit delays before updates in the container appear on the host)


These three options are described in more detail below ("Semantics").

Preliminary performance improvements

Preliminary measurements reveal that the new configurations can offer
immediate performance improvements.

In this section, 'preliminary' indicates that the implementation under
test uses only aggressive Linux VFS caching features rather than any of
the additional and significant structural caching opportunities enabled
by the semantics below. All measurements were performed on a pinata
branch from Beta 40. Parenthesized ratios represent the improvement in
total running time over the consistent configuration.  "With[out] disk
image synchronization" refers to whether F_FULLFSYNC is used on the
pinata qcow block device.


running go list ./... in a golang source tree (read-only workload):

without shared directory: 2.0s
with a consistent shared directory: 25.7s
with preliminary cached semantics cold: 14.4s (1.78×)
with preliminary cached semantics hot: 7.6s (3.38×)


performing many small writes (dd 100,000 1k blocks, write-only):

without disk image synchronization: 1.0s
with disk image synchronization: 1.6s
with a consistent shared directory: 22.7s
with preliminary delegated semantics: 1.9s (11.95×)


running ember build on a freshly created (i.e. empty) application:

without disk image synchronization (unsafe): 10.1s
with disk image synchronization (default): 10.5s
with a consistent shared directory: 27.2s
with preliminary cached semantics cold: 22.4s (1.21×)
with preliminary cached semantics hot: 17.9s (1.52×)
with preliminary delegated semantics cold: 21.6s (1.26×)
with preliminary delegated semantics hot: 17.7s (1.54×)
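The parenthesized ratios above are the consistent running time divided by the running time under the relaxed configuration. A minimal Python sketch of that arithmetic for three of the measurements follows; the dictionary layout and the function name `speedup` are illustrative choices, not part of the proposal.

```python
# Running times in seconds, copied from the measurements above.
consistent = {"go list": 25.7, "dd": 22.7, "ember build": 27.2}
relaxed = {
    ("go list", "cached, hot"): 7.6,
    ("dd", "delegated"): 1.9,
    ("ember build", "cached, hot"): 17.9,
}


def speedup(baseline_s: float, measured_s: float) -> float:
    """Improvement in total running time over the consistent baseline."""
    return baseline_s / measured_s


for (workload, config), seconds in relaxed.items():
    ratio = speedup(consistent[workload], seconds)
    print(f"{workload} ({config}): {ratio:.2f}x")
```

This prints 3.38x, 11.95x, and 1.52x for the three rows, matching the figures listed.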


Design

The bind options are the natural place to support the selection of
semantics.  PR28527 lists the existing options (type,
readonly, etc.).

We propose adding a single new bind option, state, with values
consistent (the default), cached, and delegated.  No additional
interface changes are needed for this proposal, although broadening the
interface (e.g. to specify per-mount policies around permissions, uids,
etc.) might be useful in future.  A number of additional improvements to
the interface are discussed below ("Design and Development
Considerations").

As the detailed semantics below show, ignoring the new option altogether
is an acceptable implementation of the various semantics on Linux. This
property enables cross-platform compatibility without any low-level
changes in the Docker engine.
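To make the proposed option concrete, here is a minimal Python sketch of how a client might validate the state option and apply the Linux rule described above. The comma-separated `key=value` syntax and the names `parse_bind_options` and `effective_state` are assumptions made for illustration; the proposal specifies only the option name and its three values.

```python
VALID_STATES = ("consistent", "cached", "delegated")


def parse_bind_options(spec: str) -> dict:
    """Parse 'key=value,flag,...' into a dict and validate the state option.

    The comma-separated syntax is an assumption made for this sketch.
    """
    options = {}
    for field in spec.split(","):
        if not field:
            continue
        key, sep, value = field.partition("=")
        # A bare flag such as 'readonly' is recorded as True.
        options[key] = value if sep else True
    # consistent is the default when no state is supplied.
    state = options.setdefault("state", "consistent")
    if state not in VALID_STATES:
        raise ValueError(f"unknown state {state!r}; expected one of {VALID_STATES}")
    return options


def effective_state(options: dict, host_platform: str) -> str:
    """On Linux the option may be ignored: plain bind mounts already satisfy
    every configuration, so the strongest semantics apply at no cost."""
    if host_platform == "linux":
        return "consistent"
    return options["state"]


opts = parse_bind_options("type=bind,readonly,state=cached")
print(effective_state(opts, "darwin"))  # cached
print(effective_state(opts, "linux"))   # consistent
```

Defaulting to consistent keeps existing command lines working unchanged, which is the compatibility property the proposal relies on.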

Semantics

The semantics of each configuration are described as a set of guarantees
relating to the observable effects of file system operations.  In this
specification, "host" refers to the file system of the user's Docker
client.

delegated Semantics

The delegated configuration provides the weakest set of guarantees.
For directories mounted with delegated, the container's view of the
file system is authoritative, and writes performed by containers may not
be immediately reflected on the host file system.  As with (e.g.) NFS
asynchronous mode, if a running container with a delegated bind mount
crashes, then writes may be lost.

However, by relinquishing consistency, delegated mounts can offer
significantly better performance than the other configurations.  Where
the data written is ephemeral or readily reproducible (e.g. scratch
space or build artefacts) delegated may be optimal for a user's
workload.

A delegated mount offers the following guarantees, which are presented
as constraints on the container run-time:

(1) If the implementation offers file system events, the container state
as it relates to a specific event MUST reflect the host file system
state at the time the event was generated if no container modifications
pertain to related file system state.

(2) If flush or sync operations are performed, relevant data MUST be
written back to the host file system.  Between flush or sync
operations containers MAY cache data written, metadata modifications,
and directory structure changes.

(3) All containers hosted by the same run-time MUST share a consistent
cache of the mount.

(4) When any container sharing a delegated mount terminates, changes
to the mount MUST be written back to the host file system. If this
writeback fails, the container's execution MUST fail via exit code
and/or Docker event channels.

(5) If a delegated mount is shared with a cached or a consistent
mount, those portions that overlap MUST obey cached or consistent
mount semantics respectively.

Besides these constraints, the delegated configuration offers the
container run-time a degree of flexibility:

(6) Containers MAY retain file data and metadata (including directory
structure, existence of nodes, etc) indefinitely and this cache MAY
desynchronize from the file system state of the host. Implementors are
encouraged to expire caches when host file system changes occur but,
due to platform limitations, may be unable to do this in any specific
time frame.

(7) If changes to the mount source directory are present on the host
file system, those changes MAY be lost when the delegated mount
synchronizes with the host source directory.

However,

(8) Behaviors 6-7 DO NOT apply to sockets, pipes, or device files.
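The write-back behaviour in guarantees 2 and 4 can be sketched with a toy in-memory model. This is an illustration under simplifying assumptions, not the osxfs implementation: the host file system is a plain dictionary, and the names `DelegatedMount` and `WritebackError` are hypothetical.

```python
class WritebackError(Exception):
    """Raised when cached writes cannot be stored on the host."""


class DelegatedMount:
    """Toy model: the container's view is authoritative between syncs."""

    def __init__(self, host_files: dict):
        self.host = host_files  # stands in for the host file system
        self.dirty = {}         # container writes not yet written back

    def write(self, path: str, data: bytes) -> None:
        # Guarantee 2: writes MAY be cached between flush or sync operations.
        self.dirty[path] = data

    def read(self, path: str) -> bytes:
        # The container always sees its own writes first.
        if path in self.dirty:
            return self.dirty[path]
        return self.host[path]

    def sync(self) -> None:
        # Guarantee 2: flush or sync MUST write relevant data back to the host.
        # Behavior 7: host-side changes to the same paths are overwritten.
        try:
            self.host.update(self.dirty)
        except Exception as exc:
            raise WritebackError(str(exc)) from exc
        self.dirty.clear()

    def terminate(self, exit_code: int) -> int:
        # Guarantee 4: write back on termination; a failed writeback MUST
        # surface as a failure, modelled here as a non-zero exit code.
        try:
            self.sync()
        except WritebackError:
            return exit_code or 1
        return exit_code


host = {"build/out.o": b"old"}
mount = DelegatedMount(host)
mount.write("build/out.o", b"new")
print(host["build/out.o"])  # b'old': the host has not seen the write yet
print(mount.terminate(0))   # 0: writeback succeeded
print(host["build/out.o"])  # b'new'
```

The model deliberately omits guarantee 3 (a cache shared between containers) and file system events to stay short.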

cached Semantics

The cached configuration provides all the guarantees of the
delegated configuration and some additional guarantees around the
visibility of writes performed by containers.  For directories mounted
with cached, the host's view of the file system is authoritative;
writes performed by containers are immediately visible to the host, but
there may be a delay before writes performed on the host are visible
within containers.

(1) Implementations MUST obey delegated Semantics 1-5.

Additionally,

(2) If the implementation offers file system events, the container state
as it relates to a specific event MUST reflect the host file system
state at the time the event was generated.

(3) Container mounts MUST perform metadata modifications, directory
structure changes, and data writes consistently with the host file
system, and MUST NOT cache data written, metadata modifications, or
directory structure changes.

(4) If a cached mount is shared with a consistent mount, those portions
that overlap MUST obey consistent mount semantics.

Some of the flexibility of the delegated configuration is retained,
namely:

(5) Implementations MAY permit delegated Semantics 6.
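A toy in-memory model of these rules, in the same spirit as a write-through cache, might look like the following. It is a sketch under simplifying assumptions rather than the osxfs design: the host file system is a plain dictionary, and the class name `CachedMount` and method `on_host_event` are hypothetical.

```python
class CachedMount:
    """Toy model: the host's view is authoritative; container reads may lag."""

    def __init__(self, host_files: dict):
        self.host = host_files  # stands in for the host file system
        self.read_cache = {}    # possibly stale copies of host data

    def write(self, path: str, data: bytes) -> None:
        # cached Semantics 3: writes go straight to the host, never deferred.
        self.host[path] = data
        self.read_cache[path] = data

    def read(self, path: str) -> bytes:
        # delegated Semantics 6 (permitted here): cached data MAY be stale.
        if path not in self.read_cache:
            self.read_cache[path] = self.host[path]
        return self.read_cache[path]

    def on_host_event(self, path: str) -> None:
        # cached Semantics 2: once an event is delivered, the container's
        # state for that path MUST reflect the host, so drop the cached copy.
        self.read_cache.pop(path, None)


host = {"src/main.go": b"v1"}
mount = CachedMount(host)
print(mount.read("src/main.go"))  # b'v1'
host["src/main.go"] = b"v2"       # an edit made on the host
print(mount.read("src/main.go"))  # b'v1': a stale read is permitted
mount.on_host_event("src/main.go")
print(mount.read("src/main.go"))  # b'v2'
```

Compared with a delegated mount, the only state kept in the container is read-side: nothing written by the container is ever held back from the host.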

consistent Semantics

The consistent configuration places the most severe restrictions on
the container run-time.  For directories mounted with consistent, the
container and host views are always synchronized: writes performed
within the container are immediately visible on the host, and writes
performed on the host are immediately visible within the container.

The consistent configuration most closely reflects the behaviour of
bind mounts on Linux.  However, the overheads of providing strong
consistency guarantees make it unsuitable for a few use cases, where
performance is a priority and maintaining perfect consistency has low
priority.

(1) Implementations MUST obey cached Semantics 1-4.

Additionally,

(2) Container mounts MUST reflect metadata modifications, directory
structure changes, and data writes on the host file system immediately.

default Semantics

The default configuration is identical to the consistent
configuration except for its name. Crucially, this means that the
requirements in cached Semantics 4 and delegated Semantics 5 to
strengthen overlapping directories do not apply to default mounts. This
is the configuration used if no state flags are supplied.
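The overlap rules (delegated Semantics 5, cached Semantics 4, and the exception for default mounts) can be summarized as "the strongest explicitly requested state wins". The Python sketch below encodes one reading of those rules; the function name `overlap_semantics` and the numeric ranking are assumptions, and treating default mounts as taking no part in strengthening is an interpretation of the text rather than something it spells out.

```python
# Ordering from weakest to strongest guarantees.
STRENGTH = {"delegated": 0, "cached": 1, "consistent": 2}


def overlap_semantics(states: list) -> str:
    """Semantics that the overlapping portion of several mounts must obey."""
    # default mounts do not trigger strengthening, so leave them out.
    explicit = [s for s in states if s != "default"]
    if not explicit:
        return "consistent"  # default behaves exactly like consistent
    return max(explicit, key=STRENGTH.__getitem__)


print(overlap_semantics(["delegated", "cached"]))   # cached
print(overlap_semantics(["cached", "consistent"]))  # consistent
print(overlap_semantics(["delegated", "default"]))  # delegated
```

Under this reading, adding a mount with no state flag never slows down an existing delegated or cached mount of the same directory.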

Design and Development Considerations

Besides the additional bind option described above, users could ideally
use a configuration file, extended attribute, application database, or
Dockerfile to persistently specify mount sources that should always be
associated with certain consistency properties. Without this capability,
each user must either learn of these flags or rely solely on scripts
that wrap the Docker command line and provide these flags. Additionally,
many common developer use cases such as mounting a source code directory
would benefit from mounts that use different consistency properties on
different subdirectories. For example, project build directories could
likely be delegated when running a build monitoring container whereas
source, storage, or control directories should not be delegated.

Finally, the delegated mode requires the ability to interpose on a
container shutdown sequence and potentially alter its success
status. This capability may be present in runc or containerd, but it
is not clear whether or how it is exposed to the Docker run-time.
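As an illustration of the per-subdirectory consistency properties mentioned above, the following Python sketch resolves the state for a path by picking the deepest configured directory that contains it. The policy table, the directory names, and the function `state_for` are hypothetical; the proposal does not define a configuration format.

```python
from pathlib import PurePosixPath


def state_for(path: str, policies: dict, fallback: str = "consistent") -> str:
    """Return the state of the deepest policy directory that contains path."""
    target = PurePosixPath(path)
    best, best_depth = fallback, -1
    for directory, state in policies.items():
        prefix = PurePosixPath(directory)
        if target == prefix or prefix in target.parents:
            depth = len(prefix.parts)
            if depth > best_depth:
                best, best_depth = state, depth
    return best


# Hypothetical policy: sources stay cached, build output is delegated.
policies = {"/project": "cached", "/project/build": "delegated"}
print(state_for("/project/build/out.o", policies))  # delegated
print(state_for("/project/src/main.go", policies))  # cached
print(state_for("/tmp/scratch", policies))          # consistent
```

Paths outside every configured directory fall back to consistent, mirroring the default when no state is supplied.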

Conclusion

With minimal client change, the cached and delegated mount flags
offer the potential for large speedups of shared file system performance.

Feedback

Please contact dsheets@docker.com and yallop@docker.com with questions,
concerns, suggestions, or additional resources.