Description of available Apache Mesos isolators

List of Isolators

Side note: all available resource metrics are documented here:

https://github.com/apache/mesos/blob/037a346a205ad7bdba99d771855f8caeea835d4a/include/mesos/mesos.proto#L1015

Filesystem Isolators

These are used to isolate files on disk both from the host system and from other running tasks.

filesystem/posix

Generic POSIX-compatible file isolation. Essentially creates a folder owned by the task's user/group.
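A minimal sketch of what this amounts to (using `mktemp` to stand in for the agent's work_dir layout; the mode is illustrative):

```shell
# Sketch: a private per-task sandbox directory, accessible only to the
# task's user/group. mktemp stands in for the agent's work_dir layout.
sandbox=$(mktemp -d)
chmod 750 "$sandbox"        # task user/group only; others cannot enter
stat -c 'mode=%a owner=%U' "$sandbox"
rm -rf "$sandbox"
```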

filesystem/windows

// TODO(hausdorff): (MESOS-5462) For now the Windows isolators are essentially
// direct copies of their POSIX counterparts. In the future, we expect to
// refactor the POSIX classes into platform-independent base class, with
// Windows and POSIX implementations. For now, we leave the Windows
// implementations as inheriting from the POSIX implementations.

filesystem/linux

Linux-specific isolation using mount namespaces.

filesystem/shared

// This isolator is to be used when all containers share the host's
// filesystem.  It supports creating mounting "volumes" from the host
// into each container's mount namespace. In particular, this can be
// used to give each container a "private" system directory, such as
// /tmp and /var/tmp.

Deprecated in favor of filesystem/linux.

Runtime Isolators

These isolators are used to ensure that a task behaves well at runtime and also provide runtime usage metrics for the given resource.

posix/cpu

Provides no actual resource isolation, but does report usage metrics.

Metrics: CPU user time & system time. See: https://github.com/apache/mesos/blob/037a346a205ad7bdba99d771855f8caeea835d4a/src/usage/usage.cpp#L35

posix/mem

Provides no actual resource isolation, but does report usage metrics.

Metrics: mem_rss_bytes. See: https://github.com/apache/mesos/blob/037a346a205ad7bdba99d771855f8caeea835d4a/src/usage/usage.cpp#L35

posix/disk

Uses du -k -s to ensure tasks stay within disk usage limits.

Can Kill Tasks? Yes

Metrics: disk_limit_bytes, disk_used_bytes

// This isolator monitors the disk usage for containers, and reports
// ContainerLimitation when a container exceeds its disk quota. This
// leverages the DiskUsageCollector to ensure that we don't induce too
// much CPU usage and disk caching effects from running 'du' too
// often.
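A rough sketch of the check this amounts to (the 1 MiB limit here is hypothetical; the real isolator takes the quota from the task's disk resource and runs the collection on a timer):

```shell
# Measure sandbox usage the way the isolator does (du -k -s), then
# compare against a hypothetical 1 MiB quota.
sandbox=$(mktemp -d)
dd if=/dev/zero of="$sandbox/data" bs=1024 count=2048 2>/dev/null   # ~2 MiB of data

limit_kb=1024
used_kb=$(du -k -s "$sandbox" | awk '{print $1}')

if [ "$used_kb" -gt "$limit_kb" ]; then
  # the real isolator would report a ContainerLimitation here
  echo "over quota: ${used_kb}KB used, ${limit_kb}KB allowed"
fi
rm -rf "$sandbox"
```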

disk/du

Alias for posix/disk

Can Kill Tasks? Yes

disk/xfs

The XFS Disk isolator uses XFS project quotas to track the disk space used by each container sandbox and to enforce the corresponding disk space allocation. Write operations performed by tasks exceeding their disk allocation will fail with an EDQUOT error. The task will not be terminated by the containerizer.

The XFS disk isolator is functionally similar to the posix/disk isolator but avoids the cost of repeatedly running du. Although the two will not interfere with each other, using them together is not recommended.

Metrics: disk_limit_bytes, disk_used_bytes
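Under the hood this relies on standard XFS project quota tooling. A sketch of the equivalent manual setup, where the mount point, sandbox path, project name, and limit are all hypothetical:

```shell
# Assign project ID 1000 to a sandbox directory and cap it at 1 GiB.
# /mnt/xfs must be an XFS filesystem mounted with the pquota option.
echo "1000:/mnt/xfs/sandbox" >> /etc/projects
echo "mesos_sandbox:1000"    >> /etc/projid
xfs_quota -x -c 'project -s mesos_sandbox' /mnt/xfs
xfs_quota -x -c 'limit -p bhard=1g mesos_sandbox' /mnt/xfs
# Writes beyond 1 GiB now fail with EDQUOT instead of killing the task.
```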

windows/cpu

// A basic MesosIsolatorProcess that keeps track of the pid but
// doesn't do any resource isolation. Subclasses must implement
// usage() for their appropriate resource(s).
//
// TODO(hausdorff): (MESOS-5462) For now the Windows isolators are essentially
// direct copies of their POSIX counterparts. In the future, we expect to
// refactor the POSIX classes into platform-independent base class, with
// Windows and POSIX implementations. For now, we leave the Windows
// implementations as inheriting from the POSIX implementations.

cgroups/cpu

Uses Cgroups cpu and cpuacct subsystems:

cpu
       Cgroups can be guaranteed a minimum number of "CPU shares"
       when a system is busy.  This does not limit a cgroup's CPU
       usage if the CPUs are not busy.

       Further information can be found in the kernel source file
       Documentation/scheduler/sched-bwc.txt.

cpuacct
       This provides accounting for CPU usage by groups of tasks.

       Further information can be found in the kernel source file
       Documentation/cgroup-v1/cpuacct.txt.

(from cgroups(7) man page)

// Use the Linux cpu cgroup controller for cpu isolation which uses the
// Completely Fair Scheduler (CFS).
// - cpushare implements proportionally weighted scheduling.
// - cfs implements hard quota based scheduling.

Metrics: processes, threads, cpus_user_time_secs, cpus_system_time_secs

Additional metrics when using CFS: cpus_nr_periods, cpus_nr_throttled, cpus_throttled_time_secs
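The arithmetic behind those two modes can be sketched as follows (1024 shares per CPU and the 100ms CFS period match the usual kernel defaults; treat the exact Mesos rounding as an assumption):

```shell
# How a task's cpus value maps onto the two cgroup knobs.
cpus_milli=500                                 # task asks for cpus:0.5

shares=$(( cpus_milli * 1024 / 1000 ))         # cpu.shares (cpushare: proportional weight)
period_us=100000                               # cpu.cfs_period_us (default, 100ms)
quota_us=$(( cpus_milli * period_us / 1000 ))  # cpu.cfs_quota_us (cfs: hard cap)

echo "cpu.shares=$shares cpu.cfs_period_us=$period_us cpu.cfs_quota_us=$quota_us"
# prints: cpu.shares=512 cpu.cfs_period_us=100000 cpu.cfs_quota_us=50000
```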

https://github.com/apache/mesos/blob/037a346a205ad7bdba99d771855f8caeea835d4a/src/slave/containerizer/mesos/isolators/cgroups/cpushare.cpp#L446

cgroups/devices

// This isolator uses the cgroups devices subsystem to
// restrict access to devices in `/dev`. A small set of
// default devices are whitelisted upon container creation,
// and access to all other devices is restricted. It is
// assumed that other isolators will be used to allow / deny
// access to devices outside the default whitelist.

Whitelist

 devices
        This supports controlling which tasks may create (mknod)
        devices as well as open them for reading or writing.  The
        policies may be specified as whitelists and blacklists.
        Hierarchy is enforced, so new rules must not violate existing
        rules for the target or ancestor cgroups.

        Further information can be found in the kernel source file
        Documentation/cgroup-v1/devices.txt.

(from cgroups(7) man page)

Metrics: none
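The whitelist is expressed through the cgroup-v1 `devices.deny`/`devices.allow` files. A sketch of the shape of those writes (the cgroup path is hypothetical; the device numbers are the standard ones for the listed `/dev` nodes):

```shell
# Format of each entry: type (a/b/c), major:minor, access (r/w/m).
cgroup=/sys/fs/cgroup/devices/mesos/example   # hypothetical container cgroup
echo 'a *:* rwm' > "$cgroup/devices.deny"     # start fully closed
echo 'c 1:3 rwm' > "$cgroup/devices.allow"    # /dev/null
echo 'c 1:5 rwm' > "$cgroup/devices.allow"    # /dev/zero
echo 'c 1:8 rwm' > "$cgroup/devices.allow"    # /dev/random
echo 'c 1:9 rwm' > "$cgroup/devices.allow"    # /dev/urandom
```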

cgroups/mem

Cgroups memory subsystem:

 memory
        The memory controller supports reporting and limiting of
        process memory, kernel memory, and swap used by cgroups.

        Further information can be found in the kernel source file
        Documentation/cgroup-v1/memory.txt.

Can Kill Tasks? Yes

Metrics:

mem_total_bytes

// Total memory + swap usage. This is set if swap is enabled.
mem_total_memsw_bytes

// Hard memory limit for a container.
mem_limit_bytes

// Soft memory limit for a container.
mem_soft_limit_bytes

// Broken out memory usage information: pagecache, rss (anonymous),
// mmaped files and swap.

// TODO(chzhcn) mem_file_bytes and mem_anon_bytes are deprecated in
// 0.23.0 and will be removed in 0.24.0.
mem_file_bytes
mem_anon_bytes

// mem_cache_bytes is added in 0.23.0 to represent page cache usage.
mem_cache_bytes

// Since 0.23.0, mem_rss_bytes is changed to represent only
// anonymous memory usage. Note that neither its requiredness, type,
// name nor numeric tag has been changed.
mem_rss_bytes

mem_mapped_file_bytes
// This is only set if swap is enabled.
mem_swap_bytes
mem_unevictable_bytes
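The broken-out values come from the cgroup's memory.stat file. A sketch of that mapping using a synthetic file (the real one sits in the container's memory cgroup directory, and the byte values here are made up):

```shell
# Map cgroup-v1 memory.stat fields onto the metric names above.
stat_file=$(mktemp)
cat > "$stat_file" <<'EOF'
cache 4096
rss 8192
mapped_file 2048
swap 0
EOF

mem_cache_bytes=$(awk '$1 == "cache" {print $2}' "$stat_file")             # page cache
mem_rss_bytes=$(awk '$1 == "rss" {print $2}' "$stat_file")                 # anonymous only
mem_mapped_file_bytes=$(awk '$1 == "mapped_file" {print $2}' "$stat_file")

echo "mem_cache_bytes=$mem_cache_bytes mem_rss_bytes=$mem_rss_bytes mem_mapped_file_bytes=$mem_mapped_file_bytes"
# prints: mem_cache_bytes=4096 mem_rss_bytes=8192 mem_mapped_file_bytes=2048
rm -f "$stat_file"
```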

cgroups/net_cls

The cgroups/net_cls isolator allows operators to provide network performance isolation and network segmentation for containers within a Mesos cluster.

Read more in the Mesos documentation.

Metrics: none

cgroups/perf_event

TODO

appc/runtime

See docker/runtime below. Same concept, except for appc images.

Metrics: none

docker/runtime

The Docker Runtime isolator is used for supporting runtime configurations from the docker image (e.g., Entrypoint/Cmd, Env, etc.). This isolator is tied with --image_providers=docker. If --image_providers contains docker, this isolator must be used. Otherwise, the agent will refuse to start.

To enable the Docker Runtime isolator, append docker/runtime to the --isolation flag when starting the agent.
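For example (the master address and work_dir are illustrative; filesystem/linux is included because Docker image provisioning depends on it):

```shell
# Hypothetical agent invocation enabling the Docker runtime isolator.
# If --image_providers contains docker, docker/runtime must be in --isolation.
mesos-agent \
  --master=zk://master.example.com:2181/mesos \
  --work_dir=/var/lib/mesos \
  --containerizers=mesos \
  --image_providers=docker \
  --isolation=filesystem/linux,docker/runtime
```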

Currently, docker image default Entrypoint, Cmd, Env, and WorkingDir are supported with docker runtime isolator. Users can specify CommandInfo to override the default Entrypoint and Cmd in the image (see below for details). The CommandInfo should be inside of either TaskInfo or ExecutorInfo (depending on whether the task is a command task or uses a custom executor, respectively).

Read more in the Mesos documentation.

// The docker runtime isolator is responsible for preparing mesos
// container by merging runtime configuration specified by user
// and docker image default configuration.

Metrics: none

docker/volume

Allows using Docker volumes within Mesos; see the Mesos documentation for details.

Metrics: none

volume/image

TODO

gpu/nvidia

TODO

namespaces/pid

PID namespaces isolate the process ID number space, meaning that
processes in different PID namespaces can have the same PID.  PID
namespaces allow containers to provide functionality such as
suspending/resuming the set of processes in the container and
migrating the container to a new host while the processes inside the
container maintain the same PIDs.

PIDs in a new PID namespace start at 1, somewhat like a standalone
system, and calls to fork(2), vfork(2), or clone(2) will produce
processes with PIDs that are unique within the namespace.

(from pid_namespaces(7) man page)
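On Linux the effect is easy to see with util-linux's unshare. This one-liner is illustrative only and needs root (or unprivileged user namespaces):

```shell
# The first process forked into a fresh PID namespace sees itself as PID 1,
# and with --mount-proc, ps shows only processes in that namespace.
sudo unshare --pid --fork --mount-proc sh -c 'echo "my pid: $$"; ps ax'
# the shell typically reports "my pid: 1"
```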

Metrics: none

network/cni

TODO

network/port_mapping

TODO
