We want to:
- make sure that CPU usage in one container can't significantly affect CPU availability in a different container, regardless if the other container is below or above its CPU shares
- minimizing false-positive CPU overload notifications
- maximize utilization of unused CPU resources
To this end we would like to find a scheme that progressively gives more freedom in CPU resource consumption to processes that stay below their share of CPU resources (including using more than their allocated shares) while restricting processes that use more than their share of CPU resources, especially if CPU-bound, to use only their allocation.
To avoid performance cliffs, it would likely be better to find a progressive/dynamic solution rather than a step-wise/fixed one.
The key idea is to measure whether a process is CPU bound or not by characterizing typical resource consumption behavior for different types of processes:
- CPU-bound processes are likely to exhaust their CPU quanta and be preempted by the scheduler.
- I/O- or event- bound processes are likely to not exhaust their quanta and block on I/O or some lock; in general they are unlikely to be preempted. At the same time spinning processes may also be spinning around locks (e.g. a process with a memory leak and a concurrent collector), or around broken I/O calls.
Use avg_atom/nr_voluntary_switches/nr_involuntary_switches (or other entries) from /proc/$PID/sched; possibly augmented with knowledge on the number/type of syscalls made by the process.
- Binary solution: define two classes (low_latency, bulk) each with an associated period (e.g. 10ms, 200ms); if Δnr_voluntary_switches > Δnr_involuntary_switches then assign to low latency, otherwise assign to high latency
- Average-runtime solution: measure avg_runtime, i.e. how long the process runs on average before switching (both voluntary and involuntary); set the period to k*(avg_runtime) where k=2
- Percentile-runtime solution: measure runtime_percentile_x, i.e. how long the process runs before switching (both voluntary and involuntary) up to a certain percentile x% (e.g. 90%); set the period to runtime_percentile_x
The key idea is:
- as actual average CPU usage by the container tends to 0, the period quota tends to num_cpu * period_duration
- as actual average CPU usage by the container tends to num_cpu * period_duration, the period quota should tend to the minimum cpu time guaranteed by the CPU shares