@mcastelino
Last active December 17, 2023 09:14
Kubernetes and cgroups Resource Management/Static cpuManagerPolicy/Memory and Resource Isolation & Scheduling

Overview

The goal of this document is to cover all aspects of Kubernetes resource management, including how resources are expressed, constrained and accounted for. This started as a way to ensure that alternate container runtime implementations like Kata Containers behave, from a resource accounting and consumption point of view, in the same manner as runc.

Location of the latest version of this document: https://gist.github.com/mcastelino/b8ce9a70b00ee56036dadd70ded53e9f

If you do not understand cgroups please refer to a quick primer at the bottom of this document. This will help you understand how the resource enforcement actually works.

Kubernetes Resource Management

There are two things to consider.

  • What is enforced
  • What is scheduled/allocatable

Note: Scheduling is based on requests and not limits. Limits are enforced (using cgroups). Requests are used at scheduling (and sometimes enforced by cgroups). This will be important in all the discussion to follow.

Note: Also not all PODs have limits and requests. Not all requests will be guaranteed (see memory requests).

Kubernetes resources vs Node Capacity

Let us say we start with a node with 8 CPUs, 8GB of memory and 40GB of node local storage.

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7

Memory block size:       128M
Total online memory:       8G
Total offline memory:      0B

We want to set aside some resources for the node itself to function. This keeps the node stable, and protects/isolates the system and user services that are managed and scheduled outside of kubernetes from the pods launched by kubernetes.

Kubernetes classifies these into two categories, kube and system.

When we launch kubernetes we set aside resources for kube and system.

Here

  • kube: kubernetes associated components. (kubelet, containerd/cri-o, ?shims?...) which are not explicitly placed in pods.
  • system: all other system daemons and the user processes

This can be done at configuration time using kubeadm configuration.

$ sudo -E kubeadm init --config=./kubeadm.yaml

where kubeadm.yaml:

apiVersion: kubeadm.k8s.io/v1beta1
kind: InitConfiguration
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Allowing for CPU pinning and isolation in case of guaranteed QoS class
cpuManagerPolicy: static
systemReserved:
  cpu: 500m
  memory: 256M
kubeReserved:
  cpu: 500m
  memory: 256M
---
apiVersion: kubeadm.k8s.io/v1beta1
kind: ClusterConfiguration
networking:
  dnsDomain: cluster.local
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/12

Here we set aside 512 MB of memory (256M each for kube and system) and 1 CPU (500m each).

Examining Node Resources

mrcastel@bored-pelinor:~$ kubectl describe node
Name:               bored-pelinor
Roles:              master
...
Capacity:
 cpu:                8
 ephemeral-storage:  40470732Ki
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             8167148Ki
 pods:               110
Allocatable:
 cpu:                7
 ephemeral-storage:  37297826550
 hugepages-1Gi:      0
 hugepages-2Mi:      0
 memory:             7564748Ki
 pods:               110
...

Here you will notice a few things

Capacity

 cpu:                8
 ephemeral-storage:  40470732Ki
 memory:             8167148Ki

Allocatable

 cpu:                7
 ephemeral-storage:  37297826550
 memory:             7564748Ki
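  • The Allocatable has already excluded the 1 CPU and the memory (2 × 256M) we reserved. The memory figure is lower by a further 100 MiB or so, which is consistent with the kubelet also holding back its hard eviction threshold.
  • The Kubernetes scheduler will not schedule a pod on a node if the total pod requests would exceed allocatable.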
  • Limits are not considered when scheduling. They are only used to limit the resource consumption of a pod.
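The Capacity and Allocatable figures above can be reconciled with a little arithmetic. The sketch below uses the values from the `kubectl describe node` output and from kubeadm.yaml. The 100Mi eviction threshold is an assumption (it is the kubelet default for `memory.available`, and is not configured anywhere in this document), and all variable names are illustrative.

```python
# Values copied from the `kubectl describe node` output and kubeadm.yaml above.
KI = 1024

capacity_memory_ki = 8167148       # Capacity: memory
capacity_cpu_milli = 8 * 1000      # Capacity: cpu, in millicores

# systemReserved and kubeReserved: 256M of memory and 500m of CPU each.
# "M" is the decimal suffix, so 256M is 256 * 10**6 bytes.
reserved_memory_bytes = 2 * 256 * 10**6
reserved_cpu_milli = 2 * 500

# ASSUMPTION (not stated in this document): the kubelet default hard
# eviction threshold of 100Mi is subtracted from allocatable as well.
eviction_threshold_bytes = 100 * 1024 * KI

allocatable_memory_ki = (
    capacity_memory_ki * KI - reserved_memory_bytes - eviction_threshold_bytes
) // KI
allocatable_cpu_milli = capacity_cpu_milli - reserved_cpu_milli

print(allocatable_memory_ki)   # 7564748, matches Allocatable: memory
print(allocatable_cpu_milli)   # 7000, i.e. 7 CPUs
```

Under that assumption the result matches the reported Allocatable memory of 7564748Ki exactly.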

Resource scheduling

  • Kubernetes scheduler will allow a pod to land on a node if the requests fit within allocatable.
  • requests sometimes decide the minimum amount of resources the container is guaranteed
  • limits are always enforced so that a container will never exceed its limit
  • limits and requests are both optional. Which of them are present determines the QoS class of the POD.
  • Even though resources are expressed at a container level, the QoS class is set up at a POD level
mrcastel@bored-pelinor:~$ kubectl describe node
...
Non-terminated Pods:         (9 in total)
  Namespace                  Name                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                     ------------  ----------  ---------------  -------------  ---
  default                    burst-2                                  300m (4%)     600m (8%)   400Mi (5%)       600Mi (8%)     155m
  kube-system                coredns-fb8b8dccf-669b2                  100m (1%)     0 (0%)      70Mi (0%)        170Mi (2%)     4h5m
  kube-system                coredns-fb8b8dccf-pjskz                  100m (1%)     0 (0%)      70Mi (0%)        170Mi (2%)     4h5m
  kube-system                etcd-bored-pelinor                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h4m
  kube-system                kube-apiserver-bored-pelinor             250m (3%)     0 (0%)      0 (0%)           0 (0%)         4h4m
  kube-system                kube-controller-manager-bored-pelinor    200m (2%)     0 (0%)      0 (0%)           0 (0%)         4h4m
  kube-system                kube-flannel-ds-amd64-q9cnc              100m (1%)     100m (1%)   50Mi (0%)        50Mi (0%)      3h48m
  kube-system                kube-proxy-csxkh                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h5m
  kube-system                kube-scheduler-bored-pelinor             100m (1%)     0 (0%)      0 (0%)           0 (0%)         4h4m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests     Limits
  --------           --------     ------
  cpu                1150m (16%)  700m (10%)
  memory             590Mi (7%)   990Mi (13%)
  ephemeral-storage  0 (0%)       0 (0%)
  • Here we see the current status of the node w.r.t scheduling.
  • As the status indicates "Total limits may be over 100 percent, i.e., overcommitted.". This has significance later.
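The scheduling rule described above (requests are compared against allocatable, limits are ignored) can be sketched as follows. This is a simplified illustration, not the scheduler's actual code: the function name `fits` and the dictionary layout are hypothetical, and the numbers are the Allocatable and "Allocated resources" totals shown in this document.

```python
MI = 1024 * 1024

def fits(pod_requests, node_requested, allocatable):
    """Return True if the pod's requests fit in what is left of allocatable.

    Limits play no part here: only requests are summed and compared.
    """
    return all(
        node_requested.get(resource, 0) + amount <= allocatable[resource]
        for resource, amount in pod_requests.items()
    )

# Allocatable and already-requested totals from `kubectl describe node`.
allocatable = {"cpu_milli": 7000, "memory_bytes": 7564748 * 1024}
node_requested = {"cpu_milli": 1150, "memory_bytes": 590 * MI}

# A pod requesting 250m CPU and 300Mi of memory fits.
small_pod = {"cpu_milli": 250, "memory_bytes": 300 * MI}
# A hypothetical pod requesting 6 CPUs does not: 1150m + 6000m > 7000m.
large_pod = {"cpu_milli": 6000, "memory_bytes": 300 * MI}

print(fits(small_pod, node_requested, allocatable))
print(fits(large_pod, node_requested, allocatable))
```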

POD Resources

Containers in a POD express their resource needs via requests and limits.

apiVersion: v1
kind: Pod
metadata:
  name: burst
spec:
  containers:
  - name: busybee
    image: busybox
    resources:
      limits:
        cpu: 500m
        memory: "400Mi"
      requests:
        cpu: 250m
        memory: "300Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]

Implementing resource management

kube and system Reserved

From the OpenShift documentation:

Optionally, the node can be made to enforce kube-reserved and system-reserved by specifying those tokens in the enforce-node-allocatable flag. 
If specified, the corresponding --kube-reserved-cgroup or --system-reserved-cgroup needs to be provided. In future releases, the node and container 
runtime will be packaged in a common cgroup separate from system.slice. Until that time, we do not recommend users change the default value of enforce-node-allocatable flag.

Administrators should treat system daemons similar to Guaranteed pods. System daemons can burst within their bounding control groups and this behavior 
needs to be managed as part of cluster deployments. Enforcing system-reserved limits can lead to critical system services being CPU starved or OOM killed 
on the node. The recommendation is to enforce system-reserved only if operators have profiled their nodes exhaustively to determine precise estimates and 
are confident in their ability to recover if any process in that group is OOM killed.

As a result, we strongly recommended that users only enforce node allocatable for pods by default, and set aside appropriate reservations for system 
daemons to maintain overall node reliability.

  • If this is not done, kube and system components can potentially impact guaranteed cpusets

cgroups and CPU constraints

From the Linux Kernel: Documentation/cgroups/cpu.txt

   - cpu.shares: The weight of each group living in the same hierarchy, that
 translates into the amount of CPU it is expected to get. Upon cgroup creation,
 each group gets assigned a default of 1024. The percentage of CPU assigned to
 the cgroup is the value of shares divided by the sum of all shares in all
 cgroups in the same level.
 
  - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for
 bandwidth decisions. This defaults to 100000us or 100ms. Larger periods will
 improve throughput at the expense of latency, since the scheduler will be able
 to sustain a cpu-bound workload for longer. The opposite of true for smaller
 periods. Note that this only affects non-RT tasks that are scheduled by the
 CFS scheduler.
 
- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
  in for the current group will be allowed to run. For instance, if it is set to
  half of cpu_period_us, the cgroup will only be able to peak run for 50 % of
  the time. One should note that this represents aggregate time over all CPUs
  in the system. Therefore, in order to allow full usage of two CPUs, for
  instance, one should set this value to twice the value of cfs_period_us.

cpu.shares

This controls the relative weight of a cgroup, i.e. the minimum share of CPU time it is assured when the CPUs are contended.

Hence kubepods gets 7168 / (7168 + 1024 + 1024) i.e. ~ 78% of the CPU on the system. Open: This is not quite the 7/8 we expected.

All pods are under kubepods, hence fit within the 78%.
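As a worked example of the kernel's rule (shares divided by the sum of all shares at the same level), the sketch below uses the cpu.shares values read from this node for kubepods, user.slice and system.slice. It assumes these three are the only busy cgroups at that level.

```python
# cpu.shares of the sibling cgroups directly under /sys/fs/cgroup/cpu
shares = {"kubepods": 7168, "user.slice": 1024, "system.slice": 1024}

total = sum(shares.values())
fractions = {name: value / total for name, value in shares.items()}

for name, fraction in fractions.items():
    # Fraction of CPU time assured to each cgroup when all of them are busy
    print(f"{name}: {fraction:.1%}")
```

With these values kubepods is assured 7/9 of the CPU time under full contention, and each slice 1/9.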

cpu.cfs_quota_us

This controls the upper bound of CPU usage

At the top levels of the hierarchy (root, kubepods, system.slice, user.slice) this is set to -1, which means unbounded. Hence all bounds are enforced at pod level or, in some cases, at container level.

CPU Resource Management

  • Our assumption is that, for kubepods not to be impacted by kubeReserved and systemReserved, the kubepods hierarchy should be created by the admin, who must ensure that the cpusets do not overlap. Otherwise guaranteed pods can be impacted by system daemons and kube components.
# cat /sys/fs/cgroup/cpu/cpu.shares
1024
# cat /sys/fs/cgroup/cpu/kubepods/cpu.shares
7168
# cat /sys/fs/cgroup/cpu/user.slice/cpu.shares
1024
# cat /sys/fs/cgroup/cpu/system.slice/cpu.shares
1024
  • This means that under heavy load the kubepods are assured of getting CPU time. Hence pods will not starve.
  • This also means that kubepods get 7168/(7168+1024+1024), i.e. about 78% (7/9 rather than 7/8), of the CPU scheduling time when the node is over committed and all three cgroups are busy.
root@bored-pelinor:/sys/fs/cgroup/cpu# cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1
# cat /sys/fs/cgroup/cpu/kubepods/cpu.cfs_quota_us
-1
# cat /sys/fs/cgroup/cpu/system.slice/cpu.cfs_quota_us
-1
# cat /sys/fs/cgroup/cpu/user.slice/cpu.cfs_quota_us
-1
  • This means that the kubepods, system and user slices have been set up with no upper bounds.
  • Hence they can consume as much CPU as needed when the system is not over-committed (unless individual processes are confined using child cgroups)

Memory Resource Management

#cat /sys/fs/cgroup/memory/memory.limit_in_bytes
9223372036854771712
#cat /sys/fs/cgroup/memory/user.slice/memory.limit_in_bytes
9223372036854771712
#cat /sys/fs/cgroup/memory/system.slice/memory.limit_in_bytes
9223372036854771712
#cat /sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes
7851159552
  • The kubepods cgroup is limited to 7851159552 bytes (about 7487 MiB), which is the node capacity minus the memory we set aside for kube and system.
  • This means that even though memory requests are not enforced at the cgroup level (as you will soon see), this top-level limit ensures that pods do not exceed what is allocated to pods at the system level.
  • This also means that pods are free to grow beyond their requests, until they hit their limits, as long as the total memory consumption of all pods does not exceed the total memory allocated to kubepods.
  • This ensures that pods will never exceed the maximum allocation
    • This is why kubernetes memory management works fine even though soft limits are not set up
  • However, it does mean that the control plane pods may get killed indiscriminately if workload pods grow beyond their requests
    • Open: How can this be handled properly?
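The kubepods limit can be reproduced from numbers already shown in this document. This is a small sketch with illustrative variable names. It assumes the limit is simply node capacity minus systemReserved and kubeReserved memory, which is what the observed value suggests.

```python
# Capacity: memory from `kubectl describe node`, converted from Ki to bytes
capacity_bytes = 8167148 * 1024

# systemReserved + kubeReserved memory: 256M each (decimal megabytes)
reserved_bytes = 2 * 256 * 10**6

kubepods_limit_bytes = capacity_bytes - reserved_bytes
kubepods_limit_mib = kubepods_limit_bytes // (1024 * 1024)

print(kubepods_limit_bytes)  # 7851159552, the kubepods memory.limit_in_bytes
print(kubepods_limit_mib)    # 7487
```

Note that, unlike Allocatable, this value does not appear to subtract any eviction threshold.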

Note: Kubernetes does not use memory soft limits. Hence requests are used strictly by kubernetes to decide scheduling. If the user creates a pod whose actual usage exceeds its requests (i.e. is closer to the limit than to the request), the actual memory usage on the node can exceed what was scheduled. This will result in the kernel OOM killing some process.

Open: All things being equal, which process will the kernel OOM kill? The kernel documentation says it will kill the bulkiest task.

From the Kernel documentation:

2.5 Reclaim

Each cgroup maintains a per cgroup LRU which has the same structure as
global VM. When a cgroup goes over its limit, we first try
to reclaim memory from the cgroup so as to make space for the new
pages that the cgroup has touched. If the reclaim is unsuccessful,
an OOM routine is invoked to select and kill the bulkiest task in the
cgroup. (See 10. OOM Control below.)

2.3 Shared Page Accounting

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).

cpuset management

#cat /sys/fs/cgroup/cpuset/cpuset.cpus
0-7
#cat /sys/fs/cgroup/cpuset/kubepods/cpuset.cpus
0-7
  • Even though we have requested a static cpu policy, the kubepods cgroup today still includes the full set of CPUs.
  • Note: This is true even when specifying integer CPU requests for kube and system.
  • Open: If we created separate cpu hierarchies for kube and system, we need to figure out how to set up the cpuset properly for kubepods.

POD Resource Management

PODs can be of three QoS classes with variants

  • Best Effort
  • Burstable
    • No limit
    • With limit
  • Guaranteed
    • static policy
    • non static
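How requests and limits map to the three QoS classes can be sketched as below. This is a simplified illustration and not Kubernetes source code: the function name `qos_class` and the dictionary layout are hypothetical, only cpu and memory are considered, and quantities are compared as plain strings (so "500m" and "0.5" would not be treated as equal).

```python
def qos_class(containers):
    """Classify a pod from its containers' requests and limits.

    containers: list of dicts with optional "requests" and "limits" dicts.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A request that is omitted defaults to the corresponding limit.
        requests = {**limits, **container.get("requests", {})}
        if limits or requests:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, with request == limit.
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

best_effort = [{}, {}]
burstable = [{"limits": {"cpu": "500m", "memory": "400Mi"},
              "requests": {"cpu": "250m", "memory": "300Mi"}}]
guaranteed = [{"limits": {"cpu": "2", "memory": "400Mi"}},
              {"limits": {"cpu": "1", "memory": "200Mi"}}]

for pod in (best_effort, burstable, guaranteed):
    print(qos_class(pod))
```

The static versus non-static variants of Guaranteed are not a separate QoS class. They depend on the kubelet's cpuManagerPolicy and on whether the CPU quantities are integers.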

What happens

Guaranteed pods

  • may have a cpuset
  • will have cpu shares set
  • will have quota set (the cpu limit converted to CFS time: limit in CPUs × cfs_period_us)

Hence they will have an upper bound on performance (and upper bound == lower bound). Upper bound is also guaranteed.

Even though the limits may seem lower than the resources assigned to other types of pods, in reality, because they do not contend with any other processes on the CPUs dedicated to them, they perform better than the pods in the common pool.

  • The only source of interference to these pods is kube and system components.
  • If this interference is large it can potentially impact the performance of these pods.

Burstable pods with limits

  • may have cpuset (which will be the shared kubepod pool which excludes any cpus given to any guaranteed pods)
  • will have cpu shares set (which determines guaranteed lower bound)
  • will have quota set (upper bound > lower bound; and upper bound is not guaranteed due to CFS implementation)

Burstable pods without limits

  • may have cpuset (which will be the shared kubepod pool which excludes any cpus given to any guaranteed pods)
  • will have cpu shares set (which determines guaranteed lower bound)
  • quota set to -1 (no upper bound)

These are a good choice for any workload as they give you a degree of assured CPU performance, while still being unlimited when capacity is available.

Best effort

  • may have cpuset (which will be the shared kubepod pool which excludes any cpus given to any guaranteed pods)
  • will have cpu shares set to 2. So a minimal lower bound of performance.
  • quota set to -1 (no upper bound)

Guaranteed Example

apiVersion: v1
kind: Pod
metadata:
  name: guar-2s
spec:
  containers:
  - name: busybee
    image: busybox
    resources:
      limits:
        cpu: 2
        memory: "400Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]
  - name: busybum
    image: busybox
    resources:
      limits:
        cpu: 1
        memory: "200Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]
$kubectl get pod -o=custom-columns=NAME:.metadata.name,UID:.metadata.uid
NAME      UID
guar-2s   99b66879-565f-11e9-9de2-525400123456
#cat /sys/fs/cgroup/cpuset/kubepods/cpuset.cpus
0-7
#cat /sys/fs/cgroup/cpuset/kubepods/pod99b66879-565f-11e9-9de2-525400123456/cpuset.cpus
0-7
#cat /sys/fs/cgroup/cpuset/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2882ebb5acb99f5b09ee41954720101a954847151951ff56e0e3a919a2044a5a/cpuset.cpus
0-7
#cat /sys/fs/cgroup/cpuset/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2af8973b96abe2e023eb343cb87989a33c08b61f1bf2eead306da8f15eb74026/cpuset.cpus
3
#cat /sys/fs/cgroup/cpuset/kubepods/pod99b66879-565f-11e9-9de2-525400123456/a547d6c57f8b08a493156febb9b7071320ce1d87631e2e7bea142b7ac9351f23/cpuset.cpus
1-2
#cat /sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/cpu.shares
3072
#cat /sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2882ebb5acb99f5b09ee41954720101a954847151951ff56e0e3a919a2044a5a/cpu.shares
2
#cat /sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2af8973b96abe2e023eb343cb87989a33c08b61f1bf2eead306da8f15eb74026/cpu.shares
1024
#cat /sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/a547d6c57f8b08a493156febb9b7071320ce1d87631e2e7bea142b7ac9351f23/cpu.shares
2048
#cat /sys/fs/cgroup/cpu/kubepods/cpu.cfs_quota_us
-1
#cat /sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/cpu.cfs_quota_us
300000
#cat /sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2882ebb5acb99f5b09ee41954720101a954847151951ff56e0e3a919a2044a5a/cpu.cfs_quota_us
-1
#cat /sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2af8973b96abe2e023eb343cb87989a33c08b61f1bf2eead306da8f15eb74026/cpu.cfs_quota_us
100000
#cat /sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/a547d6c57f8b08a493156febb9b7071320ce1d87631e2e7bea142b7ac9351f23/cpu.cfs_quota_us
200000

Here you will see the following

  • the pod itself is not pinned (this is very important to note)
  • the pause container is not pinned
  • the pause container is assured only 2 cpu shares but is unbounded (quota -1)
  • container1 is pinned to CPU 3 and assured 1024 cpu shares, i.e. 1024/(2+1024+2048) of the pod's CPU time under contention. The container shares add up to 3074, not exactly the 3072 set at pod level, but close. It is also bounded to 1 CPU's worth of scheduling time (quota 100000).
  • container2 is pinned to CPUs 1 and 2, assured 2048 cpu shares, and bounded to 2 CPUs' worth of scheduling time (quota 200000).
  • Hence the pause container (i.e. the sandbox) is bounded only by the pod's total cfs_quota. However, the sandbox/pause itself is not pinned.
  • Given that kubepods is effectively unbounded, the quota is enforced at pod level: 300000, i.e. 3 CPUs' worth of time per 100000us period.
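The cpu.shares and cpu.cfs_quota_us values observed above are consistent with a simple conversion from CPU quantities. The sketch below is inferred from those observations. The function and constant names are illustrative, and the floor of 2 shares and the 100000us period are taken from the values seen on this node.

```python
SHARES_PER_CPU = 1024    # cpu.shares given to one full CPU
MIN_SHARES = 2           # lowest cpu.shares value seen on this node
CFS_PERIOD_US = 100000   # default cpu.cfs_period_us (100ms)

def milli_cpu_to_shares(milli_cpu):
    """Convert a CPU request in millicores to a cpu.shares weight."""
    return max(MIN_SHARES, milli_cpu * SHARES_PER_CPU // 1000)

def milli_cpu_to_quota(milli_cpu, period_us=CFS_PERIOD_US):
    """Convert a CPU limit in millicores to cpu.cfs_quota_us (-1 means no limit)."""
    if milli_cpu == 0:
        return -1
    return milli_cpu * period_us // 1000

# The guar-2s pod: containers with 2 CPUs and 1 CPU, pod total of 3 CPUs.
for label, milli in (("pod", 3000), ("2 CPU container", 2000),
                     ("1 CPU container", 1000), ("pause", 0)):
    print(label, milli_cpu_to_shares(milli), milli_cpu_to_quota(milli))
```

For the pod this gives 3072 shares and a quota of 300000, and for the pause container 2 shares and a quota of -1, matching the cgroup files read above.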

System Level View

Let us create pods of each type

Best Effort

apiVersion: v1
kind: Pod
metadata:
  name: beff-2
spec:
  containers:
  - name: busybee
    image: busybox
    command: ["md5sum"]
    args: ["/dev/urandom"]
  - name: busybum
    image: busybox
    command: ["md5sum"]
    args: ["/dev/urandom"]
    

Burstable

apiVersion: v1
kind: Pod
metadata:
  name: burst-2
spec:
  containers:
  - name: busybee
    image: busybox
    resources:
      limits:
        cpu: 500m
        memory: "400Mi"
      requests:
        cpu: 250m
        memory: "300Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]
  - name: busybum
    image: busybox
    resources:
      limits:
        cpu: 100m
        memory: "200Mi"
      requests:
        cpu: 50m
        memory: "100Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]

Guaranteed with non integer CPUs

apiVersion: v1
kind: Pod
metadata:
  name: guar-2
spec:
  containers:
  - name: busybee
    image: busybox
    resources:
      limits:
        cpu: 400m
        memory: "400Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]
  - name: busybum
    image: busybox
    resources:
      limits:
        cpu: 200m
        memory: "200Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]

Guaranteed with integer CPUs

apiVersion: v1
kind: Pod
metadata:
  name: guar-2s
spec:
  containers:
  - name: busybee
    image: busybox
    resources:
      limits:
        cpu: 2
        memory: "400Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]
  - name: busybum
    image: busybox
    resources:
      limits:
        cpu: 1
        memory: "200Mi"
    command: ["md5sum"]
    args: ["/dev/urandom"]

Resource and PODs

mrcastel@bored-pelinor:~$ kubectl get po
NAME      READY   STATUS    RESTARTS   AGE
beff-2    2/2     Running   0          4m50s
burst-2   2/2     Running   0          12m
guar-2    2/2     Running   0          4m38s
guar-2s   2/2     Running   0          53m
mrcastel@bored-pelinor:~$ kubectl get pod --all-namespaces -o=custom-columns=NAME:.metadata.name,UID:.metadata.uid
NAME                                    UID
beff-2                                  55148103-5666-11e9-9de2-525400123456
burst-2                                 41a37c57-5665-11e9-9de2-525400123456
guar-2                                  5c061c35-5666-11e9-9de2-525400123456
guar-2s                                 99b66879-565f-11e9-9de2-525400123456
coredns-fb8b8dccf-ng6st                 f3f5916c-565e-11e9-9de2-525400123456
coredns-fb8b8dccf-tctwp                 f3f524b6-565e-11e9-9de2-525400123456
etcd-bored-pelinor                      16588327-565f-11e9-9de2-525400123456
kube-apiserver-bored-pelinor            16f110e8-565f-11e9-9de2-525400123456
kube-controller-manager-bored-pelinor   11943a2d-565f-11e9-9de2-525400123456
kube-flannel-ds-amd64-gtwrh             45e519f1-565f-11e9-9de2-525400123456
kube-proxy-vk6j9                        f3f04c5b-565e-11e9-9de2-525400123456
kube-scheduler-bored-pelinor            1b1d33da-565f-11e9-9de2-525400123456

CPU Sets

We see that

  • Only the Guaranteed containers promise not to trample on each other, as evidenced by the containers in those pods being pinned
    • The other Guaranteed pod is set to 0,4-7 which excludes 3 and 1-2
  • Best effort and burstable containers are excluded from the pinned sets
  • However kube components can still land on the pinned cpus (but the scheduling quota should mitigate that effect)
#shopt -s globstar
#for i in `ls /sys/fs/cgroup/cpuset/kubepods/**/cpuset.cpus`; do echo $i && cat $i; done
/sys/fs/cgroup/cpuset/kubepods/besteffort/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/02f5cf17015d31beb2462857e1773754221712389d17df5a3a1e636bc04daaac/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/1dad0913e3a373e1742ef9dbc707fbcaab65fa1d394d5dc5c7df6c6d36569db7/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/6d50e7ca6b84fbd4195155e812e04e7b5976666818fe69e6d28832e63fae639f/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/6770f6eccae402706a68b71d9a593cc9f64aa2961419f2bfad4f57b265ded453/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/bab92c486dd6626714666e28e41672dc2250302699d687d7850b60550b8f03ad/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/b4c11a95aa1b5ab85649e23a2905cb1b893c4aa3b0201e285e4fcef647bfe584/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/kube-proxy/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/31e7f4b8bfe5a23919d507c8d683ea97a83698f3e5f4abc13e4f5b401a26f3f1/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/62c5bf39f9bbd6dacb93b0a34e78f6c8db2ff0962bb90ad9ee564fb50b9c5554/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/f876aa7433c71aec840afc385a39e1f3c1541c4521e148c3771e8679f040b788/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/1c3e8063c7bba7cf0e4b6e777704e56e5bdcbb15a794d36423a4734e6d5cb751/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/a0ff4137e61acf00ed6dc94212dbfba5161a7cb5fe20217e2cfaa1ba91474ae8/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/55770fabdef138519c8e013a2150630a60bf7dc0e73d5899ca99e72124f23434/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cd1698af7a1024920e12633350d98d3f45f83d775eb9ada5c76319cfe0fb9573/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/22b075e3e9b9e8ad3bc7b5f2fb359de6ccbc751988e9b9cdb82c77a284de9847/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/ea20f04b19b1ce96cdd48f88a6a8a588a4352cd8a2fd7c6490d56b139f6d7f39/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/4db061942925e99a1c687c5195fca2114e39347b22bc7cfcba51438de4efa31f/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/fde98ded39d10e37a883c8e08fadc8e541b1793f44548f0b3e028b6c1ddd9034/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/c419fde416d40bf1d89c8dbe374b229ef3a48dbce5cf808c7caf5282254a4ece/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cddf60599b04a5ff107254047bceea0608cf59a82de9626f6f4b9c2778705be6/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/75f8d2d2358fe8b7516f29472bb11378b108d0567dc820d0fc757469c2c9ca0f/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/e1a54b680bdef8d6ee1acf1fde4a2a35db950cb5ad106d11f694fc10841bdcf0/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/12d48c76f025ea5b0a2f3cbab6765b75c5666ef74528da4ef7d8c4260a075faa/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/9a9a45ea9730ef8a313070103f2701a4d1942c90eb985938d77b968f5cac460e/cpuset.cpus
0,4-7
/sys/fs/cgroup/cpuset/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/de7559571058b943737889052c3ebd48cba281b7ab3b0b2327825d8256e570da/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2882ebb5acb99f5b09ee41954720101a954847151951ff56e0e3a919a2044a5a/cpuset.cpus
0-7
/sys/fs/cgroup/cpuset/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2af8973b96abe2e023eb343cb87989a33c08b61f1bf2eead306da8f15eb74026/cpuset.cpus
3
/sys/fs/cgroup/cpuset/kubepods/pod99b66879-565f-11e9-9de2-525400123456/a547d6c57f8b08a493156febb9b7071320ce1d87631e2e7bea142b7ac9351f23/cpuset.cpus
1-2
/sys/fs/cgroup/cpuset/kubepods/pod99b66879-565f-11e9-9de2-525400123456/cpuset.cpus
0-7
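The `0,4-7` value seen for the non-pinned containers is what remains of `0-7` once the CPUs pinned to the Guaranteed integer-CPU containers (`3` and `1-2`) are removed. The sketch below reproduces that set arithmetic. The helper names are illustrative, and this is not the CPU manager's implementation.

```python
def parse_cpu_list(text):
    """Parse a cpuset string such as "0,4-7" into a set of CPU ids."""
    cpus = set()
    for part in text.split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus

def format_cpu_list(cpus):
    """Format a set of CPU ids back into cpuset notation."""
    ordered = sorted(cpus)
    ranges = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        # Extend the current range while the next CPU id is consecutive.
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)

all_cpus = parse_cpu_list("0-7")                      # kubepods cpuset.cpus
pinned = parse_cpu_list("3") | parse_cpu_list("1-2")  # guar-2s containers
shared_pool = format_cpu_list(all_cpus - pinned)

print(shared_pool)  # 0,4-7
```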

CPU Shares

Here we see that

  • Best effort pods all get assured only 2 CPU shares, and this is set at POD level
  • Burstable and Guaranteed PODs get assured their minimum cpu share, and this is set at POD level.
    • And containers within them further split this share.
#shopt -s globstar
#for i in `ls /sys/fs/cgroup/cpu/kubepods/**/cpu.shares`; do echo $i && cat $i; done
/sys/fs/cgroup/cpu/kubepods/besteffort/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/02f5cf17015d31beb2462857e1773754221712389d17df5a3a1e636bc04daaac/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/1dad0913e3a373e1742ef9dbc707fbcaab65fa1d394d5dc5c7df6c6d36569db7/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/6d50e7ca6b84fbd4195155e812e04e7b5976666818fe69e6d28832e63fae639f/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/6770f6eccae402706a68b71d9a593cc9f64aa2961419f2bfad4f57b265ded453/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/bab92c486dd6626714666e28e41672dc2250302699d687d7850b60550b8f03ad/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/b4c11a95aa1b5ab85649e23a2905cb1b893c4aa3b0201e285e4fcef647bfe584/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/kube-proxy/cpu.shares
1024
/sys/fs/cgroup/cpu/kubepods/burstable/cpu.shares
1075
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/31e7f4b8bfe5a23919d507c8d683ea97a83698f3e5f4abc13e4f5b401a26f3f1/cpu.shares
256
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/62c5bf39f9bbd6dacb93b0a34e78f6c8db2ff0962bb90ad9ee564fb50b9c5554/cpu.shares
51
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/cpu.shares
307
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/f876aa7433c71aec840afc385a39e1f3c1541c4521e148c3771e8679f040b788/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/1c3e8063c7bba7cf0e4b6e777704e56e5bdcbb15a794d36423a4734e6d5cb751/cpu.shares
256
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/a0ff4137e61acf00ed6dc94212dbfba5161a7cb5fe20217e2cfaa1ba91474ae8/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/cpu.shares
256
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/55770fabdef138519c8e013a2150630a60bf7dc0e73d5899ca99e72124f23434/cpu.shares
204
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cd1698af7a1024920e12633350d98d3f45f83d775eb9ada5c76319cfe0fb9573/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cpu.shares
204
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/22b075e3e9b9e8ad3bc7b5f2fb359de6ccbc751988e9b9cdb82c77a284de9847/cpu.shares
102
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/cpu.shares
102
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/ea20f04b19b1ce96cdd48f88a6a8a588a4352cd8a2fd7c6490d56b139f6d7f39/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/4db061942925e99a1c687c5195fca2114e39347b22bc7cfcba51438de4efa31f/cpu.shares
102
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/cpu.shares
102
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/fde98ded39d10e37a883c8e08fadc8e541b1793f44548f0b3e028b6c1ddd9034/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/c419fde416d40bf1d89c8dbe374b229ef3a48dbce5cf808c7caf5282254a4ece/cpu.shares
102
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cddf60599b04a5ff107254047bceea0608cf59a82de9626f6f4b9c2778705be6/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cpu.shares
102
/sys/fs/cgroup/cpu/kubepods/cpu.shares
7168
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/75f8d2d2358fe8b7516f29472bb11378b108d0567dc820d0fc757469c2c9ca0f/cpu.shares
102
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/cpu.shares
102
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/e1a54b680bdef8d6ee1acf1fde4a2a35db950cb5ad106d11f694fc10841bdcf0/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/12d48c76f025ea5b0a2f3cbab6765b75c5666ef74528da4ef7d8c4260a075faa/cpu.shares
204
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/9a9a45ea9730ef8a313070103f2701a4d1942c90eb985938d77b968f5cac460e/cpu.shares
409
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/cpu.shares
614
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/de7559571058b943737889052c3ebd48cba281b7ab3b0b2327825d8256e570da/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2882ebb5acb99f5b09ee41954720101a954847151951ff56e0e3a919a2044a5a/cpu.shares
2
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2af8973b96abe2e023eb343cb87989a33c08b61f1bf2eead306da8f15eb74026/cpu.shares
1024
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/a547d6c57f8b08a493156febb9b7071320ce1d87631e2e7bea142b7ac9351f23/cpu.shares
2048
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/cpu.shares
3072
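
The values in this dump are consistent with the usual mapping from a CPU request in millicores to cgroup v1 `cpu.shares`: 1024 shares per CPU, with a floor of 2. The following is a minimal sketch of that arithmetic, not kubelet code; the function name `milli_to_shares` is made up for illustration.

```bash
# Sketch: convert a CPU request in millicores to a cgroup v1 cpu.shares value.
# Assumes 1024 shares per CPU and a minimum of 2 shares (the value seen on
# besteffort and pause cgroups in the dump above).
milli_to_shares() {
  local milli=$1
  local shares=$(( milli * 1024 / 1000 ))
  if (( shares < 2 )); then
    shares=2
  fi
  echo "$shares"
}

# Requests that appear in this document, plus 7000m for the kubepods root.
for m in 0 50 100 250 300 600 3000 7000; do
  printf '%5sm -> %s\n' "$m" "$(milli_to_shares "$m")"
done
```

For example, 300m gives 307 (the burst-2 pod cgroup), 3000m gives 3072 (the guar-2s pod cgroup), and 7000m gives the 7168 seen at `/sys/fs/cgroup/cpu/kubepods/cpu.shares`.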

CPU Quota

Here we see that

  • Some burstable pods have a quota (i.e. an upper bound) set
    • the quota is set at the pod level and then split across the containers; the period is the same across all cgroups (as seen below)
  • pause containers of guaranteed pods have no quota of their own (-1), so they are limited only by the parent pod cgroup (assumption to validate)
#shopt -s globstar
#for i in `ls /sys/fs/cgroup/cpu/kubepods/**/cpu.cfs_quota_us`; do echo $i && cat $i; done
/sys/fs/cgroup/cpu/kubepods/besteffort/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/02f5cf17015d31beb2462857e1773754221712389d17df5a3a1e636bc04daaac/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/1dad0913e3a373e1742ef9dbc707fbcaab65fa1d394d5dc5c7df6c6d36569db7/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/6d50e7ca6b84fbd4195155e812e04e7b5976666818fe69e6d28832e63fae639f/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/6770f6eccae402706a68b71d9a593cc9f64aa2961419f2bfad4f57b265ded453/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/bab92c486dd6626714666e28e41672dc2250302699d687d7850b60550b8f03ad/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/b4c11a95aa1b5ab85649e23a2905cb1b893c4aa3b0201e285e4fcef647bfe584/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/kube-proxy/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/31e7f4b8bfe5a23919d507c8d683ea97a83698f3e5f4abc13e4f5b401a26f3f1/cpu.cfs_quota_us
50000
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/62c5bf39f9bbd6dacb93b0a34e78f6c8db2ff0962bb90ad9ee564fb50b9c5554/cpu.cfs_quota_us
10000
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/cpu.cfs_quota_us
60000
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/f876aa7433c71aec840afc385a39e1f3c1541c4521e148c3771e8679f040b788/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/1c3e8063c7bba7cf0e4b6e777704e56e5bdcbb15a794d36423a4734e6d5cb751/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/a0ff4137e61acf00ed6dc94212dbfba5161a7cb5fe20217e2cfaa1ba91474ae8/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/55770fabdef138519c8e013a2150630a60bf7dc0e73d5899ca99e72124f23434/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cd1698af7a1024920e12633350d98d3f45f83d775eb9ada5c76319cfe0fb9573/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/22b075e3e9b9e8ad3bc7b5f2fb359de6ccbc751988e9b9cdb82c77a284de9847/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/ea20f04b19b1ce96cdd48f88a6a8a588a4352cd8a2fd7c6490d56b139f6d7f39/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/4db061942925e99a1c687c5195fca2114e39347b22bc7cfcba51438de4efa31f/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/fde98ded39d10e37a883c8e08fadc8e541b1793f44548f0b3e028b6c1ddd9034/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/c419fde416d40bf1d89c8dbe374b229ef3a48dbce5cf808c7caf5282254a4ece/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cddf60599b04a5ff107254047bceea0608cf59a82de9626f6f4b9c2778705be6/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/75f8d2d2358fe8b7516f29472bb11378b108d0567dc820d0fc757469c2c9ca0f/cpu.cfs_quota_us
10000
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/cpu.cfs_quota_us
10000
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/e1a54b680bdef8d6ee1acf1fde4a2a35db950cb5ad106d11f694fc10841bdcf0/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/12d48c76f025ea5b0a2f3cbab6765b75c5666ef74528da4ef7d8c4260a075faa/cpu.cfs_quota_us
20000
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/9a9a45ea9730ef8a313070103f2701a4d1942c90eb985938d77b968f5cac460e/cpu.cfs_quota_us
40000
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/cpu.cfs_quota_us
60000
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/de7559571058b943737889052c3ebd48cba281b7ab3b0b2327825d8256e570da/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2882ebb5acb99f5b09ee41954720101a954847151951ff56e0e3a919a2044a5a/cpu.cfs_quota_us
-1
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2af8973b96abe2e023eb343cb87989a33c08b61f1bf2eead306da8f15eb74026/cpu.cfs_quota_us
100000
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/a547d6c57f8b08a493156febb9b7071320ce1d87631e2e7bea142b7ac9351f23/cpu.cfs_quota_us
200000
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/cpu.cfs_quota_us
300000
#shopt -s globstar
#for i in `ls /sys/fs/cgroup/cpu/kubepods/**/cpu.cfs_period_us`; do echo $i && cat $i; done
/sys/fs/cgroup/cpu/kubepods/besteffort/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/02f5cf17015d31beb2462857e1773754221712389d17df5a3a1e636bc04daaac/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/1dad0913e3a373e1742ef9dbc707fbcaab65fa1d394d5dc5c7df6c6d36569db7/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/6d50e7ca6b84fbd4195155e812e04e7b5976666818fe69e6d28832e63fae639f/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/6770f6eccae402706a68b71d9a593cc9f64aa2961419f2bfad4f57b265ded453/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/bab92c486dd6626714666e28e41672dc2250302699d687d7850b60550b8f03ad/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/b4c11a95aa1b5ab85649e23a2905cb1b893c4aa3b0201e285e4fcef647bfe584/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/kube-proxy/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/31e7f4b8bfe5a23919d507c8d683ea97a83698f3e5f4abc13e4f5b401a26f3f1/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/62c5bf39f9bbd6dacb93b0a34e78f6c8db2ff0962bb90ad9ee564fb50b9c5554/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/f876aa7433c71aec840afc385a39e1f3c1541c4521e148c3771e8679f040b788/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/1c3e8063c7bba7cf0e4b6e777704e56e5bdcbb15a794d36423a4734e6d5cb751/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/a0ff4137e61acf00ed6dc94212dbfba5161a7cb5fe20217e2cfaa1ba91474ae8/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/55770fabdef138519c8e013a2150630a60bf7dc0e73d5899ca99e72124f23434/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cd1698af7a1024920e12633350d98d3f45f83d775eb9ada5c76319cfe0fb9573/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/22b075e3e9b9e8ad3bc7b5f2fb359de6ccbc751988e9b9cdb82c77a284de9847/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/ea20f04b19b1ce96cdd48f88a6a8a588a4352cd8a2fd7c6490d56b139f6d7f39/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/4db061942925e99a1c687c5195fca2114e39347b22bc7cfcba51438de4efa31f/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/fde98ded39d10e37a883c8e08fadc8e541b1793f44548f0b3e028b6c1ddd9034/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/c419fde416d40bf1d89c8dbe374b229ef3a48dbce5cf808c7caf5282254a4ece/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cddf60599b04a5ff107254047bceea0608cf59a82de9626f6f4b9c2778705be6/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/75f8d2d2358fe8b7516f29472bb11378b108d0567dc820d0fc757469c2c9ca0f/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/e1a54b680bdef8d6ee1acf1fde4a2a35db950cb5ad106d11f694fc10841bdcf0/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/12d48c76f025ea5b0a2f3cbab6765b75c5666ef74528da4ef7d8c4260a075faa/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/9a9a45ea9730ef8a313070103f2701a4d1942c90eb985938d77b968f5cac460e/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/de7559571058b943737889052c3ebd48cba281b7ab3b0b2327825d8256e570da/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2882ebb5acb99f5b09ee41954720101a954847151951ff56e0e3a919a2044a5a/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2af8973b96abe2e023eb343cb87989a33c08b61f1bf2eead306da8f15eb74026/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/a547d6c57f8b08a493156febb9b7071320ce1d87631e2e7bea142b7ac9351f23/cpu.cfs_period_us
100000
/sys/fs/cgroup/cpu/kubepods/pod99b66879-565f-11e9-9de2-525400123456/cpu.cfs_period_us
100000
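
To relate the quota and period dumps above to the CPU limits that Kubernetes reports, divide the quota by the period. This is a minimal sketch of that arithmetic; `quota_to_milli` is a hypothetical helper name, and the default period of 100000 µs is taken from the dump above.

```bash
# Sketch: convert cpu.cfs_quota_us / cpu.cfs_period_us into millicores.
# A quota of -1 means no upper bound is set on that cgroup.
quota_to_milli() {
  local quota=$1
  local period=${2:-100000}   # period observed on every cgroup above
  if (( quota < 0 )); then
    echo "unlimited"
  else
    echo "$(( quota * 1000 / period ))m"
  fi
}

# Quota values taken from the dump above.
for q in -1 10000 60000 300000; do
  printf '%7s -> %s\n' "$q" "$(quota_to_milli "$q")"
done
```

A quota of 60000 maps to 600m, which matches the 600m CPU limit of the burst-2 and guar-2 pods, and 300000 maps to 3000m, the limit of guar-2s.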

Memory

  • Best-effort pods have no limits
  • Some burstable pods have pod-level memory limits set (as well as container-level limits)
    • The pause container is limited by the pod-level memory limit
  • Guaranteed pods have pod-level limits
    • The pause container is limited by the pod-level memory limit
#for i in `ls /sys/fs/cgroup/memory/kubepods/**/memory.limit_in_bytes`; do echo $i && cat $i; done
/sys/fs/cgroup/memory/kubepods/besteffort/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/02f5cf17015d31beb2462857e1773754221712389d17df5a3a1e636bc04daaac/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/1dad0913e3a373e1742ef9dbc707fbcaab65fa1d394d5dc5c7df6c6d36569db7/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/6d50e7ca6b84fbd4195155e812e04e7b5976666818fe69e6d28832e63fae639f/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/pod55148103-5666-11e9-9de2-525400123456/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/6770f6eccae402706a68b71d9a593cc9f64aa2961419f2bfad4f57b265ded453/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/bab92c486dd6626714666e28e41672dc2250302699d687d7850b60550b8f03ad/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/pod69b754681cf0cf1bf12010694a10f2cb/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/b4c11a95aa1b5ab85649e23a2905cb1b893c4aa3b0201e285e4fcef647bfe584/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/kube-proxy/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/e9fef3c19e2966fb2870e99fcc682ac5a350fdf847ee415f6f3aaf88bbfbc17a/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/besteffort/podf3f04c5b-565e-11e9-9de2-525400123456/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/31e7f4b8bfe5a23919d507c8d683ea97a83698f3e5f4abc13e4f5b401a26f3f1/memory.limit_in_bytes
419430400
/sys/fs/cgroup/memory/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/62c5bf39f9bbd6dacb93b0a34e78f6c8db2ff0962bb90ad9ee564fb50b9c5554/memory.limit_in_bytes
209715200
/sys/fs/cgroup/memory/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/f876aa7433c71aec840afc385a39e1f3c1541c4521e148c3771e8679f040b788/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod41a37c57-5665-11e9-9de2-525400123456/memory.limit_in_bytes
629145600
/sys/fs/cgroup/memory/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/1c3e8063c7bba7cf0e4b6e777704e56e5bdcbb15a794d36423a4734e6d5cb751/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/a0ff4137e61acf00ed6dc94212dbfba5161a7cb5fe20217e2cfaa1ba91474ae8/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod439651677ca7971bec7b2a9a0df5a512/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/55770fabdef138519c8e013a2150630a60bf7dc0e73d5899ca99e72124f23434/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/cd1698af7a1024920e12633350d98d3f45f83d775eb9ada5c76319cfe0fb9573/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod54146492ed90bfa147f56609eee8005a/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/22b075e3e9b9e8ad3bc7b5f2fb359de6ccbc751988e9b9cdb82c77a284de9847/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/ea20f04b19b1ce96cdd48f88a6a8a588a4352cd8a2fd7c6490d56b139f6d7f39/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/pod58272442e226c838b193bbba4c44091e/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/4db061942925e99a1c687c5195fca2114e39347b22bc7cfcba51438de4efa31f/memory.limit_in_bytes
178257920
/sys/fs/cgroup/memory/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/fde98ded39d10e37a883c8e08fadc8e541b1793f44548f0b3e028b6c1ddd9034/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/podf3f524b6-565e-11e9-9de2-525400123456/memory.limit_in_bytes
178257920
/sys/fs/cgroup/memory/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/c419fde416d40bf1d89c8dbe374b229ef3a48dbce5cf808c7caf5282254a4ece/memory.limit_in_bytes
178257920
/sys/fs/cgroup/memory/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/cddf60599b04a5ff107254047bceea0608cf59a82de9626f6f4b9c2778705be6/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/burstable/podf3f5916c-565e-11e9-9de2-525400123456/memory.limit_in_bytes
178257920
/sys/fs/cgroup/memory/kubepods/memory.limit_in_bytes
7851159552
/sys/fs/cgroup/memory/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/75f8d2d2358fe8b7516f29472bb11378b108d0567dc820d0fc757469c2c9ca0f/memory.limit_in_bytes
52428800
/sys/fs/cgroup/memory/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/e1a54b680bdef8d6ee1acf1fde4a2a35db950cb5ad106d11f694fc10841bdcf0/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/pod45e519f1-565f-11e9-9de2-525400123456/memory.limit_in_bytes
52428800
/sys/fs/cgroup/memory/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/12d48c76f025ea5b0a2f3cbab6765b75c5666ef74528da4ef7d8c4260a075faa/memory.limit_in_bytes
209715200
/sys/fs/cgroup/memory/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/9a9a45ea9730ef8a313070103f2701a4d1942c90eb985938d77b968f5cac460e/memory.limit_in_bytes
419430400
/sys/fs/cgroup/memory/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/de7559571058b943737889052c3ebd48cba281b7ab3b0b2327825d8256e570da/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/pod5c061c35-5666-11e9-9de2-525400123456/memory.limit_in_bytes
629145600
/sys/fs/cgroup/memory/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2882ebb5acb99f5b09ee41954720101a954847151951ff56e0e3a919a2044a5a/memory.limit_in_bytes
9223372036854771712
/sys/fs/cgroup/memory/kubepods/pod99b66879-565f-11e9-9de2-525400123456/2af8973b96abe2e023eb343cb87989a33c08b61f1bf2eead306da8f15eb74026/memory.limit_in_bytes
209715200
/sys/fs/cgroup/memory/kubepods/pod99b66879-565f-11e9-9de2-525400123456/a547d6c57f8b08a493156febb9b7071320ce1d87631e2e7bea142b7ac9351f23/memory.limit_in_bytes
419430400
/sys/fs/cgroup/memory/kubepods/pod99b66879-565f-11e9-9de2-525400123456/memory.limit_in_bytes
629145600
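
The raw byte values above are easier to compare with pod specs when converted to Mi. The value 9223372036854771712 is 2^63 minus one 4096-byte page, which is how cgroup v1 reports "no limit". A minimal sketch of the conversion follows; `limit_to_mi` is a hypothetical helper name.

```bash
# Sketch: render a memory.limit_in_bytes value as whole MiB.
# cgroup v1 reports "no limit" as 2^63 - 4096 on this system.
UNLIMITED=9223372036854771712

limit_to_mi() {
  local bytes=$1
  if [[ "$bytes" == "$UNLIMITED" ]]; then
    echo "unlimited"
  else
    echo "$(( bytes / 1048576 ))Mi"   # integer division, truncates
  fi
}

# Values taken from the dump above.
for b in 9223372036854771712 52428800 178257920 209715200 419430400 629145600; do
  printf '%20s -> %s\n' "$b" "$(limit_to_mi "$b")"
done
```

This gives 50Mi for the flannel pod, 170Mi for the coredns pods, and 600Mi for the burst-2, guar-2 and guar-2s pods, matching the Memory Limits column in the pod listing.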

Observed scheduling

#top -b
top - 23:47:30 up 22:52,  4 users,  load average: 9.59, 9.14, 8.63
Tasks: 206 total,   9 running, 133 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.1 us,  8.5 sy,  0.0 ni, 89.1 id,  0.1 wa,  0.0 hi,  0.1 si,  0.1 st
KiB Mem :  8167148 total,  4444140 free,   692316 used,  3030692 buff/cache
KiB Swap:        0 total,        0 free,        0 used.  7184956 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8555 root      20   0    1288      4      0 R 100.0  0.0  80:27.66 md5sum
 8624 root      20   0    1288      4      0 R 100.0  0.0  80:26.17 md5sum
25546 root      20   0    1288      4      0 R 100.0  0.0  28:16.58 md5sum
25691 root      20   0    1288      4      0 R  93.8  0.0  28:20.72 md5sum
10705 root      20   0    1288      4      0 R  37.5  0.0  19:53.62 md5sum
26277 root      20   0    1288      4      0 R  31.2  0.0  12:43.99 md5sum
26360 root      20   0    1288      4      0 R  12.5  0.0   6:24.16 md5sum
 6409 root      20   0   44532   3932   3372 R   6.2  0.0   0:00.01 top
26252 root      20   0   11788   6076   4184 S   6.2  0.1   0:00.95 containerd-shim
32405 root      20   0 2163916 100936  65528 S   6.2  1.2   2:32.42 kubelet
    1 root      20   0  225524   9308   6644 S   0.0  0.1   0:40.32 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.02 kthreadd
Non-terminated Pods:         (12 in total)
  Namespace                  Name                                     CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                  ----                                     ------------  ----------  ---------------  -------------  ---
  default                    beff-2                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         33m
  default                    burst-2                                  300m (4%)     600m (8%)   400Mi (5%)       600Mi (8%)     41m
  default                    guar-2                                   600m (8%)     600m (8%)   600Mi (8%)       600Mi (8%)     33m
  default                    guar-2s                                  3 (42%)       3 (42%)     600Mi (8%)       600Mi (8%)     81m
  kube-system                coredns-fb8b8dccf-ng6st                  100m (1%)     0 (0%)      70Mi (0%)        170Mi (2%)     86m
  kube-system                coredns-fb8b8dccf-tctwp                  100m (1%)     0 (0%)      70Mi (0%)        170Mi (2%)     86m
  kube-system                etcd-bored-pelinor                       0 (0%)        0 (0%)      0 (0%)           0 (0%)         85m
  kube-system                kube-apiserver-bored-pelinor             250m (3%)     0 (0%)      0 (0%)           0 (0%)         85m
  kube-system                kube-controller-manager-bored-pelinor    200m (2%)     0 (0%)      0 (0%)           0 (0%)         85m
  kube-system                kube-flannel-ds-amd64-gtwrh              100m (1%)     100m (1%)   50Mi (0%)        50Mi (0%)      84m
  kube-system                kube-proxy-vk6j9                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         86m
  kube-system                kube-scheduler-bored-pelinor             100m (1%)     0 (0%)      0 (0%)           0 (0%)         85m
#kubectl top pod --all-namespaces
NAMESPACE     NAME                                    CPU(cores)   MEMORY(bytes)
default       beff-2                                  1929m        1Mi
default       burst-2                                 601m         1Mi
default       guar-2                                  600m         1Mi
default       guar-2s                                 2000m        1Mi
kube-system   coredns-fb8b8dccf-ng6st                 2m           11Mi
kube-system   coredns-fb8b8dccf-tctwp                 2m           9Mi
kube-system   etcd-bored-pelinor                      14m          40Mi
kube-system   kube-apiserver-bored-pelinor            21m          233Mi
kube-system   kube-controller-manager-bored-pelinor   8m           40Mi
kube-system   kube-flannel-ds-amd64-gtwrh             2m           11Mi
kube-system   kube-proxy-vk6j9                        1m           14Mi
kube-system   kube-scheduler-bored-pelinor            1m           12Mi
kube-system   metrics-server-78b6bc9ddf-z8fvw         1m           13Mi

What about Kata

cpusets

  • set at container level (NOT pod level)
    • pause container is not pinned
  • this is not good for kata as we cannot use the pod or pause (unless we muck up pause)

cpu shares

  • set at pod level (and container level)
    • this is good for kata as QEMU can run in the pod cgroup
  • cpu is compressible so shares being too low is not an issue

cfs_quota_us

  • set at pod level
    • this is good for kata as qemu can run in the pod cgroup
  • most pods are unbound, so this is good for kata in a way; pods that are bound will need correctly passed upper limits (see sandbox overhead proposal described later)

memory.limit_in_bytes

  • set at pod level

    • this is good for kata as qemu can run in the pod cgroup
    • but the limit as defined may be too small, resulting in OOM
    • so we absolutely need proportional pod overhead for bound pods
  • Once @egernst adds support for sandbox overhead they should be additive to pod level limits, when limits actually apply.

  • However the sandbox cannot be the pause container. It needs to be the pod itself (which is the right model).

  • However it leaves out the issue of cpusets.

  • Given that cpusets should be applicable to the pod cgroup but are currently not applied to it, it may make sense to modify the upstream logic to set them up at pod level.

    • note: However when the reconciliation loop runs it should first open up the pod and apply container sets and then close the pod. This will make the loop a little more complex unlike the current implementation.
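
To illustrate what "additive to pod level limits" would mean for the memory case, here is a minimal sketch of the arithmetic. The helper name `pod_limit_with_overhead` and the 128Mi overhead figure are hypothetical and only illustrate the idea; they are not taken from the sandbox overhead proposal or measured on Kata.

```bash
# Sketch: pod-level limit = sum of container limits + sandbox overhead.
# All values are in Mi. The overhead value is a made-up illustration.
pod_limit_with_overhead() {
  local overhead=$1
  shift
  local total=$overhead
  local c
  for c in "$@"; do
    total=$(( total + c ))
  done
  echo "$total"
}

# Two containers limited to 200Mi and 400Mi (as in the memory dump above),
# plus a hypothetical 128Mi for the VM and its helper processes.
echo "pod limit: $(pod_limit_with_overhead 128 200 400)Mi"
```

With these numbers the pod cgroup would be limited to 728Mi instead of 600Mi, leaving room for the hypervisor process that runs in the pod cgroup.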

How does core-metrics obtain resource utilization

Obtaining Resource Utilization

Using the raw Kubernetes API, you can see a pod's metrics broken down per container:

mrcastel@bored-pelinor:~$ kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/guar-2" | jq -C .
{
  "kind": "PodMetrics",
  "apiVersion": "metrics.k8s.io/v1beta1",
  "metadata": {
    "name": "guar-2",
    "namespace": "default",
    "selfLink": "/apis/metrics.k8s.io/v1beta1/namespaces/default/pods/guar-2",
    "creationTimestamp": "2019-04-04T23:52:59Z"
  },
  "timestamp": "2019-04-04T23:51:54Z",
  "window": "30s",
  "containers": [
    {
      "name": "busybee",
      "usage": {
        "cpu": "398739040n",
        "memory": "744Ki"
      }
    },
    {
      "name": "busybum",
      "usage": {
        "cpu": "200848120n",
        "memory": "852Ki"
      }
    }
  ]
}
mrcastel@bored-pelinor:~$ kubectl get pod --all-namespaces -o=custom-columns=NAME:.metadata.name,UID:.metadata.uid | grep guar-2
guar-2                                  839f41ff-5728-11e9-9de2-525400123456

Looking at the cgroups

#for i in `ls /sys/fs/cgroup/memory/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/**/memory.usage_in_bytes`; do echo $i && cat $i; done
/sys/fs/cgroup/memory/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/2de4cc4497c0d69aefc9e11dc64536a9dcd4d6a3cc51fc4dfd138028ef9b6314/memory.usage_in_bytes
761856
/sys/fs/cgroup/memory/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/66be758085f37040c49d94d04129ec03a05b8f65fe8419f6665d365a0eb14625/memory.usage_in_bytes
872448
/sys/fs/cgroup/memory/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/ab621c92b945758ee275315251c7a491467aabbc0d8fdaf02158c50f98e14ddc/memory.usage_in_bytes
614400
/sys/fs/cgroup/memory/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/memory.usage_in_bytes
2248704

Note: The pause container has memory usage, but the metrics API does not report or account for it.

mrcastel@bored-pelinor:~$ kubectl top pod
NAME      CPU(cores)   MEMORY(bytes)
beff-2    1955m        1Mi
burst-2   599m         1Mi
guar-2    597m         1Mi
guar-2s   2001m        1Mi

Note: The total memory usage here does not quite match up (maybe rounded down?)
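
One possible explanation for the mismatch, offered as an assumption rather than something verified against the kubectl source: the reported value is the sum of the application containers only (pause excluded), truncated to whole Mi. The sketch below runs that arithmetic on the `memory.usage_in_bytes` values shown above; `sum_to_mi` is a hypothetical helper name.

```bash
# Sketch: sum memory usage values in bytes and report whole MiB,
# truncating the way integer division does.
sum_to_mi() {
  local total=0
  local b
  for b in "$@"; do
    total=$(( total + b ))
  done
  echo $(( total / 1048576 ))
}

# The two application containers of guar-2 (744Ki and 852Ki), pause excluded.
echo "containers only: $(sum_to_mi 761856 872448)Mi"
# All three cgroups, pause included (sums to the pod-level 2248704 bytes).
echo "with pause:      $(sum_to_mi 761856 872448 614400)Mi"
```

The containers-only sum is about 1.56Mi and truncates to 1Mi, which agrees with `kubectl top`; the pod-level cgroup value would truncate to 2Mi.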

/sys/fs/cgroup/cpu/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/cpuacct.usage
3664938987887
#for i in `ls /sys/fs/cgroup/cpu/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/**/cpuacct.usage`; do echo $i && cat $i; done
/sys/fs/cgroup/cpu/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/2de4cc4497c0d69aefc9e11dc64536a9dcd4d6a3cc51fc4dfd138028ef9b6314/cpuacct.usage
2436732506834
/sys/fs/cgroup/cpu/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/66be758085f37040c49d94d04129ec03a05b8f65fe8419f6665d365a0eb14625/cpuacct.usage
1228524887574
/sys/fs/cgroup/cpu/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/ab621c92b945758ee275315251c7a491467aabbc0d8fdaf02158c50f98e14ddc/cpuacct.usage
42275543
/sys/fs/cgroup/cpu/kubepods/pod839f41ff-5728-11e9-9de2-525400123456/cpuacct.usage
3665299669951

Note: The CPU usage is cumulative, in nanoseconds, so it is a bit harder to correlate with the rate reported by the metrics API.
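
To turn the cumulative counter into a rate comparable with `kubectl top`, take two readings and divide the difference by the elapsed wall-clock time. A minimal sketch follows; `cpu_millicores` is a hypothetical helper name, and the elapsed time between the two pod-level readings above was not recorded, so the 1 second used here is only a placeholder.

```bash
# Sketch: millicores = (delta of cpuacct.usage in ns) * 1000 / (elapsed ns).
cpu_millicores() {
  local start_ns=$1
  local end_ns=$2
  local elapsed_ns=$3
  echo $(( (end_ns - start_ns) * 1000 / elapsed_ns ))
}

# The two pod-level readings shown above; the elapsed time is a placeholder.
cpu_millicores 3664938987887 3665299669951 1000000000

# To sample a live cgroup (path is an example, adjust to your pod):
#   f=/sys/fs/cgroup/cpu/kubepods/pod<uid>/cpuacct.usage
#   a=$(cat "$f"); sleep 10; b=$(cat "$f")
#   cpu_millicores "$a" "$b" 10000000000
```

The difference between the two readings is 360682064 ns, so with the placeholder interval the function prints 360. A real measurement needs the actual interval between the two reads.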

Open: How can Kata provide these numbers at a container level?

References

https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-memory-6b41e9a955f9
https://medium.com/@betz.mark/understanding-resource-limits-in-kubernetes-cpu-time-9eff74d3161b
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-allocatable.md#recommended-cgroups-setup
rancher/rancher#17177
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
https://libvirt.org/formatdomain.html
https://github.com/torvalds/linux/blob/master/Documentation/cgroup-v1/cpuacct.txt
https://godoc.org/k8s.io/kubelet/config/v1beta1#KubeletConfiguration
https://kubernetes.io/docs/tasks/administer-cluster/manage-resources/memory-default-namespace/
https://docs.openshift.com/container-platform/3.3/admin_guide/allocating_node_resources.html#node-enforcement
https://www.certdepot.net/rhel7-get-started-cgroups/

Cgroups primer

Let us start two jobs

#bash -c "exec -a jobmore stress-ng --cpu 3 --timeout 120m" &
#bash -c "exec -a jobless stress-ng --cpu 3 --timeout 120m" &

and then examine the resource usage

#htop
...
24036 mrcastel   20   0 53800  5712  5392 S  0.0  0.1  0:00.00 jobless --cpu 3 --timeout 120m
24037 mrcastel   20   0 54448  7184  4336 R 68.6  0.1  3:43.63 jobless --cpu 3 --timeout 120m
24038 mrcastel   20   0 54448  7184  4336 R 68.0  0.1  3:43.67 jobless --cpu 3 --timeout 120m
24039 mrcastel   20   0 54448  7184  4336 R 67.3  0.1  3:43.67 jobless --cpu 3 --timeout 120m
24137 mrcastel   20   0 53800  5808  5492 S  0.0  0.1  0:00.01 jobmore --cpu 3 --timeout 120m
24138 mrcastel   20   0 54444  7160  4308 R 100.  0.1  4:08.10 jobmore --cpu 3 --timeout 120m
24139 mrcastel   20   0 54444  7160  4308 R 100.  0.1  4:07.73 jobmore --cpu 3 --timeout 120m
24140 mrcastel   20   0 54444  7160  4308 R 100.  0.1  4:08.31 jobmore --cpu 3 --timeout 120m

At this point the jobs are unconstrained and use as much of the system's CPUs as they can get.

Constrain the quota (upper bound)

#mkdir /sys/fs/cgroup/cpu/testcg
#mkdir /sys/fs/cgroup/cpu/testcg/jobless
#mkdir /sys/fs/cgroup/cpu/testcg/jobmore

#echo "24036" > /sys/fs/cgroup/cpu/testcg/jobless/tasks
#echo "24037" >> /sys/fs/cgroup/cpu/testcg/jobless/tasks
#echo "24038" >> /sys/fs/cgroup/cpu/testcg/jobless/tasks
#echo "24039" >> /sys/fs/cgroup/cpu/testcg/jobless/tasks

#echo "24137" > /sys/fs/cgroup/cpu/testcg/jobmore/tasks
#echo "24138" >> /sys/fs/cgroup/cpu/testcg/jobmore/tasks
#echo "24139" >> /sys/fs/cgroup/cpu/testcg/jobmore/tasks
#echo "24140" >> /sys/fs/cgroup/cpu/testcg/jobmore/tasks

Now let us set an upper bound at the parent cgroup level and split the time among the children. With the default cpu.cfs_period_us of 100000, a quota of 300000 corresponds to 3 CPUs.

#echo 300000 > /sys/fs/cgroup/cpu/testcg/cpu.cfs_quota_us
#echo 100000 > /sys/fs/cgroup/cpu/testcg/jobless/cpu.cfs_quota_us
#echo 200000 > /sys/fs/cgroup/cpu/testcg/jobmore/cpu.cfs_quota_us

and look at the resource utilization

#htop
...
24036 mrcastel   20   0 53800  5712  5392 S  0.0  0.1  0:00.00 jobless --cpu 3 --timeout 120m
24037 mrcastel   20   0 54448  7184  4336 R 32.4  0.1  5:53.28 jobless --cpu 3 --timeout 120m
24038 mrcastel   20   0 54448  7184  4336 R 34.4  0.1  5:53.68 jobless --cpu 3 --timeout 120m
24039 mrcastel   20   0 54448  7184  4336 R 34.4  0.1  5:53.53 jobless --cpu 3 --timeout 120m
24137 mrcastel   20   0 53800  5808  5492 S  0.0  0.1  0:00.01 jobmore --cpu 3 --timeout 120m
24138 mrcastel   20   0 54444  7160  4308 R 66.1  0.1  6:49.54 jobmore --cpu 3 --timeout 120m
24139 mrcastel   20   0 54444  7160  4308 R 66.1  0.1  6:49.39 jobmore --cpu 3 --timeout 120m
24140 mrcastel   20   0 54444  7160  4308 R 68.0  0.1  6:50.38 jobmore --cpu 3 --timeout 120m

So we see that all the jobs fit within 3 CPUs. And furthermore jobless gets only 1 CPU and jobmore gets 2 CPUs.
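The arithmetic behind these numbers can be sketched as follows. This assumes the default cpu.cfs_period_us of 100000 (the period was not changed above) and uses two hypothetical helper names, quota_to_millicpus and per_thread_percent.

```shell
# Convert a CFS quota into milli-CPUs. The period defaults to 100000 us
# unless a second argument overrides it.
quota_to_millicpus() {
    local quota_us=$1 period_us=${2:-100000}
    echo $(( quota_us * 1000 / period_us ))
}

# Expected busy percentage per thread when nthreads CPU-bound threads
# share one quota equally (integer division, so 33 rather than 33.3).
per_thread_percent() {
    local quota_us=$1 nthreads=$2 period_us=${3:-100000}
    echo $(( quota_us * 100 / period_us / nthreads ))
}

quota_to_millicpus 300000      # testcg  → 3000
quota_to_millicpus 100000      # jobless → 1000
quota_to_millicpus 200000      # jobmore → 2000
per_thread_percent 100000 3    # jobless → 33
per_thread_percent 200000 3    # jobmore → 66
```

The per-thread values line up with the htop output above, where the jobless threads sit around 32 to 34 percent and the jobmore threads around 66 to 68 percent.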

Assured quota (lower bound)

Now let us give them the same upper bound

/sys/fs/cgroup/cpu/testcg/cpu.cfs_quota_us
200000
/sys/fs/cgroup/cpu/testcg/jobless/cpu.cfs_quota_us
100000
/sys/fs/cgroup/cpu/testcg/jobmore/cpu.cfs_quota_us
100000

We see the CPU time is split evenly

24036 mrcastel   20   0 53800  5712  5392 S  0.0  0.1  0:00.00 jobless --cpu 3 --timeout 120m
24037 mrcastel   20   0 54448  7184  4336 R 33.3  0.1  9:24.10 jobless --cpu 3 --timeout 120m
24038 mrcastel   20   0 54448  7184  4336 R 33.3  0.1  9:24.36 jobless --cpu 3 --timeout 120m
24039 mrcastel   20   0 54448  7184  4336 R 33.9  0.1  9:24.20 jobless --cpu 3 --timeout 120m
24137 mrcastel   20   0 53800  5808  5492 S  0.0  0.1  0:00.01 jobmore --cpu 3 --timeout 120m
24138 mrcastel   20   0 54444  7160  4308 R 33.3  0.1 11:54.74 jobmore --cpu 3 --timeout 120m
24139 mrcastel   20   0 54444  7160  4308 R 32.6  0.1 11:55.03 jobmore --cpu 3 --timeout 120m
24140 mrcastel   20   0 54444  7160  4308 R 33.3  0.1 11:55.74 jobmore --cpu 3 --timeout 120m

But let us say jobmore is more important, so let us set up the shares accordingly

/sys/fs/cgroup/cpu/testcg/cpu.shares
1024
/sys/fs/cgroup/cpu/testcg/jobless/cpu.shares
24
/sys/fs/cgroup/cpu/testcg/jobmore/cpu.shares
1000

We do not quite see what we expected. That is because there are enough free CPUs, so the two jobs never contend with each other and the shares never come into play

24036 mrcastel   20   0 53800  5712  5392 S  0.0  0.1  0:00.00 jobless --cpu 3 --timeout 120m
24037 mrcastel   20   0 54448  7184  4336 R 33.3  0.1  9:24.10 jobless --cpu 3 --timeout 120m
24038 mrcastel   20   0 54448  7184  4336 R 33.3  0.1  9:24.36 jobless --cpu 3 --timeout 120m
24039 mrcastel   20   0 54448  7184  4336 R 33.9  0.1  9:24.20 jobless --cpu 3 --timeout 120m
24137 mrcastel   20   0 53800  5808  5492 S  0.0  0.1  0:00.01 jobmore --cpu 3 --timeout 120m
24138 mrcastel   20   0 54444  7160  4308 R 33.3  0.1 11:54.74 jobmore --cpu 3 --timeout 120m
24139 mrcastel   20   0 54444  7160  4308 R 32.6  0.1 11:55.03 jobmore --cpu 3 --timeout 120m
24140 mrcastel   20   0 54444  7160  4308 R 33.3  0.1 11:55.74 jobmore --cpu 3 --timeout 120m

So let us force all the tasks onto the same CPU

$ taskset -p 10 24036
$ taskset -p 10 24037
$ taskset -p 10 24038
$ taskset -p 10 24039

$ taskset -p 10 24137
$ taskset -p 10 24138
$ taskset -p 10 24139
$ taskset -p 10 24140
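Note that taskset -p reads its mask argument as hexadecimal, so 10 is 0x10, which has only bit 4 set and therefore pins each task to CPU 4 (not to CPU 10). A small sketch for decoding such a mask, using a hypothetical helper named mask_to_cpus:

```shell
# mask_to_cpus is a hypothetical helper that lists the CPU numbers
# selected by a hexadecimal affinity mask (with or without a 0x prefix).
mask_to_cpus() {
    local mask=$(( 16#${1#0x} )) cpu=0 out=""
    while (( mask > 0 )); do
        # Bit N set in the mask means CPU N is allowed.
        if (( mask & 1 )); then
            out+="${out:+,}$cpu"
        fi
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "$out"
}

mask_to_cpus 10   # → 4
mask_to_cpus ff   # → 0,1,2,3,4,5,6,7
```

Any single-CPU mask would work for this experiment; what matters is that all eight tasks share the same one.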

Now you see jobmore get the correct lower bound, even though both jobs have the same upper bound.

24036 mrcastel   20   0 53800  5712  5392 S  0.0  0.1  0:00.00 jobless --cpu 3 --timeout 120m
24037 mrcastel   20   0 54448  7184  4336 R  0.7  0.1 15:34.61 jobless --cpu 3 --timeout 120m
24038 mrcastel   20   0 54448  7184  4336 R  0.7  0.1 15:35.11 jobless --cpu 3 --timeout 120m
24039 mrcastel   20   0 54448  7184  4336 R  1.3  0.1 15:38.99 jobless --cpu 3 --timeout 120m
24137 mrcastel   20   0 53800  5808  5492 S  0.0  0.1  0:00.01 jobmore --cpu 3 --timeout 120m
24138 mrcastel   20   0 54444  7160  4308 R 32.9  0.1 18:15.75 jobmore --cpu 3 --timeout 120m
24139 mrcastel   20   0 54444  7160  4308 R 32.3  0.1 18:14.36 jobmore --cpu 3 --timeout 120m
24140 mrcastel   20   0 54444  7160  4308 R 32.3  0.1 18:15.22 jobmore --cpu 3 --timeout 120m
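The split observed here follows from the ratio of cpu.shares between the two sibling cgroups once they compete for a single CPU. A minimal sketch of that arithmetic, using a hypothetical helper named share_permille:

```shell
# share_permille is a hypothetical helper. The first argument is the
# cgroup's own cpu.shares, the remaining arguments are its siblings' shares.
# It prints the cgroup's portion of contended CPU time in per-mille.
share_permille() {
    local mine=$1 total=0 s
    for s in "$@"; do
        total=$(( total + s ))
    done
    echo $(( mine * 1000 / total ))
}

share_permille 24 1000    # jobless → 23  (about 2.3% of the CPU)
share_permille 1000 24    # jobmore → 976 (about 97.6% of the CPU)
```

In the htop output above the three jobless threads add up to 2.7% and the three jobmore threads to 97.5%, which is close to the computed split. Shares are relative weights, so they only take effect when the siblings actually contend for the same CPU.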
@sb1975

sb1975 commented Oct 30, 2019

Do we need to have separate cgroups created for Multitenancy? Assuming Multitenancy here is each tenant having its own namespace?

@jmunty

jmunty commented Dec 7, 2020

this is the best resource that i've come across so far explaining/linking details on node-allocatable stuff!! congrats!
sadly the google page ranking of this page is really bad.. cant remember how i came across it, but the search string was pretty obscure. maybe post to a blog or something?

@mcastelino
Author

@jmunty thanks. I did try and publish this https://medium.com/@mcastelino/kubernetes-resource-management-deep-dive-b337ba15359c and broke it down a little bit. But the gist is more complete.

This needs an update to cover more recent changes in k8s, cgroups-v2, and the pod overhead feature
https://kubernetes.io/docs/concepts/scheduling-eviction/pod-overhead/ which we added to ensure we have close to perfect accounting.

We still find that we are close but not perfect if you want to push resource usage to the limit.

@mcastelino
Author

Do we need to have separate cgroups created for Multitenancy? Assuming Multitenancy here is each tenant having its own namespace?

@sb1975 no. This is system-level accounting. The multi-tenancy is logical and rides on top of this. I just turned on GitHub notifications and saw this comment.
