@abhioncbr
Last active April 23, 2021 03:50
I will start this post by thanking readers for their appreciation of my last post. This post is a continuation of the series; here is the list of posts.

Let's first quickly revisit the key takeaway from the last post: a pod's QoS class determines which pods the Kubelet removes first from a node in case of eviction. There are three QoS classes, i.e. Best Effort, Burstable & Guaranteed; Best Effort pods are evicted first, Burstable pods second, and Guaranteed pods last.
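To make the recap above concrete: a pod whose containers set resource requests equal to their limits is assigned the Guaranteed QoS class. A minimal sketch (the pod name and image below are placeholders, not from the original post):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: qos-demo            # placeholder name
spec:
  containers:
  - name: app
    image: nginx            # any image; nginx is just an example
    resources:
      requests:             # requests == limits for every resource
        memory: "200Mi"     # => Kubernetes assigns QoS class "Guaranteed"
        cpu: "500m"
      limits:
        memory: "200Mi"
        cpu: "500m"
```

`kubectl get pod qos-demo -o jsonpath='{.status.qosClass}'` reports the assigned class; omitting requests and limits entirely yields Best Effort, and anything in between yields Burstable.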

In this post, we are going to cover node resources, what causes eviction, and what the soft and hard threshold signals are. Let's start with node resources.

Node resources

Node resources means the capacity of a node: the number of CPUs, the amount of memory, and the disk space available for data persistence. We can quickly figure out that node resources are finite. Node capacity is part of the NodeStatus object; a node reports its capacity when it registers with the K8s cluster. There are two categories of node resources:

  • Shareable: compressible resources that multiple processes can share, such as CPU or network bandwidth.
  • Non-shareable: incompressible compute resources such as memory or disk; scarcity of these leads to node instability.

Scarcity of shareable resources leads to throttling of processes; however, over-usage of non-shareable resources triggers maintenance programs such as the OOM Killer. These maintenance programs are compute-intensive and can stall the node for a while, making it temporarily unavailable in the cluster. To avoid such behaviour, Kubelet takes a pro-active approach: it monitors the node's resources and evicts some pods in case of starvation. The following commands are helpful for observing resource usage on a per-node basis.

kubectl describe node <node_name>  //for knowing capacity, allocatable capacity and current usage of a node.
kubectl top nodes                  //for knowing current usage of all the nodes in a cluster.

Kubernetes eviction policy

Kubelet proactively monitors the node's compute resources; in case of starvation or exhaustion, it tries to reclaim resources by evicting (failing) pods: it terminates all of a pod's containers and transitions its PodPhase to Failed. Let's see how Kubelet implements monitoring and eviction.

Eviction signals

The Kubelet process configures threshold values for node resources, and breaching a threshold triggers eviction of pods from the node. The table below shows the eviction signals currently supported:

Eviction Signal     Description
memory.available    memory.available := node.status.capacity[memory] - node.stats.memory.workingSet
nodefs.available    nodefs.available := node.stats.fs.available
nodefs.inodesFree   nodefs.inodesFree := node.stats.fs.inodesFree
imagefs.available   imagefs.available := node.stats.runtime.imagefs.available
imagefs.inodesFree  imagefs.inodesFree := node.stats.runtime.imagefs.inodesFree
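These signal names are what the kubelet's eviction flags refer to. For example, a hard eviction threshold on two of the signals above could be configured as follows (the values are illustrative, not recommendations):

```shell
# Evict pods as soon as available memory drops below 500Mi
# or the node filesystem has less than 10% free space.
kubelet --eviction-hard=memory.available<500Mi,nodefs.available<10%
```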

Threshold values for these signals are either literal quantities or percentages; a percentage is calculated against the node's total capacity for that resource. As per the official Kubernetes documentation, Kubelet supports only two filesystems:

  • The nodefs filesystem, which Kubelet uses for volumes, daemon logs, etc.
  • The imagefs filesystem, which container runtimes use for storing images and container writable layers. It is optional, and Kubelet auto-discovers these filesystems using cAdvisor.

Kubelet does not care about any other filesystems.
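To make the literal-vs-percentage distinction concrete, here is a small Python sketch (the helper name is ours, not part of the kubelet) showing how both forms resolve to a byte count:

```python
# Illustrative helper (not part of the kubelet API): resolve an eviction
# threshold string -- either a literal quantity or a percentage of the
# node's total capacity -- into a number of bytes.
def threshold_bytes(threshold: str, capacity_bytes: int) -> int:
    if threshold.endswith("%"):
        # Percentage thresholds are computed against total capacity.
        return int(capacity_bytes * float(threshold[:-1]) / 100)
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if threshold.endswith(suffix):
            return int(threshold[:-len(suffix)]) * factor
    return int(threshold)  # bare byte count

# "10%" of a 100 GiB nodefs vs. the literal default "100Mi":
print(threshold_bytes("10%", 100 * 1024**3))  # 10737418240
print(threshold_bytes("100Mi", 0))            # 104857600
```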

Kubelet configures two categories of eviction thresholds: soft thresholds and hard thresholds.

  • A soft eviction threshold is a combination of two values: the threshold limit itself and an administrator-specified grace period for which the signal must stay breached before eviction. The grace-period argument is mandatory for the Kubelet process when a soft threshold is set. Also, when a soft eviction threshold is met, a maximum pod termination grace period can be defined too.

  • A hard eviction threshold is similar to a soft one but has no grace period: once it is reached, Kubelet immediately evicts pods from the node with no graceful termination. The kubelet has the following default hard eviction thresholds:

    • memory.available < 100Mi
    • nodefs.available < 10%
    • nodefs.inodesFree < 5%
    • imagefs.available < 15%
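Both categories can also be expressed in a KubeletConfiguration file. The sketch below sets a soft memory threshold with a grace period alongside the default hard thresholds listed above (the soft values are illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionSoft:
  memory.available: "500Mi"     # soft threshold; eviction waits for the grace period
evictionSoftGracePeriod:
  memory.available: "1m30s"     # signal must hold below threshold this long
evictionMaxPodGracePeriod: 60   # cap on pod termination grace during soft eviction
evictionHard:                   # hard thresholds: immediate eviction, no grace
  memory.available: "100Mi"
  nodefs.available: "10%"
  nodefs.inodesFree: "5%"
  imagefs.available: "15%"
```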