@omad
Last active March 30, 2021 23:59
Kubernetes: Prototype for killing hung pods in a compute Job


The Problem

The Plan

Set up a livenessProbe command that checks the CPU usage of the pod. If usage stays below a threshold value for a threshold time, kill the Pod.

Status

I've tested the shell script in a pod inside our k8s cluster, and the values it reports match what I expected.

The measurement is fairly coarse, but should be more than good enough for our purposes.

The 0.01 is CPU usage as a fraction of one core, i.e. 1%. If the usage over 5 seconds is below that, the script exits with status 1 (failure); otherwise it exits with status 0 (success).
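The threshold arithmetic can be sketched as a small shell function. This is only an illustration: `cpu_busy` and its argument order are hypothetical names, not part of the probe itself, and the readings are passed in as arguments rather than read from the cgroup.

```shell
# Hypothetical helper: succeed (exit 0) if the cgroup used at least
# `threshold` of one core between two cpuacct.usage readings.
# Arguments: start_ns stop_ns interval_seconds threshold_fraction
cpu_busy() {
  awk -v start="$1" -v stop="$2" -v secs="$3" -v min="$4" 'BEGIN {
    # Nanoseconds of CPU used divided by nanoseconds of wall time
    # gives the fraction of one core that was busy.
    frac = (stop - start) / (secs * 1000000000)
    if (frac < min) exit 1
    exit 0
  }'
}

# 100 ms of CPU over 5 s is 2% of a core: above the 0.01 threshold.
cpu_busy 1000000000 1100000000 5 0.01 && echo busy
# 10 ms of CPU over 5 s is 0.2% of a core: below the threshold.
cpu_busy 1000000000 1010000000 5 0.01 || echo idle
```

Passing the values with `-v` instead of interpolating them into the awk program avoids quoting problems if the threshold or interval is later made configurable.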

This check runs every 30s, starting 30s after the container starts, and must fail 3 times in a row to kill the pod.
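As a rough guide to how long a hung pod survives under these settings, the bounds can be worked out from the three numbers involved. This is a back-of-the-envelope sketch: `detection_window` is a hypothetical helper, and it ignores kubelet scheduling jitter and the termination grace period.

```shell
# Values from the probe spec; sample_seconds is the sleep inside the probe command.
period_seconds=30
failure_threshold=3
sample_seconds=5

# Hypothetical helper: rough bounds (in seconds) on the delay between a
# hang and the failure that triggers the kill.
# Arguments: period_seconds failure_threshold sample_seconds
detection_window() {
  best=$(( ($2 - 1) * $1 + $3 ))   # hang happens just before a probe starts
  worst=$(( $2 * $1 + $3 ))        # hang happens just after a probe started
  echo "${best}-${worst}s"
}

detection_window "$period_seconds" "$failure_threshold" "$sample_seconds"
# prints "65-95s"
```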

Notes

See also:

Kind of based on https://unix.stackexchange.com/questions/450748/calculating-cpu-usage-of-a-cgroup-over-a-period-of-time

The value read from cpuacct.usage is the cgroup's cumulative CPU usage in nanoseconds.
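The `cpuacct.usage` file belongs to cgroup v1. On nodes running cgroup v2 the equivalent counter is the `usage_usec` line of `cpu.stat`, in microseconds. A minimal sketch of a reader that handles both is shown here; `cgroup_cpu_usage_ns` is a hypothetical helper, and it assumes the default `/sys/fs/cgroup` mount point unless another root is passed.

```shell
# Hypothetical helper: print the cumulative CPU usage of the current cgroup
# in nanoseconds. The optional argument overrides the cgroup mount point.
cgroup_cpu_usage_ns() {
  root="${1:-/sys/fs/cgroup}"
  if [ -r "$root/cpu/cpuacct.usage" ]; then
    # cgroup v1: the file already holds nanoseconds.
    cat "$root/cpu/cpuacct.usage"
  elif [ -r "$root/cpu.stat" ]; then
    # cgroup v2: cpu.stat has a "usage_usec <microseconds>" line.
    awk '$1 == "usage_usec" { printf "%.0f\n", $2 * 1000 }' "$root/cpu.stat"
  else
    echo "no CPU accounting file found under $root" >&2
    return 1
  fi
}

# Usage inside a probe command (not executed here):
#   cstart=$(cgroup_cpu_usage_ns); sleep 5; cstop=$(cgroup_cpu_usage_ns)
```

Normalising to nanoseconds means the rest of the check can stay the same regardless of which cgroup version the node runs.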

apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-exec
spec:
  # Don't recreate the pod: if it was hung, we want it gone.
  restartPolicy: Never
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - run-my-proc
    livenessProbe:
      exec:
        # Check that CPU usage over 5 seconds is at least 0.01 of one core (1%)
        command:
        - /bin/sh
        - -c
        - cstart=$(cat /sys/fs/cgroup/cpu/cpuacct.usage); sleep 5; cstop=$(cat /sys/fs/cgroup/cpu/cpuacct.usage); awk "BEGIN {if ((($cstop - $cstart) / 5000000000) < 0.01) exit 1}";
      # Check every 30s, and require 3 failures in a row to trigger the liveness killer
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 3
      # Must exceed the 5s sleep inside the probe command
      timeoutSeconds: 6