With the following Pod manifest:
---
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: myapp
spec:
  containers:
  - name: myapp
    image: busybox
    command: [tail, -f, /dev/null]
    livenessProbe:
      exec:
        command: [sleep, 5]
      timeoutSeconds: 2
      periodSeconds: 10
      failureThreshold: 3
  terminationGracePeriodSeconds: 0
the events look like this, and the container is never restarted:
Events:
  Type     Reason     Age                   From     Message
  ----     ------     ----                  ----     -------
  Normal   Pulling    4m47s                 kubelet  Pulling image "busybox"
  Normal   Pulled     4m45s                 kubelet  Successfully pulled image "busybox" in 1.958676639s
  Normal   Created    4m45s                 kubelet  Created container myapp
  Normal   Started    4m45s                 kubelet  Started container myapp
  Warning  Unhealthy  72s (x21 over 4m32s)  kubelet  Liveness probe errored: Rpc error: code = Unknown desc = deadline exceeded ("DeadlineExceeded"): context deadline exceeded
Running ps inside the container shows up to about three sleep processes alive at the same time. Why?
Changing the probe command to ["false"] produces the following events, and the container restarts normally:
Events:
  Type     Reason     Age                 From     Message
  ----     ------     ----                ----     -------
  Normal   Pulled     2m7s                kubelet  Successfully pulled image "busybox" in 1.866816854s
  Normal   Pulled     95s                 kubelet  Successfully pulled image "busybox" in 1.943729073s
  Normal   Created    65s (x3 over 2m7s)  kubelet  Created container myapp
  Normal   Started    65s (x3 over 2m7s)  kubelet  Started container myapp
  Normal   Pulled     65s                 kubelet  Successfully pulled image "busybox" in 1.865027888s
  Warning  Unhealthy  39s (x9 over 119s)  kubelet  Liveness probe failed:
  Normal   Killing    39s (x3 over 99s)   kubelet  Container myapp failed liveness probe, will be restarted
  Normal   Pulling    37s (x4 over 2m9s)  kubelet  Pulling image "busybox"
If the ExecProbeTimeout feature gate is enabled, I would expect the timeout to be treated as a failure and the container to be restarted.
if utilfeature.DefaultFeatureGate.Enabled(features.ExecProbeTimeout) {
	// When exec probe timeout, data is empty, so we should return timeoutErr.Error() as the stdout.
	return probe.Failure, timeoutErr.Error(), nil
}
ExecProbeTimeout: {Default: true, PreRelease: featuregate.GA}, // lock to default and remove after v1.22 based on KEP #1972 update
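The snippet above only returns probe.Failure when the error is available as timeoutErr, so it may matter which kind of error actually reaches that code. The following is a minimal sketch of that idea, not the kubelet implementation: TimeoutError, classify, errOpaque and the Result constants are hypothetical names made up for illustration. It assumes that a timeout not recognised as the dedicated timeout type falls through and is reported as an error rather than as a failure.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Result is an illustrative stand-in for a probe result (not kubelet's type).
type Result string

const (
	Success Result = "success"
	Failure Result = "failure"
	Unknown Result = "unknown"
)

// TimeoutError is a hypothetical error type marking a recognised exec timeout.
type TimeoutError struct{ After time.Duration }

func (e *TimeoutError) Error() string {
	return fmt.Sprintf("command timed out after %s", e.After)
}

// errOpaque models a timeout that arrives as a generic error instead.
var errOpaque = errors.New("rpc error: deadline exceeded")

// classify maps the error from running a probe command to a probe result.
// Assumption: only a recognised TimeoutError with the gate enabled becomes
// a plain Failure with a nil error; any other error is passed through.
func classify(err error, gateEnabled bool) (Result, string, error) {
	if err == nil {
		return Success, "", nil
	}
	var timeoutErr *TimeoutError
	if errors.As(err, &timeoutErr) && gateEnabled {
		return Failure, timeoutErr.Error(), nil
	}
	return Unknown, "", err
}

func main() {
	inputs := []error{nil, &TimeoutError{After: 2 * time.Second}, errOpaque}
	for _, in := range inputs {
		res, out, err := classify(in, true)
		fmt.Printf("in=%v result=%s output=%q err=%v\n", in, res, out, err)
	}
}
```

Under these assumptions, the opaque error ends up on the "errored" path even though the gate is enabled, which would be consistent with the "Liveness probe errored" events above. Whether this is what really happens in the kubelet is not confirmed here.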
Whether the probe "errored" or "failed", a failure result is returned either way: https://github.com/kubernetes/kubernetes/blob/2f2240400391add53983c9c04cb91ec8a8df5c67/pkg/kubelet/prober/prober.go#L105-L111
if err != nil {
	klog.V(1).ErrorS(err, "Probe errored", "probeType", probeType, "pod", klog.KObj(pod), "podUID", pod.UID, "containerName", container.Name)
	pb.recordContainerEvent(pod, &container, v1.EventTypeWarning, events.ContainerUnhealthy, "%s probe errored: %v", probeType, err)
} else { // result != probe.Success
	klog.V(1).InfoS("Probe failed", "probeType", probeType, "pod", klog.KObj(pod), "podUID", pod.UID, "containerName", container.Name, "probeResult", result, "output", output)
	pb.recordContainerEvent(pod, &container, v1.EventTypeWarning, events.ContainerUnhealthy, "%s probe failed: %s", probeType, output)
}
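One hypothesis that would fit both event logs is that the caller of this function treats "errored" results differently from "failed" ones when counting towards failureThreshold. The sketch below is a toy model of that counting, not kubelet code: probeOutcome, countRestarts, repeat, errProbe and the discardErrored switch are all hypothetical and exist only to show the effect.

```go
package main

import (
	"errors"
	"fmt"
)

// probeOutcome is one probe run in this toy model: failed means a
// non-success result, err is non-nil when the probe itself errored.
type probeOutcome struct {
	failed bool
	err    error
}

// errProbe stands in for the "deadline exceeded" error seen in the events.
var errProbe = errors.New("context deadline exceeded")

// repeat builds a sequence of n identical outcomes.
func repeat(o probeOutcome, n int) []probeOutcome {
	out := make([]probeOutcome, n)
	for i := range out {
		out[i] = o
	}
	return out
}

// countRestarts counts how many restarts a sequence of outcomes triggers.
// discardErrored is the hypothetical behaviour under test: when true, an
// errored probe is thrown away and neither grows nor resets the counter.
func countRestarts(outcomes []probeOutcome, failureThreshold int, discardErrored bool) int {
	restarts, consecutive := 0, 0
	for _, o := range outcomes {
		if o.err != nil && discardErrored {
			continue
		}
		if o.failed || o.err != nil {
			consecutive++
		} else {
			consecutive = 0
		}
		if consecutive >= failureThreshold {
			restarts++
			consecutive = 0
		}
	}
	return restarts
}

func main() {
	failed := repeat(probeOutcome{failed: true}, 9)
	errored := repeat(probeOutcome{err: errProbe}, 21)

	fmt.Println("9 failed probes, errors discarded: ", countRestarts(failed, 3, true))
	fmt.Println("21 errored probes, errors discarded:", countRestarts(errored, 3, true))
	fmt.Println("21 errored probes, errors counted:  ", countRestarts(errored, 3, false))
}
```

With failureThreshold 3, nine "failed" probes give three restarts, matching the x9 Unhealthy / x3 Killing events, while 21 "errored" probes give zero restarts only if errored results are discarded. This is a model that matches the observation, not proof of where the kubelet does it.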
My hunch is that somewhere there is an implementation that does not restart the container while a probe process is still running.
- kubernetes/enhancements#1972
- https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1972-kubelet-exec-probe-timeouts
The KEP says the following. Is this the reason the processes are left behind?
Non-Goals
- ensuring exec processes that timed out have been killed by kubelet.
- introducing CRI errors for handling scenarios such as time ou
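To make the first non-goal concrete: a caller can stop waiting for a command on timeout without doing anything to the command itself. The following is a small self-contained sketch of that pattern using goroutines in place of real processes. tracker, probe and leftover are hypothetical names, and the simulated command blocks until released instead of sleeping for 5 seconds, so it illustrates how leftovers can pile up rather than reproducing the exact count seen with ps.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// tracker counts simulated probe commands that are still running.
type tracker struct {
	running int32
	wg      sync.WaitGroup
}

// probe starts a simulated command that blocks until release is closed and
// waits at most timeout for it. On timeout it only stops waiting; the
// command is left running, as a caller that does not guarantee cleanup would.
func (t *tracker) probe(release <-chan struct{}, timeout time.Duration) bool {
	done := make(chan struct{})
	atomic.AddInt32(&t.running, 1)
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		<-release // the "command" keeps running until released
		atomic.AddInt32(&t.running, -1)
		close(done)
	}()
	select {
	case <-done:
		return false
	case <-time.After(timeout):
		return true // timed out, command abandoned
	}
}

// leftover runs n probes that all time out and reports how many commands
// are still running before and after they are finally allowed to exit.
func leftover(n int) (before, after int32) {
	var t tracker
	release := make(chan struct{})
	for i := 0; i < n; i++ {
		t.probe(release, 10*time.Millisecond)
	}
	before = atomic.LoadInt32(&t.running)
	close(release)
	t.wg.Wait()
	after = atomic.LoadInt32(&t.running)
	return before, after
}

func main() {
	before, after := leftover(3)
	fmt.Printf("still running after 3 timed-out probes: %d, after the commands exit: %d\n", before, after)
}
```

In this model every timed-out probe leaves one command behind until it exits on its own, which is the kind of behaviour the non-goal appears to allow.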
PR