buzztaiki/exec_probe_and_timeout_memo.md

## exec_probe_and_timeout_memo.md

### exec probe and timeout memo

Given the following Pod manifest:

---
apiVersion: v1
kind: Pod
metadata:
  name: myapp  # added so the manifest can be applied as-is
  labels:
    app: myapp
spec:
  containers:
    - name: myapp
      image: busybox
      command: [tail, -f, /dev/null]
      livenessProbe:
        exec:
          command: [sleep, "5"]  # always runs longer than timeoutSeconds
        timeoutSeconds: 2
        periodSeconds: 10
        failureThreshold: 3
  terminationGracePeriodSeconds: 0

The events look like this, and the container is never restarted:
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Pulling    4m47s                 kubelet            Pulling image "busybox"
  Normal   Pulled     4m45s                 kubelet            Successfully pulled image "busybox" in 1.958676639s
  Normal   Created    4m45s                 kubelet            Created container myapp
  Normal   Started    4m45s                 kubelet            Started container myapp
  Warning  Unhealthy  72s (x21 over 4m32s)  kubelet            Liveness probe errored: Rpc error: code = Unknown desc = deadline exceeded ("DeadlineExceeded"): context deadline exceeded

Running ps inside the container shows up to about three sleep processes alive at once. Why?
#### The normal case

Changing the probe command to ["false"] gives the following events, and the container restarts normally:
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Pulled     2m7s                kubelet            Successfully pulled image "busybox" in 1.866816854s
  Normal   Pulled     95s                 kubelet            Successfully pulled image "busybox" in 1.943729073s
  Normal   Created    65s (x3 over 2m7s)  kubelet            Created container myapp
  Normal   Started    65s (x3 over 2m7s)  kubelet            Started container myapp
  Normal   Pulled     65s                 kubelet            Successfully pulled image "busybox" in 1.865027888s
  Warning  Unhealthy  39s (x9 over 119s)  kubelet            Liveness probe failed:
  Normal   Killing    39s (x3 over 99s)   kubelet            Container myapp failed liveness probe, will be restarted
  Normal   Pulling    37s (x4 over 2m9s)  kubelet            Pulling image "busybox"

#### Reading the source

With the ExecProbeTimeout feature gate enabled, a timeout looks like it should count as a failure and lead to a restart, though.
https://github.com/kubernetes/kubernetes/blob/7f8be71148f5461df9ae61b011c732d0ba2f551c/pkg/probe/exec/exec.go#L74-L77
			if utilfeature.DefaultFeatureGate.Enabled(features.ExecProbeTimeout) {
				// When exec probe timeout, data is empty, so we should return timeoutErr.Error() as the stdout.
				return probe.Failure, timeoutErr.Error(), nil
			}

https://github.com/kubernetes/kubernetes/blob/a12b886b1da059e0190c54d09c5eab5219dd7acf/pkg/features/kube_features.go#L939
	ExecProbeTimeout:                               {Default: true, PreRelease: featuregate.GA}, // lock to default and remove after v1.22 based on KEP #1972 update

Whether the event says errored or failed, the prober returns a failure either way.
https://github.com/kubernetes/kubernetes/blob/2f2240400391add53983c9c04cb91ec8a8df5c67/pkg/kubelet/prober/prober.go#L105-L111
		if err != nil {
			klog.V(1).ErrorS(err, "Probe errored", "probeType", probeType, "pod", klog.KObj(pod), "podUID", pod.UID, "containerName", container.Name)
			pb.recordContainerEvent(pod, &container, v1.EventTypeWarning, events.ContainerUnhealthy, "%s probe errored: %v", probeType, err)
		} else { // result != probe.Success
			klog.V(1).InfoS("Probe failed", "probeType", probeType, "pod", klog.KObj(pod), "podUID", pod.UID, "containerName", container.Name, "probeResult", result, "output", output)
			pb.recordContainerEvent(pod, &container, v1.EventTypeWarning, events.ContainerUnhealthy, "%s probe failed: %s", probeType, output)
		}

My hunch is that somewhere there is code that skips the restart while a probe process is still around.

#### KEP and implementation PR


kubernetes/enhancements#1972
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1972-kubelet-exec-probe-timeouts

It says the following. Is this why the processes are left behind?

Non-Goals

ensuring exec processes that timed out have been killed by kubelet.
introducing CRI errors for handling scenarios such as time ou


PR

kubernetes/kubernetes#94115