Here are four Prometheus rule files for alerting on pod failures in OpenShift or Kubernetes:
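- First:

```yaml
groups:
- name: pod.rules
  rules:
  - alert: Pod is failed
    # Note: $Node is a Grafana template variable. Prometheus does not substitute it,
    # so "^$Node$" is used as a literal regex and matches no hostname.
    expr: sum (rate (container_total_failed{image!="",name=~"^k8s_.*",kubernetes_io_hostname=~"^$Node$"}[1m])) by (pod_name)
    for: 1m
    labels:
      severity: warning
    annotations:
      description: The Kubelet pod has failed
      summary: Node status is Failed
```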
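- Second:

```yaml
groups:
- name: pod.rules
  rules:
  - alert: Pod is Failed
    # Note: cAdvisor's container_tasks_state reports the states sleeping, running,
    # stopped, uninterruptible and iowaiting, with no "failed" state and no "status"
    # label, so these two matchers exclude nothing.
    expr: sum(container_tasks_state{container_name!="POD",pod_name!="",state!="failed",status!="true"}) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      description: The Kubelet pod has failed
      summary: Node status is Failed
```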
- Third:

```yaml
groups:
- name: pod.rules
  rules:
  - alert: Pod is Failed
    expr: absent(((time() - container_last_seen{name=""}) < 5))
    for: 1m
    labels:
      severity: warning
    annotations:
      description: The Kubelet pod has failed
      summary: Node status is Failed
```
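- Fourth:

```yaml
groups:
- name: pod.rules
  rules:
  - alert: Pod is Failed
    # Aggregate by namespace and pod so that $labels.pod and $labels.namespace
    # are set for the description template; a plain sum() drops every label.
    expr: sum by (namespace, pod) (kube_pod_status_phase{phase="Failed"}) > 0
    for: 1m
    labels:
      severity: warning
    annotations:
      description: "{{$labels.pod}} in {{$labels.namespace}} is Failed!"
      summary: Pod Status is Failed
```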
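To see why the label matchers and aggregations in these expressions behave as they do, here is a small Python toy model of PromQL label matching and `sum` / `sum by`. It is a sketch under stated assumptions, not Prometheus itself: the sample series, pod names, namespaces and task counts are made up, and the helpers `matches`, `select` and `sum_by` are hypothetical. Only the `=`, `!=` and fully anchored `=~` matchers are modeled, and a missing label compares as an empty string, as it does in PromQL.

```python
import re
from collections import defaultdict

# Hypothetical sample data: (labels, value) pairs shaped like kube-state-metrics'
# kube_pod_status_phase. Namespaces and pod names are made up.
POD_PHASE = [
    ({"namespace": "shop", "pod": "cart-1", "phase": "Failed"}, 1.0),
    ({"namespace": "shop", "pod": "cart-1", "phase": "Running"}, 0.0),
    ({"namespace": "ops", "pod": "backup-7", "phase": "Failed"}, 1.0),
    ({"namespace": "ops", "pod": "backup-7", "phase": "Running"}, 0.0),
]

# Hypothetical container_tasks_state series for one pod: one series per cAdvisor
# task state. The task counts are made up.
TASKS_STATE = [
    ({"pod_name": "cart-1", "state": "sleeping"}, 12.0),
    ({"pod_name": "cart-1", "state": "running"}, 1.0),
    ({"pod_name": "cart-1", "state": "stopped"}, 0.0),
    ({"pod_name": "cart-1", "state": "uninterruptible"}, 0.0),
    ({"pod_name": "cart-1", "state": "iowaiting"}, 0.0),
]


def matches(labels, matchers):
    """Toy PromQL label matching for '=', '!=' and '=~' (not Prometheus code)."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label compares as ""
        if op == "=" and actual != value:
            return False
        if op == "!=" and actual == value:
            return False
        if op == "=~" and re.fullmatch(value, actual) is None:  # regexes are fully anchored
            return False
    return True


def select(series, matchers):
    """Keep only the series whose labels satisfy every matcher."""
    return [(labels, value) for labels, value in series if matches(labels, matchers)]


def sum_by(series, by):
    """Toy `sum by (...)`; by=[] behaves like a plain `sum(...)`."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple((name, labels[name]) for name in by if name in labels)
        groups[key] += value
    return dict(groups)


failed = select(POD_PHASE, [("phase", "=", "Failed")])
print(sum_by(failed, []))                    # one result with no pod/namespace labels
print(sum_by(failed, ["namespace", "pod"]))  # one result per failed pod

not_failed = select(TASKS_STATE, [("state", "!=", "failed"), ("status", "!=", "true")])
print(len(not_failed), sum_by(not_failed, []))  # every series is kept, so the sum is > 0

# "$Node" is not substituted outside Grafana, so the literal regex matches nothing.
print(select([({"kubernetes_io_hostname": "worker-1"}, 1.0)],
             [("kubernetes_io_hostname", "=~", "^$Node$")]))
```

In this model a plain `sum(...)` returns a single label-less result, so `{{$labels.pod}}` and `{{$labels.namespace}}` would render empty, while `sum by (namespace, pod)` keeps one result per pod. The `state!="failed"` and `status!="true"` matchers keep every task-state series, and the literal `^$Node$` regex selects nothing.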