"cpu_load1"
The 1-minute CPU load average per core.
Scope: cluster, node
expr:
cluster: sum(node_load1) / count(node_cpu{mode="system"})
node: sum(node_load1{instance=~"$node.*"}) / count(node_cpu{mode="system", instance=~"$node.*"})
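The per-core normalization used by the load expressions can be illustrated with plain arithmetic. The following is a minimal Python sketch, not part of the monitoring code; `load_per_core` and the node names are hypothetical. It assumes `node_cpu{mode="system"}` exposes one series per core, so counting those series yields the core count.

```python
def load_per_core(load_by_instance, cores_by_instance):
    """Mimic sum(node_load1) / count(node_cpu{mode="system"}).

    load_by_instance: load average reported by each node.
    cores_by_instance: number of cores on each node (one node_cpu
    series per core for a fixed mode).
    """
    total_cores = sum(cores_by_instance.values())
    if total_cores == 0:
        raise ValueError("no CPU cores reported")
    return sum(load_by_instance.values()) / total_cores


# Hypothetical two-node cluster: total load 3.0 spread over 6 cores.
loads = {"node-a": 2.0, "node-b": 1.0}
cores = {"node-a": 4, "node-b": 2}
print(load_per_core(loads, cores))  # 0.5
```

A value near or above 1.0 would suggest that, on average, every core has at least one runnable task waiting.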
CPULoad5
"_cpu_load5"
The 5-minute CPU load average per core.
Scope: cluster, node
expr:
cluster: sum(node_load5) / count(node_cpu{mode="system"})
node: sum(node_load5{instance=~"$node.*"}) / count(node_cpu{mode="system", instance=~"$node.*"})
CPULoad15
"_cpu_load15"
The 15-minute CPU load average per core.
Scope: cluster, node
expr:
cluster: sum(node_load15) / count(node_cpu{mode="system"})
node: sum(node_load15{instance=~"$node.*"}) / count(node_cpu{mode="system", instance=~"$node.*"})
CPUUsageSecondsSumRate
"_cpu_usage_seconds_sum_rate"
CPU time consumed per second. There are several CPU modes: user, system, nice, idle, iowait, guest, guest_nice, steal, soft_irq, and irq. To calculate CPU utilization by host in your Kubernetes cluster, sum all modes except idle, iowait, guest, and guest_nice.
Scope: cluster, node, workload, pod
expr:
cluster: sum(rate(node_cpu{mode!="idle", mode!="iowait", mode!~"^(?:guest.*)$"}[2m]))
node: sum(rate(node_cpu{mode!="idle", mode!="iowait", mode!~"^(?:guest.*)$", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$pod.*"}[2m]))
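To make the mode filtering concrete, here is a small Python sketch of the cluster and node expressions. It is a simplification under stated assumptions: `busy_cpu_rate` and the sample counter values are hypothetical, and the rate is computed as a plain difference over elapsed time, whereas Prometheus `rate()` also handles counter resets and extrapolation.

```python
import re

# Same exclusions as the expression: idle, iowait, and any mode starting
# with "guest" (guest and guest_nice). fullmatch mirrors the fully
# anchored matching that Prometheus applies to label regexes.
EXCLUDED_MODES = re.compile(r"idle|iowait|guest.*")


def busy_cpu_rate(first, last, elapsed_seconds):
    """Sum per-second CPU time over all modes that are not excluded.

    first, last: dicts of mode -> cumulative CPU seconds at the start
    and end of the window.
    """
    busy = 0.0
    for mode, end_value in last.items():
        if EXCLUDED_MODES.fullmatch(mode):
            continue
        busy += (end_value - first.get(mode, 0.0)) / elapsed_seconds
    return busy


# Hypothetical counters sampled 120 seconds apart (a 2m window).
first = {"user": 100.0, "system": 50.0, "idle": 1000.0,
         "iowait": 10.0, "guest": 0.0, "guest_nice": 0.0}
last = {"user": 160.0, "system": 80.0, "idle": 1030.0,
        "iowait": 12.0, "guest": 5.0, "guest_nice": 1.0}
print(busy_cpu_rate(first, last, 120.0))  # 0.75
```

Only user (60 s) and system (30 s) contribute here, giving 90 CPU-seconds over 120 seconds, or 0.75 cores busy on average.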
CPUUserSecondsSumRate
"_cpu_user_seconds_sum_rate"
CPU time in user mode per second.
Scope: cluster, node, workload, pod
expr:
cluster: sum(rate(node_cpu{mode="user"}[2m]))
node: sum(rate(node_cpu{mode="user", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$pod.*"}[2m]))
CPUSystemSecondsSumRate
"_cpu_system_seconds_sum_rate"
CPU time in system mode per second.
Scope: cluster, node, workload, pod
expr:
cluster: sum(rate(node_cpu{mode="system"}[2m]))
node: sum(rate(node_cpu{mode="system", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$pod.*"}[2m]))
CPUCfsThrottledSecondsSumRate
"_cpu_cfs_throttled_seconds_sum_rate"
Scope: container
If you have defined an upper limit for CPU usage and a container exceeds it, Linux throttles the container and records the amount of time it was throttled.
expr:
container: sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="%s",pod_name=~"%s.*"}[2m]))
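Several expressions in this list use `%s` placeholders instead of `$namespace` and `$pod`. The source does not say how they are filled, so the following Python sketch is only an assumption: it treats them as printf-style placeholders, with the namespace first and the pod name prefix second. `render` and the sample values are hypothetical.

```python
# Template copied from the container expression above.
THROTTLED_TEMPLATE = (
    'sum(rate(container_cpu_cfs_throttled_seconds_total'
    '{namespace="%s",pod_name=~"%s.*"}[2m]))'
)


def render(template, *values):
    """Fill printf-style %s placeholders in order."""
    return template % values


# Assumed order: namespace, then pod name prefix.
query = render(THROTTLED_TEMPLATE, "default", "web")
print(query)
```

With these inputs the pod matcher becomes `pod_name=~"web.*"`, which selects every pod whose name starts with `web`.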
Disk IO
DiskIOReadsBytesSumRate
"_disk_io_reads_bytes_sum_rate"
Bytes read from disk per second.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_disk_bytes_read[2m]))
node: sum ( rate(node_disk_bytes_read{instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_fs_reads_bytes_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
DiskIOWritesBytesSumRate
"_disk_io_writes_bytes_sum_rate"
Bytes written to disk per second.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_disk_bytes_written[2m]))
node: sum ( rate(node_disk_bytes_written{instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_fs_writes_bytes_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
FileSystem
FsByteSum
"_fs_byte_sum"
Disk space used, in bytes.
Scope: workload, pod, container
expr:
workload/pod: sum(container_fs_usage_bytes{namespace="%s", pod_name=~"%s.*"})
container: sum(container_fs_usage_bytes{namespace="%s", container_name=~"%s"})
FsUsagePercent
"_fs_usage_percent"
Fraction of disk capacity in use on the root mount point.
Scope: cluster, node
expr:
cluster: (sum(node_filesystem_size{mountpoint="/"}) - sum(node_filesystem_free{mountpoint="/"})) / sum(node_filesystem_size{mountpoint="/"})
node: (sum(node_filesystem_size{mountpoint="/", instance=~"%s.*"}) - sum(node_filesystem_free{mountpoint="/", instance=~"%s.*"})) / sum(node_filesystem_size{mountpoint="/", instance=~"%s.*"})
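The filesystem expressions reduce to (size - free) / size over the root mount point. The Python sketch below illustrates that arithmetic; `fs_usage_ratio` and the byte values are hypothetical. Note that the expression as written yields a ratio between 0 and 1, so a display layer would presumably multiply by 100 to show a percentage.

```python
def fs_usage_ratio(size_bytes, free_bytes):
    """Mimic (sum(size) - sum(free)) / sum(size) across filesystems.

    size_bytes, free_bytes: per-node values for the "/" mount point.
    """
    total = sum(size_bytes)
    if total == 0:
        raise ValueError("no filesystem size reported")
    return (total - sum(free_bytes)) / total


# Hypothetical two nodes: 400 units of capacity, 200 free.
print(fs_usage_ratio([100, 300], [50, 150]))  # 0.5
```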
Memory
MemoryUsagePercent
"_memory_usage_percent"
Percentage of memory in use. Used memory does not include free memory, buffers, or cached disk pages (cached).
Scope: cluster, node, workload, pod
expr:
cluster: 1 - sum(node_memory_MemAvailable) / sum(node_memory_MemTotal)
node: 1 - sum(node_memory_MemAvailable{instance=~"$node.*"}) / sum(node_memory_MemTotal{instance=~"$node.*"})
workload/pod: sum(container_memory_working_set_bytes{namespace="$namespace", pod_name=~"$pod.*"}) / sum(label_join(kube_pod_container_resource_limits_memory_bytes{namespace="$namespace", pod=~"$pod.*"},
MemoryUsageBytesSum
"_memory_usage_bytes_sum"
Current memory in the container's working set, not including cache and buffers. This is the value the OOM killer watches.
Scope: workload, pod, container
expr:
workload/pod: sum(container_memory_working_set_bytes{name!~"POD", namespace="%s",pod_name=~"%s.*"})
MemoryPageOutBytesSumRate
"_memory_page_out_bytes_sum_rate"
Bytes paged out per second. When there is too much demand on the memory system, the operating system pages out memory pages that have not been recently used.
Scope: cluster, node
expr:
cluster: 1e3 * sum((rate(node_vmstat_pgpgout[2m])))
node: 1e3 * sum((rate(node_vmstat_pgpgout{instance=~"$node.*"}[2m])))
MemoryPageInBytesSumRate
"_memory_page_in_bytes_sum_rate"
Bytes paged in per second. When a running process requests a page that is not currently in memory, the vhand daemon brings its pages into memory.
Scope: cluster, node
expr:
cluster: 1e3 * sum((rate(node_vmstat_pgpgin[2m])))
node: 1e3 * sum((rate(node_vmstat_pgpgin{instance=~"$node.*"}[2m])))
Network
NetworkReceiveBytesSumRate
"_network_receive_bytes_sum_rate"
Bytes received per second across all network interfaces.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_network_receive_bytes{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
node: sum ( rate(node_network_receive_bytes{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_network_receive_bytes_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
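The `device!~` matcher in the cluster and node expressions drops loopback and virtual interfaces so that traffic is not counted twice. Prometheus anchors label regexes at both ends, which Python's `re.fullmatch` reproduces. In the sketch below, `is_counted` and the device names are hypothetical illustrations.

```python
import re

# Same pattern as the device!~ matcher in the expressions above.
VIRTUAL_DEVICES = re.compile(r"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*")


def is_counted(device):
    """Return True if the interface survives the device!~ filter."""
    return VIRTUAL_DEVICES.fullmatch(device) is None


for name in ["eth0", "lo", "veth1a2b", "docker0", "flannel.1", "lo0"]:
    print(name, is_counted(name))
```

Because the match is anchored, `lo` is excluded but a device named `lo0` would still be counted, while the `veth.*` style alternatives exclude any name with that prefix.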
NetworkReceivePacketsDroppedSumRate
"_network_receive_packets_dropped_sum_rate"
Number of packets per second that enter the network stack of the host and are dropped before the application receives them.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_network_receive_drop{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
node: sum ( rate(node_network_receive_drop{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
NetworkReceiveErrorsSumRate
"_network_receive_errors_sum_rate"
Number of packets received with errors per second. This includes too-long-frame errors, ring-buffer overflow errors, CRC errors, frame alignment errors, FIFO overruns, and missed packets.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_network_receive_errs{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
node: sum ( rate(node_network_receive_errs{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_network_receive_errors_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
NetworkReceivePacketsSumRate
"_network_receive_packets_sum_rate"
Number of packets received per second.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_network_receive_packets{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
node: sum ( rate(node_network_receive_packets{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_network_receive_packets_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
NetworkTransmitBytesSumRate
"_network_transmit_bytes_sum_rate"
Bytes transmitted per second across all network interfaces.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_network_transmit_bytes{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
node: sum ( rate(node_network_transmit_bytes{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_network_transmit_bytes_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
NetworkTransmitPacketsDroppedSumRate
"_network_transmit_packets_dropped_sum_rate"
Number of packets dropped by the output queue per second.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_network_transmit_drop{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
node: sum ( rate(node_network_transmit_drop{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
NetworkTransmitErrorsSumRate
"_network_transmit_errors_sum_rate"
Number of packets transmitted with errors per second. This is the sum of all errors encountered while transmitting packets, including aborted transmissions, carrier errors, FIFO errors, heartbeat errors, and window errors.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_network_transmit_errs{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
node: sum ( rate(node_network_transmit_errs{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_network_transmit_errors_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
NetworkTransmitPacketsSumRate
"_network_transmit_packets_sum_rate"
Number of packets transmitted per second.
Scope: cluster, node, workload, pod
expr:
cluster: sum ( rate(node_network_transmit_packets{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
node: sum ( rate(node_network_transmit_packets{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
workload/pod: sum(rate(container_network_transmit_packets_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))