@aiwantaozi
Last active December 20, 2018 23:42
Cluster/Node/Workload/Pod Metric

CPU

CPULoad1

"cpu_load1"
The 1-minute CPU load average per core.
Scope: cluster, node
expr: 
    cluster: sum(node_load1) / count(node_cpu{mode="system"})
    node: sum(node_load1{instance=~"$node.*"}) / count(node_cpu{mode="system", instance=~"$node.*"})
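
As a rough illustration of the normalization in these expressions, the Python sketch below divides the summed load averages by the total number of cores. The function name and the sample values are hypothetical; in the expressions, count(node_cpu{mode="system"}) plays the role of the core count because it returns one series per core.

```python
def load_per_core(load_by_node, cores_by_node):
    """Sum the load averages and divide by the total core count."""
    total_load = sum(load_by_node.values())
    total_cores = sum(cores_by_node.values())
    if total_cores == 0:
        raise ValueError("no cores reported")
    return total_load / total_cores


# Hypothetical sample: two nodes with 4 and 8 cores.
loads = {"node-a": 2.0, "node-b": 4.0}
cores = {"node-a": 4, "node-b": 8}
cluster_load = load_per_core(loads, cores)
```

A value near or above 1.0 would suggest that, on average, every core has at least one runnable task waiting.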

CPULoad5

"_cpu_load5"
The 5-minute CPU load average per core.
Scope: cluster, node
expr: 
    cluster: sum(node_load5) / count(node_cpu{mode="system"})
    node: sum(node_load5{instance=~"$node.*"}) / count(node_cpu{mode="system", instance=~"$node.*"})

CPULoad15

"_cpu_load15"
The 15-minute CPU load average per core.
Scope: cluster, node
expr: 
    cluster: sum(node_load15) / count(node_cpu{mode="system"})
    node: sum(node_load15{instance=~"$node.*"}) / count(node_cpu{mode="system", instance=~"$node.*"})

CPUUsageSecondsSumRate

"_cpu_usage_seconds_sum_rate"
CPU time consumed per second. There are several CPU modes: user, system, nice, idle, iowait, guest, guest_nice, steal, softirq and irq. To calculate CPU utilization by host in your Kubernetes cluster, sum all modes except idle, iowait, guest, and guest_nice.
Scope: cluster, node, workload, pod
expr: 
    cluster: sum(rate(node_cpu{mode!="idle", mode!="iowait", mode!~"^(?:guest.*)$"}[2m]))
    node: sum(rate(node_cpu{mode!="idle", mode!="iowait", mode!~"^(?:guest.*)$", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_cpu_usage_seconds_total{namespace="$namespace",pod_name=~"$pod.*"}[2m]))
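
The mode filtering can be sketched in Python as follows. This is a simplified approximation, not how Prometheus computes rate(): it uses only two samples per counter and ignores counter resets and extrapolation. The function name and the sample counters are hypothetical.

```python
import re

# Same exclusions as the expressions: idle, iowait, and any guest* mode.
EXCLUDED_MODES = re.compile(r"idle|iowait|guest.*")


def busy_cpu_rate(first, last, interval_seconds):
    """Approximate sum(rate(...)) over the non-excluded CPU modes."""
    total = 0.0
    for mode, end_value in last.items():
        if EXCLUDED_MODES.fullmatch(mode):
            continue
        total += (end_value - first[mode]) / interval_seconds
    return total


# Hypothetical per-mode CPU-seconds counters sampled 120 seconds apart.
first = {"user": 100.0, "system": 50.0, "idle": 1000.0, "iowait": 5.0, "guest": 0.0}
last = {"user": 130.0, "system": 65.0, "idle": 1070.0, "iowait": 8.0, "guest": 2.0}
rate = busy_cpu_rate(first, last, 120)
```

Here only the user and system counters contribute, so the result is (30 + 15) / 120 CPU-seconds per second.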

CPUUserSecondsSumRate

"_cpu_user_seconds_sum_rate"
CPU time in user mode per second.
Scope: cluster, node, workload, pod 
expr: 
    cluster: sum(rate(node_cpu{mode="user"}[2m]))
    node: sum(rate(node_cpu{mode="user", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_cpu_user_seconds_total{namespace="$namespace",pod_name=~"$pod.*"}[2m]))

CPUSystemSecondsSumRate

"_cpu_system_seconds_sum_rate"
CPU time in system mode per second.
Scope: cluster, node, workload, pod 
expr: 
    cluster: sum(rate(node_cpu{mode="system"}[2m]))
    node: sum(rate(node_cpu{mode="system", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_cpu_system_seconds_total{namespace="$namespace",pod_name=~"$pod.*"}[2m]))

CPUCfsThrottledSecondsSumRate

"_cpu_cfs_throttled_seconds_sum_rate"
Time spent throttled per second. If a container has a CPU limit and exceeds it, the Linux runtime will “throttle” the container and record how long it was throttled.
Scope: container
expr: 
    container: sum(rate(container_cpu_cfs_throttled_seconds_total{namespace="%s",pod_name=~"%s.*"}[2m]))
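
The "%s" markers in this expression look like printf-style placeholders for the namespace and the pod name prefix. Assuming that reading, a minimal Python sketch of filling them in might look like this; the helper name and the example values are hypothetical.

```python
THROTTLED_TEMPLATE = (
    'sum(rate(container_cpu_cfs_throttled_seconds_total'
    '{namespace="%s",pod_name=~"%s.*"}[2m]))'
)


def throttled_expr(namespace, pod_prefix):
    """Substitute the namespace and pod name prefix into the template."""
    return THROTTLED_TEMPLATE % (namespace, pod_prefix)


# Hypothetical namespace and pod prefix.
expr = throttled_expr("default", "web")
```

Because the pod name ends up inside a regex matcher, a prefix containing regex metacharacters would need escaping before substitution; this sketch does not attempt that.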

Disk IO

DiskIOReadsBytesSumRate

"_disk_io_reads_bytes_sum_rate"
Bytes read from disk per second.
Scope: cluster, node, workload, pod
expr: 
    cluster: sum ( rate(node_disk_bytes_read[2m]))
    node: sum ( rate(node_disk_bytes_read{instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_fs_reads_bytes_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))

DiskIOWritesBytesSumRate

"_disk_io_writes_bytes_sum_rate"
Bytes written to disk per second.
Scope: cluster, node, workload, pod
expr:
    cluster: sum ( rate(node_disk_bytes_written[2m]))
    node: sum ( rate(node_disk_bytes_written{instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_fs_writes_bytes_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))

FileSystem

FsByteSum

"_fs_byte_sum"
Disk space used, in bytes.
Scope: workload, pod, container
expr:
    workload/pod: sum(container_fs_usage_bytes{namespace="%s", pod_name=~"%s.*"})
    container: sum(container_fs_usage_bytes{namespace="%s", container_name=~"%s"})

FsUsagePercent

"_fs_usage_percent"
Percentage of used disk capacity.
Scope: cluster, node
expr: 
    cluster: (sum(node_filesystem_size{mountpoint="/"}) - sum(node_filesystem_free{mountpoint="/"}))  / sum(node_filesystem_size{mountpoint="/"})
    node: (sum(node_filesystem_size{mountpoint="/", instance=~"%s.*"}) - sum(node_filesystem_free{mountpoint="/", instance=~"%s.*"}))  / sum(node_filesystem_size{mountpoint="/", instance=~"%s.*"})
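
The arithmetic behind these expressions can be sketched in Python as below. Note that, as written, the result is a fraction between 0 and 1 rather than a value scaled to 100. The function name and the byte counts are hypothetical.

```python
def fs_usage_fraction(size_bytes, free_bytes):
    """Return (total size - total free) / total size across filesystems."""
    total_size = sum(size_bytes)
    total_free = sum(free_bytes)
    if total_size == 0:
        raise ValueError("no filesystem size reported")
    return (total_size - total_free) / total_size


# Hypothetical root filesystems on two nodes: 100 GB and 200 GB.
sizes = [100 * 10**9, 200 * 10**9]
frees = [25 * 10**9, 50 * 10**9]
fraction = fs_usage_fraction(sizes, frees)
```

Multiplying the result by 100 would give a percentage for display.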

Memory

MemoryUsagePercent

"_memory_usage_percent"
Percentage of memory in use. Used memory excludes free memory, buffers, and memory used to cache disk pages (cached).
Scope: cluster, node, workload, pod
expr: 
    cluster: 1 - sum(node_memory_MemAvailable) / sum(node_memory_MemTotal)
    node: 1 - sum(node_memory_MemAvailable{instance=~"$node.*"}) / sum(node_memory_MemTotal{instance=~"$node.*"})
    workload/pod: sum(container_memory_working_set_bytes{namespace="$namespace", pod_name=~"$pod.*"}) / sum(label_join(kube_pod_container_resource_limits_memory_bytes{namespace="$namespace", pod=~"$pod.*"},

MemoryUsageBytesSum

"_memory_usage_bytes_sum"
Current memory in the container's working set, excluding cache and buffers. This is the value the OOM killer watches.
Scope: workload, pod, container
expr: 
    workload/pod: sum(container_memory_working_set_bytes{name!~"POD", namespace="%s",pod_name=~"%s.*"})

MemoryPageOutBytesSumRate

"_memory_page_out_bytes_sum_rate"
Bytes paged out per second. If there are too many demands on the memory system, the operating system pages out memory pages that have not been used recently.
Scope: cluster, node
expr:
    cluster: 1e3 * sum((rate(node_vmstat_pgpgout[2m])))
    node: 1e3 * sum((rate(node_vmstat_pgpgout{instance=~"$node.*"}[2m])))
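
The 1e3 factor suggests that the underlying counter is treated as kilobyte-granular and scaled to bytes. Under that assumption, which the source does not state explicitly, the calculation can be approximated in Python from two counter samples. The function name and the sample values are hypothetical, and counter resets are ignored.

```python
def page_out_bytes_per_second(first_kb, last_kb, interval_seconds):
    """Scale a kilobyte counter delta to bytes, then divide by the interval."""
    return 1e3 * (last_kb - first_kb) / interval_seconds


# Hypothetical counter samples taken 120 seconds apart.
rate = page_out_bytes_per_second(1200, 2400, 120)
```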

MemoryPageInBytesSumRate

"_memory_page_in_bytes_sum_rate"
Bytes paged in per second. When a running process requests a page that is not currently in memory, the vhand daemon brings its pages back into memory.
Scope: cluster, node
expr:
    cluster: 1e3 * sum((rate(node_vmstat_pgpgin[2m])))
    node: 1e3 * sum((rate(node_vmstat_pgpgin{instance=~"$node.*"}[2m])))

Network

NetworkReceiveBytesSumRate

"_network_receive_bytes_sum_rate"
Bytes received per second across all network interfaces.
Scope: cluster, node, workload, pod
expr:
    cluster: sum ( rate(node_network_receive_bytes{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
    node: sum ( rate(node_network_receive_bytes{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_network_receive_bytes_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))
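
The device filter used by the cluster and node expressions can be checked with a short Python sketch. Prometheus anchors label regexes at both ends, so re.fullmatch is used here to mirror that behavior. The device names in the sample list are hypothetical.

```python
import re

# The pattern from the device!~"..." matcher in the expressions.
EXCLUDED_DEVICES = re.compile(r"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*")


def is_counted(device):
    """Return True when the device passes the device!~ filter."""
    return EXCLUDED_DEVICES.fullmatch(device) is None


# Hypothetical interface names found on a node.
devices = ["eth0", "lo", "veth12ab", "docker0", "flannel.1", "cali1234", "cbr0", "lo0", "ens5"]
counted = [d for d in devices if is_counted(d)]
```

Because the match is anchored, a device named "lo0" is still counted: the "lo" alternative only excludes the exact name "lo".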

NetworkReceivePacketsDroppedSumRate

"_network_receive_packets_dropped_sum_rate"
Number of packets per second that enter the network stack but are dropped before the application receives them.
Scope: cluster, node, workload, pod
expr:
    cluster: sum ( rate(node_network_receive_drop{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
    node: sum ( rate(node_network_receive_drop{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_network_receive_packets_dropped_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))

NetworkReceiveErrorsSumRate

"_network_receive_errors_sum_rate"
Number of packets received with errors per second. This includes too-long-frame errors, ring-buffer overflow errors, CRC errors, frame alignment errors, FIFO overruns, and missed packets.
Scope: cluster, node, workload, pod
expr:
    cluster: sum ( rate(node_network_receive_errs{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
    node: sum ( rate(node_network_receive_errs{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_network_receive_errors_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))

NetworkReceivePacketsSumRate

"_network_receive_packets_sum_rate"
Number of packets received per second.
Scope: cluster, node, workload, pod
expr:
    cluster: sum ( rate(node_network_receive_packets{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
    node: sum ( rate(node_network_receive_packets{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_network_receive_packets_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))

NetworkTransmitBytesSumRate

"_network_transmit_bytes_sum_rate"
Bytes transmitted per second across all network interfaces.
Scope: cluster, node, workload, pod
expr:
    cluster: sum ( rate(node_network_transmit_bytes{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
    node: sum ( rate(node_network_transmit_bytes{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_network_transmit_bytes_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))

NetworkTransmitPacketsDroppedSumRate

"_network_transmit_packets_dropped_sum_rate"
Number of packets dropped by the output queue per second.
Scope: cluster, node, workload, pod    
expr:
    cluster: sum ( rate(node_network_transmit_drop{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
    node: sum ( rate(node_network_transmit_drop{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_network_transmit_packets_dropped_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))

NetworkTransmitErrorsSumRate

"_network_transmit_errors_sum_rate"
Number of packets transmitted with errors per second. This is the sum of errors encountered while transmitting packets, including aborted transmissions, carrier errors, FIFO errors, heartbeat errors, and window errors.
Scope: cluster, node, workload, pod
expr:
    cluster: sum ( rate(node_network_transmit_errs{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
    node: sum ( rate(node_network_transmit_errs{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_network_transmit_errors_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))

NetworkTransmitPacketsSumRate

"_network_transmit_packets_sum_rate"
Number of packets transmitted per second.
Scope: cluster, node, workload, pod
expr:
    cluster: sum ( rate(node_network_transmit_packets{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*"}[2m]))
    node: sum ( rate(node_network_transmit_packets{device!~"lo|veth.*|docker.*|flannel.*|cali.*|cbr.*", instance=~"$node.*"}[2m]))
    workload/pod: sum(rate(container_network_transmit_packets_total{namespace="$namespace", pod_name=~"$pod.*"}[2m]))