Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Save vishiy/05626ae3afbc93d889cda7795e00629b to your computer and use it in GitHub Desktop.
Save vishiy/05626ae3afbc93d889cda7795e00629b to your computer and use it in GitHub Desktop.
Gist for metrics collected by Azure monitor for containers
MetricCategory MetricName MetricDimensions MetricType MetricTable MetricNamespace MetricOrigin Comments
Node-CPU cpuAllocatableNanoCores Objectname='K8SNode', Instancename=<nodename> Gauge Perf Amount of cpu that is allocatable by Kubernetes to run pods, expressed in nanocores/nanocpu unit
Node-CPU cpuCapacityNanocores Objectname='K8SNode', Instancename=<nodename> Gauge Perf Total CPU capacity of the node in nanocore/nanocpu unit
Node-CPU cpuUsageNanocores Objectname='K8SNode', Instancename=<nodename> Gauge Perf CPU used by node in nanocore/nanocpu unit
Node-Memory memoryAllocatableBytes Objectname='K8SNode', Instancename=<nodename> Gauge Perf Amount of memory in bytes that is allocatable by kubernetes to run pods
Node-Memory memoryCapacityBytes Objectname='K8SNode', Instancename=<nodename> Gauge Perf Total memory capacity of the node in bytes
Node-Memory memoryRssBytes Objectname='K8SNode', Instancename=<nodename> Gauge Perf Rss memory used by the node in bytes. Collected only for Linux nodes
Node-Memory memoryWorkingSetBytes Objectname='K8SNode', Instancename=<nodename> Gauge Perf Working set memory used by the node in bytes
Node-Other restartTimeEpoch Objectname='K8SNode', Instancename=<nodename> Gauge Perf Last time node restarted in epoch seconds
Node-DiskUsage free device,hostName,path,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/disk container.azm.ms/telegraf Free disk space in bytes (excludes --tmpfs, devtmpfs, devfs, overlay, aufs, squashfs)
Node-DiskUsage used device,hostName,path,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/disk container.azm.ms/telegraf Used disk space in bytes (excludes --tmpfs, devtmpfs, devfs, overlay, aufs, squashfs)
Node-DiskUsage used_percent device,hostName,path,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/disk container.azm.ms/telegraf Used disk space in percentage (excludes--tmpfs, devtmpfs, devfs, overlay, aufs, squashfs)
Node-DiskIO reads hostName,name,clusterId,clusterName Counter InsightsMetrics container.azm.ms/diskio container.azm.ms/telegraf Number of reads (incremented when I/O request completes)--(filtered for devices having names with the regex pattern "sd[a-z][0-9]")
Node-DiskIO read_bytes hostName,name,clusterId,clusterName Counter InsightsMetrics container.azm.ms/diskio container.azm.ms/telegraf Number of bytes read from the block device--(filtered for devices having names with the regex pattern "sd[a-z][0-9]")
Node-DiskIO read_time hostName,name,clusterId,clusterName Counter InsightsMetrics container.azm.ms/diskio container.azm.ms/telegraf Number of milliseconds that read requests have waited on the block device--(filtered for devices having names with the regex pattern "sd[a-z][0-9]")
Node-DiskIO writes hostName,name,clusterId,clusterName Counter InsightsMetrics container.azm.ms/diskio container.azm.ms/telegraf Number of writes (incremented when I/O request completes)--(filtered for devices having names with the regex pattern "sd[a-z][0-9]")
Node-DiskIO write_bytes hostName,name,clusterId,clusterName Counter InsightsMetrics container.azm.ms/diskio container.azm.ms/telegraf Number of bytes written to the block device--(filtered for devices having names with the regex pattern "sd[a-z][0-9]")
Node-DiskIO write_time hostName,name,clusterId,clusterName Counter InsightsMetrics container.azm.ms/diskio container.azm.ms/telegraf Number of milliseconds that write requests have waited on the block device--(filtered for devices having names with the regex pattern "sd[a-z][0-9]")
Node-DiskIO io_time hostName,name,clusterId,clusterName Counter InsightsMetrics container.azm.ms/diskio container.azm.ms/telegraf Number of milliseconds during which the device has had I/O requests queued--(filtered for devices having names with the regex pattern "sd[a-z][0-9]")
Node-DiskIO iops_in_progress hostName,name,clusterId,clusterName Counter InsightsMetrics container.azm.ms/diskio container.azm.ms/telegraf Number of I/O requests that have been issued to device driver but have not yet completed--(filtered for devices having names with the regex pattern "sd[a-z][0-9]")
Node-GPU nodeGpuAllocatable gpuVendor,Computer,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/gpu container.azm.ms Number of allocatable GPUs in the node at any point in time
Node-GPU nodeGPUCapacity gpuVendor,Computer,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/gpu container.azm.ms Total number of GPUs in the node
Node-Network bytes_recv hostName,interface,clusterId,clusterName Counter InsightsMetrics container.azm.ms/net container.azm.ms/telegraf Total number of bytes received by the interface
Node-Network bytes_sent hostName,interface,clusterId,clusterName Counter InsightsMetrics container.azm.ms/net container.azm.ms/telegraf Total number of bytes sent by the interface
Node-Network err_in hostName,interface,clusterId,clusterName Counter InsightsMetrics container.azm.ms/net container.azm.ms/telegraf Total number of receive errors detected by the interface
Node-Network err_out hostName,interface,clusterId,clusterName Counter InsightsMetrics container.azm.ms/net container.azm.ms/telegraf Total number of transmit errors detected by the interface
Container-CPU cpuRequestNanoCores Objectname='K8SContainer', Instancename=podUID/containerName Gauge Perf Container's cpu request in nanocore/nanocpu unit
Container-CPU cpuLimitNanoCores Objectname='K8SContainer', Instancename=podUID/containerName Gauge Perf Container's cpu limit in nanocore/nanocpu unit. If limits are not specified, node's capacity will be rolled-up as container's limit
Container-CPU cpuUsageNanoCores Objectname='K8SContainer', Instancename=podUID/containerName Gauge Perf Container's CPU usage in nanocore/nanocpu unit
Container-Memory memoryRequestBytes Objectname='K8SContainer', Instancename=podUID/containerName Gauge Perf Container's memory request in bytes
Container-Memory memoryLimitBytes Objectname='K8SContainer', Instancename=podUID/containerName Gauge Perf Container's memory limit in bytes. If limits are not specified, node's capacity will be rolled-up as container's limit
Container-Memory memoryRssBytes Objectname='K8SContainer', Instancename=podUID/containerName Gauge Perf Container's rss memory usage in bytes. Collected only for containers running in Linux nodes
Container-Memory memoryWorkingSetBytes Objectname='K8SContainer', Instancename=podUID/containerName Gauge Perf Container's working set memory usage in bytes
Container-Other restartTimeEpoch Objectname='K8SContainer', Instancename=podUID/containerName Gauge Perf Last time the container restarted in epoch seconds
Container-GPU containerGpuRequests containerName=podUID/containerName,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/gpu container.azm.ms Number of GPUs requested by the container
Container-GPU containerGpuLimits containerName=podUID/containerName,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/gpu container.azm.ms Container's GPU limit
Container-GPU containerGpuDutyCycle containerName=podUID/containerName,gpuId,gpuModel,gpuVendor,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/gpu container.azm.ms Percentage of time over the past sample period during which GPU was busy/actively processing for a container. Duty cycle is a number between 1 and 100
Container-GPU containerGpumemoryTotalBytes containerName=podUID/containerName,gpuId,gpuModel,gpuVendor,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/gpu container.azm.ms Total GPU memory available for the container
Container-GPU containerGpumemoryUsedBytes containerName=podUID/containerName,gpuId,gpuModel,gpuVendor,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/gpu container.azm.ms GPU memory used by the container
Pod-PV pvUsedBytes podUID,podName,podNamespace,pvName,pvcName,pvCapacityBytes,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/pv container.azm.ms Used space in bytes for a specific PV consumed by a specific Pod
Controller-Deployments kube_deployment_status_replicas_ready creationTime,deployment,deploymentStrategy,k8sNamespace,spec_replicas,status_replicas_available,status_replicas_updated,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/kubestate container.azm.ms Total number of ready pods targeted by deployment (status.readyReplicas)
Controller-HPA kube_hpa_status_current_replicas creationTime,hpa,k8sNamespace,lastScaleTime,spec_max_replicas,spec_min_replicas,status_desired_replicas,targetKind,targetName Gauge InsightsMetrics container.azm.ms/kubestate container.azm.ms Current number of replicas of pods managed by this autoscaler (status.currentReplicas)
Kubelet kubelet_docker_operations hostName,operation_type,scrapeUrl,clusterId,clusterName Counter InsightsMetrics container.azm.ms/prometheus Cumulative number of Docker operations by operation type
Kubelet kubelet_docker_operations_errors hostName,operation_type,scrapeUrl,clusterId,clusterName Counter InsightsMetrics container.azm.ms/prometheus container.azm.ms/telegraf Cumulative number of Docker operation errors by operation type
Kubelet kubelet_running_pod_count hostName,scrapeUrl,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/prometheus container.azm.ms/telegraf Number of pods currently running
Kubelet volume_manager_total_volumes hostname,plugin_name,scrapeUrl,state,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/prometheus container.azm.ms/telegraf Number of volumes in Volume Manager
Kubelet kubelet_node_config_error hostName,scrapeUrl,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/prometheus container.azm.ms/telegraf This metric is true (1) if the node is experiencing a configuration-related error, false (0) otherwise
Kubelet process_resident_memory_bytes hostName,scrapeUrl,clusterId,clusterName Gauge InsightsMetrics container.azm.ms/prometheus container.azm.ms/telegraf Kubelet's resident memory size in bytes
Kubelet process_cpu_seconds_total hostName,scrapeUrl,clusterId,clusterName Counter InsightsMetrics container.azm.ms/prometheus container.azm.ms/telegraf Kubelet's total user and system CPU time spent in seconds
MetricCategory MetricName MetricDimensions MetricType MetricNamespace Comments
Node-CPU cpuUsageMillicores host Gauge insights.container/nodes CPU used by node in millicore units
Node-CPU cpuUsagePercentage host Gauge insights.container/nodes CPU used by node in percentage unit
Node-Memory memoryRssBytes host Gauge insights.container/nodes Rss memory used by the node in bytes (only for Linux nodes)
Node-Memory memoryRssPercentage host Gauge insights.container/nodes Rss memory used by the node in percentage unit (only for Linux nodes)
Node-Memory memoryWorkingSetBytes host Gauge insights.container/nodes Working set memory used by the node in bytes
Node-Memory memoryWorkingSetPercentage host Gauge insights.container/nodes Working set memory used by the node in percentage units
Node-Other nodesCount Status Gauge insights.container/nodes Count of nodes by last know status
Node-DiskUsage diskUsedPercentage device,host Gauge insights.container/nodes Used disk space per disk in percentage
Container-CPU cpuExceededPercentage containerName,controllerName,Kubernetes namespace,podName,thresholdPercentage Gauge insights.container/container Container's CPU usage exceeded threshold % [default threshold is 95% and configurable]
Container-Memory memoryRssExceededPercentage containerName,controllerName,Kubernetes namespace,podName,thresholdPercentage Gauge insights.container/container Container's rss memory usage exceeded threshold % [default threshold is 95% and configurable] (only for Linux nodes)
Container-Memory memoryWorkingSetExceededPercentage containerName,controllerName,Kubernetes namespace,podName,thresholdPercentage Gauge insights.container/container Container's workingset memory usage exceeded threshold % [default threshold is 95% and configurable] (only for Linux nodes)
Pod podCount controllerName, Kubernetes namespace,node,phase Gauge insights.container/pod Count of pods by namespace, controller & phase
Pod podReadyPercentage controllerName, Kubernetes namespace Gauge insights.container/pod Percentage of pods in 'ready' state by namespace & controller
Pod oomKilledContainerCount controllerName, Kubernetes namespace Gauge insights.container/pod Count of OOM killed containers by namespace & controller
Pod restartingContainerCount controllerName, Kubernetes namespace Gauge insights.container/pod Count of container restarts by controller & namespace
Pod completedJobsCount controllerName, Kubernetes namespace Gauge insights.container/pod Count of completed jobs (that are yet to be cleaned up) in the past n hours (default n=6) by namespace & controller. 6 hours default threshol d is configurable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment