@luckylittle (last active March 30, 2024 11:39)
Prometheus Certified Associate (PCA)

Mock Exam 1

Q1. The metric node_cpu_temp_celsius reports the current temperature of a node's CPU in Celsius. What query will return the average temperature across all CPUs on a per-node basis? The query should return {instance="node1"} 23.5 //average temp across all CPUs on node1 {instance="node2"} 33.5 //average temp across all CPUs on node2.

node_cpu_temp_celsius{instance="node1", cpu="0"} 28
node_cpu_temp_celsius{instance="node1", cpu="1"} 19
node_cpu_temp_celsius{instance="node2", cpu="0"} 36
node_cpu_temp_celsius{instance="node2", cpu="1"} 31

A1: avg by(instance) (node_cpu_temp_celsius)

Q2: What method does Prometheus use to collect metrics from targets? A2: pull

Q3: An engineer forgot to address an alert. Based on the Alertmanager config below, how long will they need to wait to see the alert again?

route:
  receiver: pager
  group_by: [alertname]
  group_wait: 10s
  repeat_interval: 4h
  group_interval: 5m
  routes:
    - match:
        team: api
      receiver: api-pager
    - match:
        team: frontend
      receiver: frontend-pager

A3: 4h

Q4: Which query below will get all time series for metric node_disk_read_bytes_total for job=web, and job=node? A4: node_disk_read_bytes_total{job=~"web|node"}

Q5: What type of database does Prometheus use? A5: Time Series

Q6: Analyze the alertmanager configs below. For all the alerts that got generated, how many total notifications will be sent out?

route:
  receiver: general-email
  group_by: [alertname]
  routes:
    - receiver: frontend-email
      group_by: [env]
      matchers:
        - team: frontend

The following alerts get generated by Prometheus with the defined labels.
alert1
team: frontend
env: dev

alert2
team: frontend
env: dev

alert3
team: frontend
env: prod

alert4
team: frontend
env: prod

alert5
team: frontend
env: staging

A6: 3

Q7: What is the Prometheus client library used for? A7: Instrumenting applications to generate Prometheus metrics and to push metrics to the Pushgateway

Q8: Management has decided to offer a file upload service where the SLO states that 97% of all uploads should complete within 30s. A histogram metric is configured to track the upload time. Which of the following bucket configurations is recommended for the desired SLO? A8: 10, 25, 27, 30, 32, 35, 49, 50 [since histogram quantiles are approximations, to find out if an SLO has been met, make sure that a bucket is specified at the desired SLO value]
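Once a 30s bucket exists, meeting the SLO can be checked with a quantile query along these lines (the metric name and the 5m window here are only illustrative, not from the exam):

histogram_quantile(0.97, rate(upload_duration_seconds_bucket[5m])) <= 30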

Q9: Which of the following is not a valid method for reloading alertmanager configuration? A9: hit the reload config button in alertmanager web ui

Q10: What two labels are assigned to every metric by default? A10: instance, job

Q11: What configuration will make it so Prometheus doesn’t scrape targets with a label of team: frontend?

#Option A:
relabel_configs:
  - source_labels: [team]
    regex: frontend
    action: drop

#Option B:
relabel_configs:
  - source_labels: [frontend]
    regex: team
    action: drop

#Option C:
metric_relabel_configs:
  - source_labels: [team]
    regex: frontend
    action: drop

#Option D:
relabel_configs:
  - match: [team]
    regex: frontend
    action: drop

A11: Option A [relabel_configs is where you will define which targets Prometheus should scrape]

Q12: Where should alerting rules be defined?

A12: separate rules file

Q13: Which query below will give the 99% quantile of the metric http_requests_total? A13: histogram_quantile(0.99, http_requests_total_bucket)

Q14: What metric should be used to track the uptime of a server? A14: counter

Q15: Which component of the Prometheus architecture should be used to collect metrics of short-lived jobs? A15: push gateway

Q16: What is the purpose of Prometheus scrape_interval? A16: Defines how frequently to scrape a target

Q17: What does the following metric_relabel_config do?

scrape_configs:
  - job_name: example
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: database_errors_total
        action: replace
        target_label: __name__
        replacement: database_failures_total

A17: Renames the metric database_errors_total to database_failures_total

Q18: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster? A18: service discovery

Q19: For a histogram metric, what are the different submetrics? A19: _count [total number of observations], _bucket [number of observations for a specific bucket], _sum [sum of all observations]

Q20: What is the default web port of Prometheus? A20: 9090

Q21: Add an annotation to the alert called description that will print out a message that looks like this: Instance <instance> has low disk space on filesystem <mountpoint>, current free space is at <value>%

groups:
  - name: node
    rules:
      - alert: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10

## Examples of the two metrics used in the alert can be seen below.

# node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}

# node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}

# Choose the correct answer:
# Option A:
description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%

# Option B:
description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%

# Option C:
description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%

# Option D:
description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%

A21: Option B

Q22: What does the double underscore __ before a label name signify? A22: The label is a reserved label

Q23: The metric http_errors_total has 3 labels, path, method, error. Which of the following queries will give the total number of errors for a path of /auth, method of POST, and error code of 401? A23: http_errors_total{path="/auth", method="POST", code="401"}

Q24: What are the different states a Prometheus alert can be in? A24: inactive, pending, firing

Q25: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects? A25: exporters

Q26: Which of the following is not a valid time value to be used in a range selector? A26: 2mo

Q27: Analyze the example Alertmanager config below and determine which receiver the alert will be sent to when an alert with the labels team: api and severity: critical arrives at Alertmanager.

route:
  receiver: general-email
  routes:
    - receiver: frontend-email
      matchers:
        - team: frontend
      routes:
        - matchers:
            severity: critical
          receiver: frontend-pager
    - receiver: backend-email
      matchers:
        - team: backend
      routes:
        - matchers:
            severity: critical
          receiver: backend-pager
    - receiver: auth-email
      matchers:
        - team: auth
      routes:
        - matchers:
            severity: critical
          receiver: auth-pager

A27: general-email

Q28: A metric to track requests to an api http_requests_total is created. Which of the following would not be a good choice for a label? A28: email

Q29: Which query below will return a range vector? A29: node_boot_time_seconds[5m]

Q30: Based off the metrics below, which query will return the same result as the query database_write_timeouts / ignoring(error) database_error_total

database_write_timeouts{instance="db1", job="db", error="212", type="mysql"} 12
database_error_total{instance="db1", job="db", type="mysql"} 67

A30: database_write_timeouts / on(instance, job, type) database_error_total

Q31: What is the purpose of the for attribute in a Prometheus alert rule? A31: Determines how long a rule must be true before firing an alert

Q32: Which query will give the sum of the sizes of all filesystems on the machine? The metric node_filesystem_size_bytes lists out all of the filesystems and their total size.

node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", mountpoint="/boot/efi"} 536834048
node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="192.168.1.168:9100", mountpoint="/"} 13268975616
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run"} 727924736
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/lock"} 5242880
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/snapd/ns"} 727924736
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/user/1000"} 727920640

A32: sum(node_filesystem_size_bytes{instance="192.168.1.168:9100"})

Q33: What are the 3 components of the Prometheus server? A33: retrieval, TSDB (time-series database), HTTP server

Q34: What selector will match on time series whose mountpoint label doesn’t start with /run?

node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node1", mountpoint="/boot/efi"}
node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node2", mountpoint="/boot/efi"}
node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node1", mountpoint="/"}
node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node2", mountpoint="/"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/lock"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/snapd/ns"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/user/1000"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/lock"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/snapd/ns"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/user/1000"}

A34: node_filesystem_avail_bytes{mountpoint!~"/run.*"}

Q35: Which statement is true about the rate/irate functions? A35: rate() calculates average rate over entire interval, irate() calculates the rate only between the last two datapoints in an interval

Q36: What is the default path Prometheus will scrape to collect metrics? A36: /metrics

Q37: The following PromQL expression tries to divide node_filesystem_avail_bytes by node_filesystem_size_bytes: node_filesystem_avail_bytes / node_filesystem_size_bytes. The expression does not return any results; fix it so that it successfully divides the two metrics. This is what the two metrics look like before the division operation:

node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", class="SSD", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}

node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}

A37: node_filesystem_avail_bytes / ignoring(class) node_filesystem_size_bytes

Q38: What are the 3 components of observability? A38: logging, metrics, traces

Q39: Which of the following statements are true regarding Alert labels and annotations?

route:
  receiver: staff
  group_by: ['severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - matchers:
        job: kubernetes
      receiver: infra
      group_by: ['severity']

A39: Alert labels can be used as metadata so alertmanager can match on them and perform routing policies, whereas annotations should be used for cosmetic descriptions of the alerts

Q40: The metric http_errors_total{code="404"} tracks the number of 404 errors a web server has seen. Which query returns the average rate of 404s the server has seen over the past 2 hours? Use a 2m sample range and a query interval of 1m. A40: avg_over_time(rate(http_errors_total{code="404"}[2m]) [2h:1m]) [since we need the average for the past 2 hours, the first value in the subquery is 2h and the second number is the query interval]

Q41: Which query will return all time series for the metric node_network_transmit_drop_total that are greater than 20 and less than 100? A41: node_network_transmit_drop_total > 20 and node_network_transmit_drop_total < 100

Q42: What does the following metric_relabel_config do?

scrape_configs:
  - job_name: example
    metric_relabel_configs:
      - source_labels: [datacenter]
        regex: (.*)
        action: replace
        target_label: location
        replacement: dc-$1

A42: changes the datacenter label to location and prepends the value with dc-

Q43: What type of data should Prometheus monitor? A43: numeric

Q44: Which type of observability would be used to track a request/transaction as it traverses a system? A44: traces

Q45: Add an annotation to the alert called description that will print out a message that looks like this: Instance <instance> has low disk space on filesystem <mountpoint>, current free space is at <value>%

groups:
  - name: node
    rules:
      - alert: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10

# Examples of the two metrics used in the alert can be seen below
# node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
# node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}

# Choose the correct option:

#Option A:
description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%

#Option B:
description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%

#Option C:
description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%

#Option D:
description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%

A45: Option B

Q46: Regarding histogram and summary metrics, which of the following are true? A46: histogram quantiles are calculated server side and summary quantiles are calculated client side [for histograms, quantiles must be calculated server side, thus they are less taxing on client libraries, whereas summary metrics are the opposite]

Q47: What is this an example of? 'Service provider guaranteed 99.999% uptime each month or else the customer will be awarded $10k' A47: SLA

Q48: Which of the following is Prometheus’ built in dashboarding/visualization feature? A48: Console templates

Q49: Which query below will give the active bytes on instance 10.1.1.1:9100 45m ago? A49: node_memory_Active_bytes{instance="10.1.1.1:9100"} offset 45m

Q50: What type of metric should be used for measuring internal temperature of a server? A50: gauge

Q51: What is the name of the cli utility that comes with Prometheus? A51: promtool

Q52: How can alertmanager prevent certain alerts from generating notification for a temporary period of time? A52: Configuring a silence

Q53: In the scrape configs for a Pushgateway, what is the purpose of honor_labels: true?

scrape_configs:
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ["192.168.1.168:9091"]

A53: Allows metrics to specify the instance and job labels instead of pulling them from scrape_configs
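For context, a short-lived job typically pushes its metrics to the Pushgateway with something like the following (the metric name and the job/instance values are only examples); honor_labels: true then keeps those pushed job/instance labels intact when Prometheus scrapes the gateway:

echo "backup_duration_seconds 37.2" | curl --data-binary @- http://192.168.1.168:9091/metrics/job/db_backup/instance/db1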

Q54: Analyze the example Alertmanager config below and determine which receiver the alert will be sent to when an alert with the labels team: backend and severity: critical arrives at Alertmanager.

route:
  receiver: general-email
  routes:
    - receiver: frontend-email
      matchers:
        - team: frontend
      routes:
        - matchers:
            severity: critical
          receiver: frontend-pager
    - receiver: backend-email
      matchers:
        - team: backend
      routes:
        - matchers:
            severity: critical
          receiver: backend-pager
    - receiver: auth-email
      matchers:
        - team: auth
      routes:
        - matchers:
            severity: critical
          receiver: auth-pager

A54: backend-pager

Q55: Which of the following would make for a poor SLI? A55: high disk utilization [things like CPU, memory, and disk utilization are poor SLIs as the user may not experience any degradation of service during these events]

Q56: Which of the following is not a valid way to reload Prometheus configuration? A56: promtool config reload

Q57: Which of the following is not something that is tracked in a span within a trace? A57: complexity

Q58: You are writing your own exporter for a Redis database. Which of the following would be the correct name for a metric to represent memory used by the Redis instance? A58: redis_mem_used_bytes [the first part should be the app name, then the metric name, then the unit]

Q59: Which cli command can be used to verify/validate prometheus configurations? A59: promtool check config

Q60: Which query will return targets who have more than 50 arp entries? A60: node_arp_entries{job="node"} > 50

Mock Exam 2

Q1: What data type do Prometheus metric values use? A1: 64bit floats

Q2: The metric node_fan_speed_rpm tracks the current fan speeds. The location label specifies where on the server the fan is located. Which query will return the fan speeds for all fans except the rear fan? A2: node_fan_speed_rpm{location!="rear"}

Q3: With the following alertmanager configs, after a notification has been sent out, a new alert comes in. How long will alertmanager wait before firing a new notification?

route:
  receiver: staff
  group_by: ['severity']
  group_wait: 60s
  group_interval: 15m
  repeat_interval: 12h
  routes:
    - matchers:
        job: kubernetes
      receiver: infra
      group_by: ['severity']

A3: 15m [the group_interval property determines how long Alertmanager will wait after sending a notification before it sends a new notification for a group]

Q4: What is the purpose of Prometheus scrape_interval? A4: defines how frequently to scrape a target

Q5: The metric http_requests tracks the total number of requests across each endpoint and method. What query will return the total number of requests for each path?

http_requests{method="get", path="/auth"} 3
http_requests{method="post", path="/auth"} 1
http_requests{method="get", path="/user"} 4
http_requests{method="post", path="/user"} 8
http_requests{method="post", path="/upload"} 2
http_requests{method="get", path="/tasks"} 4
http_requests{method="put", path="/tasks"} 6
http_requests{method="post", path="/tasks"} 1
http_requests{method="get", path="/admin"} 3
http_requests{method="post", path="/admin"} 9

A5: sum by(path) (http_requests)

Q6: An application is advertising metrics at the path /monitoring/stats. What property in the scrape configs needs to be modified? A6: metrics_path: "/monitoring/stats"

Q7: Analyze the Alertmanager config below. Based on the alert below, which receiver will send the notification for the alert? Alert labels: team: frontend

route:
  group_wait: 20s
  receiver: general
  group_by: ['alertname']
  routes:
    - match:
        org: kodekloud
      receiver: kodekloud-pager
    - match:
        org: apple
      receiver: apple

A7: general

Q8: What type of database does Prometheus use? A8: Time-series database

Q9: Which of the following is Prometheus’ built in dashboarding/visualization feature? A9: Console templates

Q10: What command should be used to verify that a Prometheus config is valid? A10: promtool check config prometheus.yml

Q11: What type of data should prometheus monitor? A11: numeric

Q12: What is the default port that Prometheus listens on? A12: 9090

Q13: A car reports the number of miles it has been driven with the metric car_total_miles. Which query returns the average rate of miles the car has driven over the past 2 hours? Use a 4m sample range and a query interval of 1m. A13: avg_over_time(rate(car_total_miles[4m]) [2h:1m])

Q14: Which of the following statements are true regarding alert labels and annotations? A14: Alert labels can be used as metadata so Alertmanager can match on them and perform routing policies; annotations should be used for cosmetic descriptions of the alerts

Q15: What method does Prometheus use to collect metrics from targets? A15: pull

Q16: Which of the following is not a form of observability? A16: streams

Q17: How is application instrumentation achieved? A17: Client libraries

Q18: Which query below will give the 95% quantile of the metric http_file_upload_bytes? A18: histogram_quantile(0.95, http_file_upload_bytes_bucket)

Q19: What is this an example of: 99% availability with a median latency less than 300ms? A19: SLO

Q20: What is the default path Prometheus will scrape to collect metrics? A20: /metrics

Q21: Where are alert rules defined? A21: In a separate rules file on the Prometheus server

Q22: The kafka_topic_partition_replicas metric tracks the number of replicas for each topic/partition. Which query will get its values for the past 2 hours? The result should return a range vector. A22: kafka_topic_partition_replicas[2h]

Q23: The metric http_errors_total has 3 labels: path, method, error. Which of the following queries will give the total number of errors for a path of /auth, method of POST, and error code of 401? A23: http_errors_total{path="/auth", method="POST", code="401"}

Q24: What update needs to occur to add an annotation called description that prints out the message redis server <insert instance name> is down! A24: description: "redis server {{.Labels.instance}} is down!"

Q25: Which statement is true regarding Prometheus rules? A25: Groups are run in parallel, and rules within a group are run sequentially

Q26: What does the following config do?

scrape_configs:
  - job_name: "demo"
    metric_relabel_configs:
      - regex: fstype
        action: labeldrop

A26: The label fstype will be dropped for all metrics

Q27: The metric node_filesystem_avail_bytes reports the available bytes for each filesystem on a node. Which query will return all filesystems that have either less than 1000 available bytes or greater than 50000 available bytes? A27: node_filesystem_avail_bytes < 1000 or node_filesystem_avail_bytes > 50000

Q28: For metric_relabel_configs and relabel_configs, when matching on multiple source labels, what is the default delimiter A28: ;

Q29: Which of the following is not a valid method for reloading alertmanager configuration? A29: hit the reload config button in alertmanager web ui

Q30: Which of the following components is responsible for receiving metrics from short lived jobs? A30: pushgateway

Q31: For a histogram metric, what are the different submetrics? A31: _count, _bucket, _sum

Q32: Which query will return whether or not a target is currently able to be scraped? A32: up

Q33: What does the double underscore __ before a label name signify? A33: The label is a reserved label

Q34: Which configuration in alertmanager will wait 2 minutes before firing off an alert to prevent unnecessary notifications getting sent? A34: group_wait: 2m [when an alert arrives at Alertmanager, it will wait for the amount of time specified in group_wait for other alerts to arrive before firing off a notification]

Q35: Which of the following is not a component of the Prometheus solution? A35: influxdb

Q36: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster? A36: service discovery

Q37: The metric mealplanner_consumed_calories tracks the number of calories that have been consumed by the user. What query will return the amount of calories that had been consumed 4 days ago? A37: mealplanner_consumed_calories offset 4d

Q38: Which of the following would make for a good SLI? A38: request failures [for good SLIs, use metrics that impact the user's experience. Disk utilization, memory utilization, fan speed, and server temperature are not things that impact the user, whereas request failures certainly will]

Q39: What does the following config do?

scrape_configs:
  - job_name: "demo"
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: docker_container_crash_total
        action: replace
        target_label: __name__
        replacement: docker_container_restart_total

A39: Renames the metric docker_container_crash_total to docker_container_restart_total

Q40: What type of metric should be used to track the number of miles a car has driven? A40: counter

Q41: What type of metric should be used for measuring a user's heart rate? A41: gauge

Q42: What is the purpose of repeat_interval in alertmanager? A42: How long to wait before sending a notification again if it has already been sent successfully for an alert

Q43: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects? A43: exporters

Q44: What are the two attributes that metrics can have? A44: TYPE, HELP

Q45: What query will return all the instances whose active memory bytes is less than 10000? A45: node_memory_Active_bytes < 10000

Q46: How many labels does the following time series have node_fan_speed{instance="node8", job="server", fan="2"}? A46: 3

Q47: In the prometheus configuration, what is the purpose of the scheme field? A47: Determines if Prometheus will use HTTP or HTTPS

Q48: The metric health_consumed_calories tracks how many calories a user has eaten and health_burned_calories tracks the number of calories burned while exercising. To calculate net calories for the day subtract health_burned_calories from health_consumed_calories. Based on the time series below, which expression successfully calculates net calories.

health_consumed_calories{job="health", meal="dinner"} 800
health_burned_calories{job="health", activity="cardio"} 200

A48: health_consumed_calories - ignoring(meal, activity) health_burned_calories

Q49: What does the following config do?

scrape_configs:
 - job_name: example
   relabel_configs:
    - source_labels: [env, team]
      regex: dev;marketing
      action: drop

A49: Drops all targets whose env label is set to dev and team label is set to marketing

Q50: What is the name of the Prometheus query language? A50: PromQL

Q51: You are writing an exporter for RabbitMQ and are creating a metric to track the size of the message queue. Which of the following would be an appropriate name for the metric? A51: rabbitmq_message_bytes

Q52: What are the different states a Prometheus alert can be in? A52: inactive, pending, firing

Q53: Which statement is true about the rate/irate functions? A53: rate() calculates average rate over entire interval, irate() calculates the rate only between the last two datapoints in an interval

Q54: What does the following config do?

scrape_configs:
  - job_name: "example"
    metric_relabel_configs:
      - source_labels: [team]
        regex: (.*)
        action: replace
        target_label: organization
        replacement: org-$1

A54: renames the team label to organization and the value of the label will get prepended with org-

Q55: Analyze the Alertmanager config below. Based on the following alert, which receiver will receive the notification? alertname: node_filesystem_full, labels: team: frontend, notification: pager

route:
  receiver: general-email
  group_by: [alertname]
  routes:
    - receiver: frontend-email
      matchers:
        - team: frontend
      routes:
        - matchers:
            notification: pager
          receiver: frontend-pager
    - receiver: backend-email
      matchers:
        - team: backend
    - receiver: auth-email
      matchers:
        - team: auth

A55: frontend-pager

Q56: A database backup service has an SLO that states that 97% of all backup jobs will be completed within 60s. A histogram metric is configured to track the backup process time, which of the following bucket configurations is recommended for the desired SLO? A56: 35, 45, 55, 60, 65, 75, 100 [Since histogram quantiles are approximations, to find out if a SLO has been met, make sure that a bucket is specified at the desired SLO value of 60s. The exact number (60s) must be present in the list.]

Q57: Which of the following is not a valid time value to be used in a range selector? A57: 3hr

Q58: What type of data does Prometheus collect? A58: numeric

Q59: The node_cpu_seconds_total metric tracks the number of seconds cpu has spent in a specific mode. The metric will break it down per cpu using the cpu label. Which query will return the total time all cpus on an instance spent in a mode that is not idle. Make sure to group the result on a per instance basis.

node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="idle"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="iowait"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="irq"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="nice"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="softirq"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="steal"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="system"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="iowait"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="irq"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="nice"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="softirq"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="steal"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="system"}

A59: sum by(instance) (node_cpu_seconds_total{mode!="idle"})

Q60: The following time series return values with a lot of decimal places. What query will return values rounded down to the closest integer? node_cpu_seconds_total {cpu="0", mode="idle"} 115.12 {cpu="0", mode="irq"} 87.4482 {cpu="0", mode="steal"} 44.245 A60: floor(node_cpu_seconds_total)

Prometheus Certified Associate (PCA)

Curriculum

  1. 28% PromQL
  • Selecting Data
  • Rates and Derivatives
  • Aggregating over time
  • Aggregating over dimensions
  • Binary operators
  • Histograms
  • Timestamp Metrics
  2. 20% Prometheus Fundamentals
  • System Architecture
  • Configuration and Scraping
  • Understanding Prometheus Limitations
  • Data Model and Labels
  • Exposition Format
  3. 18% Observability Concepts
  • Metrics
  • Understand logs and events
  • Tracing and Spans
  • Push vs Pull
  • Service Discovery
  • Basics of SLOs, SLAs, and SLIs
  4. 18% Alerting & Dashboarding
  • Dashboarding basics
  • Configuring Alerting rules
  • Understand and Use Alertmanager
  • Alerting basics (when, what, and why)
  5. 16% Instrumentation & Exporters
  • Client Libraries
  • Instrumentation
  • Exporters
  • Structuring and naming metrics

Observability Fundamentals

Observability

  • the ability to understand and measure the state of a system based on data generated by the system
  • allows you to generate actionable outputs from unexpected scenarios
  • to better understand the internals of your system
  • greater need for observability in distributed systems & microservices
  • troubleshooting - e.g. why are error rates high?
  • 3 pillars of observability:
    1. Logs - records of events that have occurred and encapsulate info about the specific event
    2. Metrics - numerical value/information about the state, data can be aggregated over time, contains name, value, timestamp, dimensions
    3. Traces - follow operations (trace-id) as they travel through different hops, spans are events forming a trace
  • Prometheus only handles metrics, not logs or traces!

SLO/SLA/SLI

a. SLI (service level indicators) = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)

  • not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
  • high CPU, high memory are poor SLIs as they don't necessarily affect user's experience

b. SLO (service level objectives) = target value or range for an SLI

  • examples:
    • SLI = Latency
    • SLO = Latency < 100ms
    • SLI = Availability
    • SLO = 99.99% uptime
  • should be directly related to the customer experience
  • purpose is to quantify reliability of a product to a customer
  • may be tempted to set unnecessarily aggressive values
  • goal is not to achieve perfection, but make customers happy

c. SLA (service level agreement) = contract between a vendor and a user that guarantees SLO

Prometheus Fundamentals

  • open source monitoring tool that collects metrics data and provides tools to visualize the data
  • use cases:
    • collect metrics from different locations (e.g. like West DC, central DC, East DC, AWS etc.)
    • detect high memory on the host running the MySQL db and notify the operations team via email
    • find out at which uploaded video length the application starts to degrade
  • allows you to generate alerts when thresholds are reached
  • collects data by scraping targets that expose metrics through an HTTP endpoint
  • stored in time series db and can be queried with built-in PromQL (Prometheus Query Language)
  • what can it monitor:
    • CPU/memory
    • disk space
    • service uptime
    • app specific data - number of exceptions, latency, pending requests
    • networking devices, databases etc.
  • exclusively monitor numeric time-series data!
  • does not monitor events, system logs, traces!
  • originally sponsored by SoundCloud
  • written in Go

Prometheus Architecture

  • 3 core components:
    • Retrieval (scrapes metric data)
    • TSDB (time-series database stores metric data)
    • HTTP server (accepts PromQL query)
  • lots of other components making up the whole solution:
    • exporters (mini-processes running on the targets) that the retrieval component pulls the metrics from
    • pushgateway (short-lived jobs push their data to it and Prometheus then retrieves it from there)
    • service discovery is all about providing a list of targets so you don't have to hardcode those values
    • alertmanager handles all of the emails, SMS, Slack etc. after alerts are pushed to it
    • Prometheus Web UI or Grafana etc.
  • collects by sending HTTP request to /metrics endpoint of each target, path can be changed via metrics_path
  • several native exporters:
    • node-exporters (Linux)
    • Windows
    • MySQL
    • Apache
    • HAProxy
    • client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
  • Pull based approach is better, because:
    • easier to tell if the target is down
    • does not DDoS the metrics server
    • definitive list of targets to monitor (central source of truth)
  • By default, the Prometheus server will use port 9090.

Prometheus Installation

  1. Download *.tar from http://prometheus.io/download
  2. untarred folder contains console_libraries, consoles, prometheus (binary), prometheus.yml (config) and promtool (CLI utility) + docs
  3. Run ./prometheus - does it work?
  4. Open http://localhost:9090 - does it work?
  5. Execute the query up in the console to see the one target (itself) - should work OK, so we can turn it into a systemd service now:
    1. Create a new/separate user: sudo useradd --no-create-home --shell /bin/false prometheus
    2. Create a config folder: sudo mkdir /etc/prometheus
    3. Create folder /var/lib/prometheus for the data
    4. Move executables: sudo cp prometheus /usr/local/bin ; sudo cp promtool /usr/local/bin
    5. Move config file: sudo cp prometheus.yml /etc/prometheus/
    6. Copy the consoles folder: sudo cp -r consoles /etc/prometheus/ ; sudo cp -r console_libraries /etc/prometheus/
    7. Change owner for these folders & executables: sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
    8. The command (ExecStart) in the service file will then look like this (the unit runs it as User=prometheus): /usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus --web.console.templates /etc/prometheus/consoles --web.console.libraries /etc/prometheus/console_libraries
    9. Create a service file with this information at /etc/systemd/system/prometheus.service (a minimal sketch of the unit file follows after these steps) and reload: sudo systemctl daemon-reload
    10. Start the daemon sudo systemctl start prometheus ; sudo systemctl enable prometheus
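  A minimal sketch of such a systemd unit file (paths match the steps above; adjust as needed):

    # /etc/systemd/system/prometheus.service
    [Unit]
    Description=Prometheus
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
      --config.file /etc/prometheus/prometheus.yml \
      --storage.tsdb.path /var/lib/prometheus \
      --web.console.templates /etc/prometheus/consoles \
      --web.console.libraries /etc/prometheus/console_libraries

    [Install]
    WantedBy=multi-user.target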

Node exporter

  • Download *.tar from http://prometheus.io/download
  • untarred folder contains basically just the binary node_exporter
  • The node_exporter listens on HTTP port 9100 by default
  • Run the ./node_exporter and then curl localhost:9100/metrics
  • Run in the background & start on boot using the systemd, very similar to Prometheus installation:
    sudo cp node_exporter /usr/local/bin
    sudo useradd --no-create-home --shell /bin/false node_exporter
    sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
    sudo vi /etc/systemd/system/node_exporter.service
    sudo systemctl daemon-reload
    sudo systemctl start node_exporter ; sudo systemctl enable node_exporter

Prometheus configuration

  • Sections:
    1. global - Default parameters, it can be overridden by the same variables in sub-sections
    2. scrape_configs - Define targets and job_name, which is a collection of instances that need to be scraped
    3. alerting - Alerting specifies settings related to the Alertmanager
    4. rule_files - specifies a list of globs; rules and alerts are read from all matching files
    5. remote_read & remote_write - Settings related to the remote read/write feature
    6. storage - Storage related settings that are runtime reloadable
  • Example config:
    scrape_configs:
      - job_name: 'nodes'               # call it whatever
        scrape_interval: 30s            # from the target every X seconds
        scrape_timeout: 3s              # timeouts after X seconds
        scheme: https                   # http or https
        metrics_path: /stats/metrics    # non-default path that you send requests to
        static_configs:
          - targets: ['10.231.1.2:9090', '192.168.43.9:9090'] # two IPs
        # basic_auth                    # this is the next section
  • To reload the config: sudo systemctl restart prometheus
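  • Alternatively, a running Prometheus can re-read its config without a full restart (note: the HTTP endpoint only works if Prometheus was started with --web.enable-lifecycle):
    sudo killall -HUP prometheus                  # send SIGHUP to the process
    curl -X POST http://localhost:9090/-/reload   # lifecycle API endpoint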

Encryption & Authentication

  • between the Prometheus server and the targets

Encryption

  1. On the targets, you need to generate the key & crt pair first - e.g.:
    • sudo openssl req -new -newkey rsa:2048 -days 465 -nodes -x509 -keyout node_exporter.key -out node_exporter.crt -subj "..." -addext "subjectAltName = DNS:localhost"
    • then target config will have to be customized after that:
      # /etc/node_exporter/config.yml
      tls_server_config:
        # Certificate and key files for server to use to authenticate to client
        cert_file: node_exporter.crt
        key_file: node_exporter.key
    • The exporter supports TLS via a new web configuration file: ./node_exporter --web.config=config.yml
    • Test with: curl -k https://localhost:9100/metrics
  2. On the server, you need:
    • copy the node_exporter.crt from the target to the Prometheus server
    • update the scheme to https in the prometheus.yml and add tls_config with ca_file (e.g. /etc/prometheus/node_exporter.crt that we copied in the previous step) and insecure_skip_verify if self-signed:
      # /etc/prometheus/prometheus.yaml
      scrape_configs:
        - job_name: "node"
          scheme: https
          tls_config:
            # Certificate and key files for client cert authentication to the server
            ca_file: /etc/prometheus/node_exporter.crt
            insecure_skip_verify: true
    • restart prometheus service

Authentication

  • Authentication is done via generated hash (sudo apt install apache2-utils or httpd-tools etc.) and then: htpasswd -nBC 12 "" | tr -d ':\n' (will prompt for password and spits out the hash)
  • add the basic_auth_users and username + generated hash underneath it:
    # /etc/node_exporter/config.yml
    basic_auth_users:
      prometheus: $2y$12$daXru320983rnofkwehj4039F
  • restart node_exporter service
  • update Prometheus server's config with the same auth and restart Prometheus:
    - job_name: "node"
      basic_auth:
        username: prometheus
        password: <PLAIN TEXT PASSWORD!>

Metrics

  • 3 properties:
    • name - general feature of a system to be measured, may contain ASCII letters, digits, underscores and colons ([a-zA-Z_:][a-zA-Z0-9_:]*); colons are reserved for recording rules. Metric names cannot start with a number. The name is technically a label (e.g. __name__=node_cpu_seconds_total)
    • {labels (key/value pairs)} - allow splitting up a metric by a specified criteria (e.g. multiple CPUs, specific HTTP methods, API endpoints etc.); metrics can have more than 1 label, and label names may contain ASCII letters, digits and underscores ([a-zA-Z0-9_]*). Labels surrounded by __ are considered internal to Prometheus. Every metric is assigned 2 labels by default (instance and job).
    • value of the metric
  • Example = node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86: labels provide us information on which CPU this metric is for (cpu number zero)
  • when Prometheus scrapes a target and retrieves metrics, it also stores the time at which the metric was scraped
  • Example = 1668215300 (unix epoch timestamp, since Jan 1st 1970 UTC)
  • time series = stream of timestamped values sharing the same metric and set of labels
  • metrics have a TYPE (counter, gauge, histogram, summary) and HELP (description of what the metric is) attributes
  • explanation of each types:
    • counter can only go up, e.g. how many times did X happened?
    • gauge can go up or down, e.g. what is the current value of X?
    • histogram tells how long or how big something is, groups observations into configurable bucket sizes (e.g. cumulative response time buckets <1s, <0.5s, <0.2s)
      • e.g. request_latency_seconds_bucket{le="0.05"} 50 - buckets are cumulative (i.e. the le=0.05 bucket includes all requests that took less than 0.05s, which includes all requests that fall into the buckets below it (e.g. 0.03, 0.02, 0.01...))
      • e.g. to calculate the histogram's quantiles, we use histogram_quantile, an approximation of the value of a specific quantile: 75% of all requests have what latency? histogram_quantile(0.75, request_latency_seconds_bucket). To get an accurate value, make sure there is a bucket at the specific value that needs to be met. Every bucket you add slows Prometheus down a little!
    • summary is similar to histogram and tells us how many observations fell below X; buckets do not have to be defined ahead of time (similar to histogram, but percentages: response time 20% = <0.3s, 50% = <0.8s, 80% = <1s). Similarly to histogram, there will be _count and _sum metrics as well as quantiles like 0.7, 0.8, 0.9 (instead of buckets).
  • table - difference:
    histogram                                  | summary
    bucket sizes can be picked                 | quantiles must be defined ahead of time
    less taxing on client libraries            | more taxing on client libraries
    any quantile can be selected               | only quantiles predefined in the client can be used
    Prometheus server must calculate quantiles | very minimal server-side cost

Quiz:

Q1: How many total unique time series are there in this output?

node_arp_entries{instance="node1", job="node"} 200
node_arp_entries{instance="node2", job="node"} 150
node_cpu_seconds_total{cpu="0", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="0", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node2", mode="idle"}
node_memory_Active_bytes{instance="node1", job="node"} 419124
node_memory_Active_bytes{instance="node2", job="node"} 55589

A1: 9

Q2: What metric should be used to report the current memory utilization?

A2: gauge

Q3: What metric should be used to report the amount of time a process has been running?

A3: counter

Q4: Which of these is NOT a valid metric?

A4: 404_error_count

Q5: How many labels does the following time series have? http_errors_total{instance="1.1.1.1:80", job="api", code="400", endpoint="/user", method="post"} 55234

A5: 5

Q6: A web app is being built that allows users to upload pictures, management would like to be able to track the size of uploaded pictures and report back the number of photos that were less than 10Mb, 50Mb, 100MB, 500MB, and 1Gb. What metric would be best for this?

A6: histogram

Q7: What are the two labels every metric is assigned by default?

A7: instance, job

Q8: What are the 4 types of Prometheus metrics?

A8: counter, gauge, histogram, summary

Q9: What are the two attributes provided by a metric?

A9: Help, Type

Q10: For the metric http_requests_total{path="/auth", instance="node1", job="api"} 7782; What is the metric name?

A10: http_request_total

Q11: For the http_request_total metric, what is the query/metric name that would be used to get the count of total requests on node node01:3000?

A11: http_request_total_count{instance="node01:3000"}

Q12: Construct a query to return the total number of requests for the /events route with a latency of less than 0.4s across all nodes.

A12: http_request_total_bucket{route="/events",le="0.4"}

Q13: Construct a query to find out how many requests took somewhere between 0.08s and 0.1s on node node02:3000.

A13: ?
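A possible query (a sketch; assumes the http_request_total histogram from Q11/Q12 and that 0.08 and 0.1 are real bucket boundaries; buckets are cumulative, so subtract the lower bucket from the higher one):

http_request_total_bucket{instance="node02:3000", le="0.1"} - ignoring(le) http_request_total_bucket{instance="node02:3000", le="0.08"}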

Q14: Construct a query to calculate the rate of http requests that took less than 0.08s. Use a time window of 1m across all nodes.

A14: ?
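A possible answer, assuming the same http_request_total histogram and a 0.08 bucket boundary:

rate(http_request_total_bucket{le="0.08"}[1m])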

Q15: Construct a query to calculate the average latency of a request over the past 4 minutes. Use the formula below to calculate average latency of request: rate of sum-of-all-requests / rate of count-of-all-requests

A15: ?
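A sketch that applies the formula from the question to the _sum and _count submetrics of the assumed http_request_total histogram:

rate(http_request_total_sum[4m]) / rate(http_request_total_count[4m])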

Q16: Management would like to know what is the 95th percentile for the latency of requests going to node node01:3000. Construct a query to calculate the 95th percentile. A16: ?
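A sketch, again assuming the http_request_total histogram (a rate() over a range is often wrapped around the bucket selector as well, but no window is specified here):

histogram_quantile(0.95, http_request_total_bucket{instance="node01:3000"})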

Q17: The company is now offering customers an SLO stating that, 95% of all requests will be under 0.15s. What bucket size will need to be added to guarantee that the histogram_quantile function can accurately report whether or not that SLO has been met?

A17: 0.15

Q18: A summary metric http_upload_bytes has been added to track the amount of bytes uploaded per request. What are percentiles being reported by this metric?

  1. 0.02, 0.05, 0.08, 0.1, 0.13, 0.18, 0.21, 0.24, 0.3, 0.35, 0.4
  2. 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99
  3. events, tickets
  4. 200, 201, 400, 404

A18: ?

Expression browser

  • Web UI for Prometheus server to query data
  • up - returns which targets are in up state (you can see an instance and job and value on the right - 0 and 1)

Prometheus on Docker

  • Pull image prom/prometheus
  • Configure prometheus.yml
  • Expose ports, bind mounts
  • Run: docker run -d -v /path-to/prometheus.yml:/etc/prometheus/prometheus.yml -p 9090:9090 prom/prometheus

PromTools

  • check & validate configuration before applying (e.g before production)
  • prevent downtime while config issues are being identified
  • validate metrics passed to it are correctly formatted
  • can perform queries on a Prom server
  • debugging & profiling a Prom server
  • perform unit tests against Recording/Alerting rules
  • To check/validate config, run: promtool check config /etc/prometheus/prometheus.yml
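  • A few example invocations covering the points above (file names are only illustrative):
    promtool check rules /etc/prometheus/rules.yml            # validate a recording/alerting rules file
    promtool test rules tests.yml                             # unit test recording/alerting rules
    curl -s localhost:9100/metrics | promtool check metrics   # validate exposition format of metrics
    promtool query instant http://localhost:9090 'up'         # ad-hoc query against a running server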

Container metrics

  • metrics can be scraped from containerized envs

Docker engine metrics (how much CPU does Docker use etc. not metrics specific to a container!)

  • vi /etc/docker/daemon.json:
    {
      "metrics-addr": "127.0.0.1:9323",
      "experimental": true
    }
  • sudo systemctl restart docker
  • curl localhost:9323/metrics
  • Prometheus job update:
    scrape_configs:
      - job_name: "docker"
        static_configs:
          - targets: ["12.1.13.4:9323"]

cAdvisor (how much memory does each container use? container uptime? etc.)

  • vi docker-compose.yml to pull gcr.io/cadvisor/cadvisor
  • docker-compose up or docker compose up
  • curl localhost:8080/metrics
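  • A minimal docker-compose.yml sketch for the above (the volume mounts shown are the ones cAdvisor typically needs; adjust to your environment):
    version: "3"
    services:
      cadvisor:
        image: gcr.io/cadvisor/cadvisor
        ports:
          - 8080:8080
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro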

PromQL

  • short for Prometheus Query Language
  • data returned can be visualized in dashboards
  • used to build alerting rules to notify about thresholds

Data Types

  1. String (currently unused)
  2. Scalar - numeric floating point value (e.g. 54.743)
  3. Instant vector - set of time series containing a single sample for each time series sharing the same timestamp (e.g. node_cpu_seconds_total finds all unique labels and a value for each, and they will all be at a single point in time)
  4. Range vector - set of time series containing a range of data points over time for each time series (e.g. node_cpu_seconds_total[3m] finds all unique labels, but all values and timestamps from the past 3 minutes)

Selectors

  • if we only want to return a subset of time series for a metric = label matchers:
    • exact match = (e.g. node_filesystem_avail_bytes{instance="node1"})
    • negative equality != (e.g. node_filesystem_avail_bytes{device!="tmpfs"})
    • regular expression =~ (e.g. starts with /dev/sda - node_filesystem_avail_bytes{device=~"/dev/sda.*"})
    • negative regular expression !~ (e.g. mountpoint does not start with /boot - node_filesystem_avail_bytes{mountpoint!~"/boot.*"})
  • we can combine multiple selectors with comma ,: (e.g. node_filesystem_avail_bytes{instance="node1",device!="tmpfs"})

Modifiers

  • to get historic data, use an offset modifier after the label matching (e.g. get value 5 minutes ago - node_memory_active_bytes{instance="node1"} offset 5m)
  • to get to the exact point in time (e.g. get value on September 15 - node_memory_active_bytes{instance="node1"} @1663265188)
  • you can use both modifiers and order does not matter (e.g. @1663265188 offset 5m = offset 5m @1663265188)
  • you can also add range vectors (e.g. get 2 minutes worth of data 10 minutes before September 15 [2m] @1663265188 offset 5m)

Operators

  • between instant vectors and scalars
  • types:
    1. Arithmetic +, -, *, /, %, ^ (e.g. node_memory_Active_bytes / 1024 - but it drops the metric name in the output as it is no longer the original metric!)
    2. Comparison ==, !=, >, <, >=, <=, bool (e.g. node_network_flags > 100, node_network_receive_packets_total >= 220, node_filesystem_avail_bytes < bool 1000 returns 0 or 1, mostly for generating alerts)
    3. Logical OR, AND, UNLESS (e.g. node_filesystem_avail_bytes > 1000 and node_filesystem_avail_bytes < 3000). Unless operator results in a vector consisting of elements on the left side for which there are no elements on the right side (e.g. return all vectors greater than 1000 unless they are greater than 30000 node_filesystem_avail_bytes > 1000 unless node_filesystem_avail_bytes > 30000)
    4. more than one operator follows the order of precedence from highest to lowest, while operators on the same precedence level are performed from the left (e.g. 2 * 3 % 2 = (2 * 3) % 2); power, however, is performed from the right (e.g. 2 ^ 3 ^ 2 = 2 ^ (3 ^ 2)):
      high:  ^
             *, /, %, atan2
             +, -
             ==, !=, <=, <, >=, >
             and, unless
      low:   or

Quiz

Q1: Construct a query to return all filesystems that have over 1000 bytes available on all instances under web job.

A1: node_filesystem_avail_bytes{job="web"} > 1000

Q2: Which of the following queries you will use for loadbalancer:9100 host to return all the interfaces that have received less than or equal to 10000 bytes of traffic?

A2: node_network_receive_bytes_total{instance="loadbalancer:9100"} <= 10000

Q3: node_filesystem_files tracks the filesystem's total file nodes. Construct a query that only returns time series greater than 500000 and less than 10000000 across all jobs

A3: node_filesystem_files > 500000 and node_filesystem_files < 10000000

Q4: The metric node_filesystem_avail_bytes lists the available bytes for all filesystems, and the metric node_filesystem_size_bytes lists the total size of all filesystems. Run each metric and see their outputs. There are three properties/labels these will return: device, fstype, and mountpoint. Which of the following queries will show the percentage of free disk space for all filesystems on all the targets under web job whose device label does not match tmpfs?

A4: node_filesystem_avail_bytes{job="web", device!="tmpfs"}*100 / node_filesystem_size_bytes{job="web", device!="tmpfs"}

Vector matching

  • between 2 instant vectors (e.g. to get the percentage of free space node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 )
  • samples with exactly the same labels get matched together (e.g. instance and job and mountpoint must be the same to get a match) - every element in the vector on the left tries to find a single matching element on the right
  • to perform operation on 2 vectors with differing labels like http_errors code="500", code="501", code="404", method="put" etc. use the ignoring keyword (e.g. http_errors{code="500"} / ignoring(code) http_requests)
  • if the entries with e.g. methods put and del have no match in both metrics http_errors and http_requests, they will not show up in the results!
  • to get results on all labels to match on, we use the on keyword (e.g. http_errors{code="500"} / on(method) http_requests)
  • table - matching:
    vector1             + vector2            = resulting vector
    {cpu=0,mode=idle}     {cpu=1,mode=steal}   {cpu=0}
    {cpu=1,mode=iowait}   {cpu=2,mode=user}    {cpu=1}
    {cpu=2,mode=user}     {cpu=0,mode=idle}    {cpu=2}
  • Resulting vector will have matching elements with all labels listed in on() or all labels not listed in ignoring(): e.g. vector1{} + on(cpu) vector2{} or vector1{} + ignoring(mode) vector2{}
  • Another example is: http_errors_total / ignoring(error) http_requests_total = http_errors_total / on(instance, job, path) http_requests_total

Quiz

Q1: Which of the following queries can be used to track the total number of seconds cpu has spent in user + system mode for instance loadbalancer:9100?

A1: node_cpu_seconds_total{instance="loadbalancer:9100", mode="user"} + ignoring(mode) node_cpu_seconds_total{instance="loadbalancer:9100", mode="system"}

Q2: Construct a query that will find out what percentage of time each cpu on each instance was spent in mode user. To calculate the percentage in mode user, get the total seconds spent in mode user and divide that by the sum of the time spent across all modes. Further, multiply that result by 100 to get a percentage.

A2: node_cpu_seconds_total{mode="user"}*100 / ignoring(mode, job) sum by(instance, cpu) (node_cpu_seconds_total)

Many-to-one vector matching

  • when you get error executing the query multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)
  • it is where each vector elements on the one side can match with multiple elements on the many side (e.g. http_errors + on(path) group_left http_requests) - group_left tells PromQL that elements from the right side are now matched with multiple elements from the left (group_right is the opposite of that - depending on which side is the many and which side is one)
| many                      | one            | many + one (result)       |
|---------------------------|----------------|---------------------------|
| {error=400,path=/cats} 2  | {path=/cats} 2 | {error=400,path=/cats} 4  |
| {error=500,path=/cats} 5  | {path=/cats} 2 | {error=500,path=/cats} 7  |
| {error=400,path=/dogs} 1  | {path=/dogs} 7 | {error=400,path=/dogs} 8  |
| {error=500,path=/dogs} 7  | {path=/dogs} 7 | {error=500,path=/dogs} 14 |

Quiz

Q1: The api job collects metrics on an API used for uploading files. The API has 3 endpoints /images, /videos and /songs, which are used to upload respective file types. The API provides 2 metrics to track: http_uploaded_bytes_total - tracks the number of uploaded bytes and http_upload_failed_bytes_total - tracks the number of bytes failed to upload. Construct a query to calculate the percentage of bytes that failed for each endpoint. The formula for the same is http_upload_failed_bytes_total*100 / http_uploaded_bytes_total.

A1: http_upload_failed_bytes_total*100 / ignoring(error) group_left http_uploaded_bytes_total

Aggregation operators

  • allow you to take an instant vector and aggregate its elements, resulting in a new instant vector with fewer elements
  • sum, min, max, avg, group, stddev, stdvar, count, count_values, bottomk, topk, quantile
  • for example sum(http_requests), max(http_requests)
  • by keyword allows you to choose which labels to aggregate along (e.g. sum by(path) (http_requests), sum by(method) (http_requests), sum by(instance) (http_requests), sum by(instance, method) (http_requests))
  • without keyword does the opposite of by and tells the query which labels not to include in aggregation (e.g. sum without(cpu, mode) (node_cpu_seconds_total))

Quiz

Q1: On loadbalancer:9100 instance, calculate the sum of the size of all filesystems. The metric to get filesystem size is node_filesystem_size_bytes

A1: sum(node_filesystem_size_bytes{instance="loadbalancer:9100"})

Q2: Construct a query to find how many CPUs instance loadbalancer:9100 have. You can use the node_cpu_seconds_total metric to find out the same.

A2: count(sum by (cpu) (node_cpu_seconds_total{instance="loadbalancer:9100"}))

Q3: Construct a query that will show the number of CPUs on each instance across all jobs.

A3: count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total))

Q4: Use the node_network_receive_bytes_total metric to calculate the sum of the total received bytes across all interfaces on per instance basis

A4: sum by(instance)(node_network_receive_bytes_total)

Q5: Which of the following queries will be used to calculate the average packet size for each instance?

A5: sum by(instance)(node_network_receive_bytes_total) / sum by(instance)(node_network_receive_packets_total)

Functions

  • sorting, math, label transformations, metric manipulation
  • use the round function to round the query's result to the nearest integer value
  • round up to the nearest integer: ceil(node_cpu_seconds_total)
  • round down: floor(node_cpu_seconds_total)
  • absolute value for negative numbers: abs(1-node_cpu_seconds_total)
  • date & time: time(), minute() etc.
  • vector function takes a scalar value and converts it into an instant vector: vector(4)
  • scalar function returns the value of the single element as a scalar (otherwise returns NaN if the input vector does not have exactly one element): scalar(process_start_time_seconds)
  • sorting: sort (ascending) and sort_desc (descending)
  • rate at which a counter metric increases: rate and irate (e.g. group data points into 60s windows, take the last value minus the first value in each window and divide by 60: rate(http_errors[1m]); irate is similar, but only uses the last and second-to-last data points in the window: irate(http_errors[1m]))
  • table - difference:
| rate                                                    | irate                                                       |
|---------------------------------------------------------|-------------------------------------------------------------|
| looks at the first and last data points within a range  | looks at the last two data points within a range            |
| effectively an average rate over the range              | instant rate                                                |
| best for slow-moving counters and alerting rules        | should be used for graphing volatile, fast-moving counters  |

Notes:

  • make sure there are at least 4 samples within the time range (e.g. a 15s scrape interval with a 60s window gives 4 samples)
  • when combining rate with an aggregation operator, always take rate() first, then aggregate (so it can detect counter resets) - see the example after this list
  • to get the rate of increase of the sum of latency across all requests: rate(requests_latency_seconds_sum[1m])
  • to calculate the average latency of a request over the past 5m: rate(requests_latency_seconds_sum[5m]) / rate(requests_latency_seconds_count[5m])
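
A quick PromQL illustration of the rate-then-aggregate rule above (the metric and window are chosen for illustration only):

    # take rate() per time series first, then aggregate the per-series rates
    sum by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))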

Quiz

Q1: Management wants to keep track of the rate of bytes received by each instance. Each instance has two interfaces, so the rate of traffic being received on them must be summed up. Calculate the rate of received node_network_receive_bytes_total using 2-minute window, sum the rates across all interfaces, and group the results by instance.

A1: sum by(instance) (rate(node_network_receive_bytes_total[2m]))

Subquery

  • Syntax: <instant_query> [<range>:<resolution>] [offset <duration>]
  • Example: rate(http_requests_total[1m]) [5m:30s] - where the sample range is 1m, the query range covers data from the last 5m, and the query step for the subquery is 30s (the gap between evaluation points)
  • maximum value over a 10min of a gauge metric (max_over_time(node_filesystem_avail_bytes[10m]))
  • for counter metrics, we need to find the max value of the rate over the past 5min (e.g. the maximum rate of requests over the last 5 minutes with a 30s query step and a sample range of 1m: max_over_time(rate(http_requests_total[1m]) [5m:30s]))

Quiz

Q1: There were reports of a small outage of an application in the past few minutes, and some alerts pointed to potential high iowait on the CPUs. We need to calculate when the iowait rate was the highest over the past 10 minutes. [Construct a subquery that will calculate the rate at which all cpus spent in iowait mode using a 1 minute time window for the rate function. Find the max value of this result over the past 10 minutes using a 30s query step for the subquery.]

A1: max_over_time(sum(rate(node_cpu_seconds_total{mode="iowait"}[1m]))[10m:30s]) (sum across CPUs; drop the sum() if a per-CPU maximum is wanted)

Q2: Construct a query to calculate the average over time (avg_over_time) rate of http_requests_total over the past 20m using 1m query step.

A2: avg_over_time(rate(http_requests_total[1m])[20m:1m]) (a 1m rate window is assumed, since the question only specifies the 20m range and 1m step)

Recording rules

  • allow Prometheus to periodically evaluate PromQL expressions and store the resulting time series generated by them
  • speeding up your dashboards
  • provide aggregated results for use elsewhere
  • recording rules go in a separate file called a rule file:
    global: ...
    rule_files:
      - rules.yml # globs can be used here, like /etc/prometheus/rule_files.d/*.yml
    scrape_configs: ...
  • Prometheus must be restarted or its configuration reloaded (e.g. kill -HUP <pid>, or an HTTP POST to /-/reload if --web.enable-lifecycle is enabled) for this change to take effect
  • syntax of the rules.yml file:
    groups: # groups running in parallel
      - name: <group_name_1>
        interval: <evaluation interval, global by default>
        rules: # however, rules evaluated sequentially
          - record: <rule_name_1>
            expr: <promql_expression_1>
            labels:
              <label_name>: <label_value>
          - record: <rule_name_2> # you can also reference previous rule(s)
            expr: <promql_expression_1>
            labels:
      - name: <group_name_2>
        ...
  • example of the rules.yml file:
    groups:
      - name: example1 # it will show up in the WebGui under "status" - "rules"
        interval: 15s
        rules:
          - record: node_memory_memFree_percent
            expr: 100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes
          - record: node_filesystem_free_percent
            expr: 100 * node_filesystem_free_bytes / node_filesystem_size_bytes
  • best practices for rule naming: aggregation_level:metric_name:operations, e.g. we have a http_errors counter with two instrumentation labels "method" and "path". All the rules for a specific job should be contained in a single group. It will look like:
    - record: job_method_path:http_errors:rate5m
      expr: sum without(instance) (rate(http_errors{job="api"}[5m]))

HTTP API

  • execute queries, gather information on alerts, rules, service discovery related configs
  • send the POST request to http://<prometheus_ip>/api/v1/query
  • example: curl http://<prometheus_ip>:9090/api/v1/query --data 'query=node_arp_entries{instance="192.168.1.168:9100"}'
  • query at a specific time, just add another --data 'time=169386192'
  • response back as JSON
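
For an instant query, the JSON response has the following general shape (the values shown here are illustrative):

    {
      "status": "success",
      "data": {
        "resultType": "vector",
        "result": [
          {
            "metric": { "__name__": "node_arp_entries", "instance": "192.168.1.168:9100", "job": "node" },
            "value": [ 1693861920, "5" ]
          }
        ]
      }
    }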

Dashboarding & Visualization

  • several different ways:
    • expression browser with graph tab (built-in)
    • console templates (built-in)
    • 3rd party like Grafana
  • expression browser has limited functionality, only for ad-hoc queries and quick debugging, cannot create custom dashboards, not good for day-to-day monitoring, but can at least have multiple panels and compare graphs

Console Templates

  • allow to create custom HTML pages using Go templating language (typically between {{ and }})
  • Prometheus metrics, queries and charts can be embedded in the templates
  • ls /etc/prometheus/consoles to see the example *.html templates (to view one, go to http://localhost:9090/consoles/index.html.example)
  • boilerplate will typically contain:
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Memory details</h1>
    active memory: {{ template "prom_query_drilldown" (args "node_memory_Active_bytes") }}
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
  • an example of inserting a chart:
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Memory details</h1>
    active memory: {{ template "prom_query_drilldown" (args "node_memory_Active_bytes") }}
    <div id="graph"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#graph"),
    expr: "rate(node_memory_Active_bytes[2m])"
    })
    </script>
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
  • another example with memory/cpu graphs:
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Node Stats</h1>
    <h3>Memory</h3>
    <strong>Memory utilization:</strong> {{ template "prom_query_drilldown" (args "100- (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100)") }}
    <br/>
    <strong>Memory Size:</strong> {{ template "prom_query_drilldown" (args "node_memory_MemTotal_bytes/1000000" "Mb") }}
    <h3>CPU</h3>
    <strong>CPU Count:</strong> {{ template "prom_query_drilldown" (args "count(node_cpu_seconds_total{mode='idle'})") }}
    <br/>
    <strong>CPU Utilization:</strong> {{ template "prom_query_drilldown" (args "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/56") }}
    <!--
    Expression explanation: the expression takes the current rate of all CPU modes except idle (idle means the CPU isn't being used), sums them up and multiplies by 100 to get a percentage. The final number is divided by 56 (if this server/node has 56 CPUs; adjust this divisor to the node's CPU count to get per-CPU utilization).
    -->
    <div id="cpu"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#cpu"),
    expr: "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/2",
    })
    </script>
    <h3>Network</h3>
    <div id="network"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#network"),
    expr: "rate(node_network_receive_bytes_total[2m])",
    })
    </script>
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}

Application Instrumentation

  • the Prometheus client libraries provide an easy way to add instrumentation to your code in order to track and expose metrics for Prometheus
  • they do 2 things:
    • Track metrics in the Prometheus expected format
    • Expose metrics via /metrics path so they can be scraped
  • official and unofficial libraries
  • Example for Python:
    • You have an existing API in Flask, run pip install prometheus_client
    • In your code, import it: from prometheus_client import Counter
    • Initialize counter object: REQUESTS = Counter('http_requests_total', 'Total number of requests')
    • When do we want to increment this? Within all of the @app.get("/path") like this: REQUESTS.inc()
    • We can also get total requests per path using different counter objects, but that is not recommended. Instead we can use labels:
      • REQUESTS = Counter('http_requests_total', 'Total number of requests', labelnames=['path'])
      • REQUESTS.labels('/cars').inc()
    • Then you can do the same approach for different HTTP method: labelnames=['path', 'method'] and REQUESTS.labels('/cars', 'post').inc()
    • How to expose to /metrics endpoint though?
      from prometheus_client import Counter, start_http_server
      if __name__ == '__main__':
        start_http_server(8000) # start the metrics server on port
        app.run(port='5001')    # this is the Flask app
    • curl 127.0.0.1:8000 will show the metrics
    • however, you can also expose the metrics from a Flask route and have the Flask app on http://localhost:5001 and the metrics on http://localhost:5001/metrics, e.g. app.wsgi_app = DispatcherMiddleware(app.wsgi_app, { '/metrics': make_wsgi_app() }) - see the sketch after the complete example below
  • complete working example:
    from flask import Flask
    from prometheus_client import Counter, start_http_server, Gauge
    
    REQUESTS = Counter('http_requests_total', 'Total number of requests', labelnames=['path', 'method'])
    
    ERRORS = Counter('http_errors_total',
                    'Total number of errors', labelnames=['code'])
    
    IN_PROGRESS = Gauge('inprogress_requests',
                        'Total number of requests in progress')
    
    def before_request():
        IN_PROGRESS.inc()
    
    def after_request(response):
        IN_PROGRESS.dec()
        return response
    
    app = Flask(__name__)
    app.before_request(before_request)   # register the hooks so the in-progress gauge is actually updated
    app.after_request(after_request)
    
    @app.get("/products")
    def get_products():
        REQUESTS.labels('products', 'get').inc()
        return "product"
    
    @app.post("/products")
    def create_product():
        REQUESTS.labels('products', 'post').inc()
        return "created product", 201
    
    @app.get("/cart")
    def get_cart():
        REQUESTS.labels('cart', 'get').inc()
        return "cart"
    
    @app.post("/cart")
    def create_cart():
        REQUESTS.labels('cart', 'post').inc()
        return "created cart", 201
    
    @app.errorhandler(404)
    def page_not_found(e):
        ERRORS.labels('404').inc()
        return "page not found", 404
    
    if __name__ == '__main__':
        start_http_server(8000)
        app.run(debug=False, host="0.0.0.0", port=6000)
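
A minimal sketch of the DispatcherMiddleware approach mentioned earlier, serving /metrics on the same port as the app (the route and port are illustrative):

    from flask import Flask
    from prometheus_client import make_wsgi_app
    from werkzeug.middleware.dispatcher import DispatcherMiddleware

    app = Flask(__name__)

    @app.get("/")
    def index():
        return "hello"

    # mount the Prometheus WSGI app under /metrics of the Flask app
    # (metrics will be served on http://localhost:5001/metrics)
    app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {'/metrics': make_wsgi_app()})

    if __name__ == '__main__':
        app.run(port=5001)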

Implementing histogram & summary in your Python code (example)

# add a histogram metric to track latency/response time for each request
# (requires: from prometheus_client import Histogram)
LATENCY = Histogram('request_latency_seconds', 'Request Latency', labelnames=['path', 'method'])
# in a before_request hook, record the start time: request.start_time = time.time()
# in an after_request hook, calculate request_latency = time.time() - request.start_time and pass it to:
LATENCY.labels(request.path, request.method).observe(request_latency)
  • client libraries can let you specify bucket sizes (e.g. buckets=[0.01, 0.02, 0.1])
  • to configure a summary it is exactly the same, just use LATENCY = Summary(...) instead of Histogram(...)
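
A minimal runnable sketch tying the histogram to Flask request hooks (the bucket values, ports and the /products route are illustrative):

    import time
    from flask import Flask, request
    from prometheus_client import Histogram, start_http_server

    LATENCY = Histogram('request_latency_seconds', 'Request Latency',
                        labelnames=['path', 'method'],
                        buckets=[0.01, 0.02, 0.1, 0.5, 1, 5])

    app = Flask(__name__)

    @app.before_request
    def start_timer():
        request.start_time = time.time()

    @app.after_request
    def record_latency(response):
        LATENCY.labels(request.path, request.method).observe(time.time() - request.start_time)
        return response

    @app.get("/products")
    def get_products():
        return "product"

    if __name__ == '__main__':
        start_http_server(8000)   # metrics exposed on http://localhost:8000/metrics
        app.run(port=6000)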

Implementing gauge metric in your Python code (example)

# track the number of active requests getting processed at the moment
# (requires: from prometheus_client import Gauge)
IN_PROGRESS = Gauge('inprogress_requests', 'Total number of requests in progress')
# a before_request hook then increments it: IN_PROGRESS.inc()
# an after_request hook decrements it once the request is done: IN_PROGRESS.dec()
# (if labelnames=['path', 'method'] were declared, you would call IN_PROGRESS.labels(path, method).inc() instead)

Best practices

  • use snake_case naming, all lowercase, e.g. library_name_unit_suffix
  • first word should be app/library name it is used for
  • next add what is it used for
  • add unit (_bytes) at the end, use unprefixed base units (not microseconds or kilobytes)
  • avoid _count, _sum, _bucket suffixes
  • good examples: process_cpu_seconds, http_requests_total, redis_connection_errors, node_disk_read_bytes_total
  • bad examples: container_docker_restarts, http_requests_sum, nginx_disk_free_kilobytes, dotnet_queue_waiting_time
  • three types of services/apps:
    • online - immediate response is expected (tracking queries, errors, latency etc)
    • offline - no one is actively waiting for response (amount of queue, wip, processing rate, errors etc)
    • batch - similar to offline but regular, needs push gw (time processing, overall runtime, last completion time)

Service Discovery

  • allows Prometheus to dynamically update/populate/remove a list of endpoints to scrape
  • several built-ins: file, ec2, azure, gce, consul, nomad, k8s...
  • in the Web ui: "status" - "service discovery"

File SD

  • list of jobs/targets can be imported from a json/yaml file(s)
  • example #1:
    scrape_configs:
      - job_name: file-example
        file_sd_configs:
          - files:
            - file-sd.json
            - '*.json'
  • then the file-sd.json would look like e.g.:
    [
      {
        "targets": [ "node1:9100", "node2:9100" ],
        "labels": {
          "team": "dev",
          "job": "node"
        }
      }
    ]

AWS

  • just need to configure EC2 discovery in the config:
    scrape_configs:
      - job_name: ec2
        ec2_sd_configs: # IAM with at least AmazonEC2ReadOnly policy
          - region: <region>
            access_key: <access key>
            secret_key: <secret key>
  • automatically extracts metadata for each EC2 instance
  • defaults to using private IPs

Re-labeling

  • classify Prometheus targets & metrics by rewriting their label set
  • e.g. rename instance from node1:9100 to just node1, drop metrics, drop labels etc
  • 2 options:
    • relabel_configs (in Prometheus.yml) which occurs before scrape and only has access to labels added by SD mechanism
    • metric_relabel_configs (in Prometheus.yml) which occurs after the scrape

Examples - relabel_configs

  • example #1: __meta_ec2_tag_env = dev | prod
    - job_name: aws
      relabel_configs:
        - source_labels: [__meta_ec2_tag_env] # array of labels to match on
          regex: prod                         # to match on specific value of that label
          action: keep|drop|replace           # keep = only targets matching the regex keep being scraped (everything else is implicitly dropped), drop = matching targets are no longer scraped
  • example #2: when there are more than 1 source labels (array) they will be joined by a ;
    relabel_configs:
    - source_labels: [env, team]  # if the target has {env=dev} and {team=marketing}, we will keep it
      regex: dev;marketing
      action: keep                # everything else will be dropped
      # separator: "-"            # optional: the separator property changes the delimiter used to join the label values (default is ;)
  • target labels = labels that are added to every time series returned from a scrape of that target (i.e. they are assigned to every metric from that specific target). Discovered labels (those starting with __) are dropped after the initial relabeling process and do not become target labels.
  • example #3 of saving __address__=192.168.1.1:80 label in target label, but need to transform into {ip=192.168.1.1}
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*):.*    # assign everything before the `:` into a group referenced with `$1` below
        target_label: ip  # name of the new label
        action: replace
        replacement: $1
  • example #4 of combining labels env="dev" & team="web" will turn into info="web-dev"
    relabel_configs:
      - source_labels: [team, env]
        regex: (.*);(.*)  # parentheses create capture groups you can reference as $1, $2 below
        action: replace
        target_label: info
        replacement: $1-$2
  • example #5 Re-label so the label team name changes to the organization and the value gets prepended with org-
    relabel_configs:
    - source_labels: [team]
      regex: (.*)
      action: replace
      target_label: organization
      replacement: org-$1
  • to drop the label, use action: labeldrop based on the regex:
    - regex: size
      action: labeldrop
  • the opposite of labeldrop is labelkeep - but keep in mind ALL other labels will be dropped!
    - regex: instance|job
      action: labelkeep
  • to modify the label name (not the value), use labelmap like this:
    - regex: __meta_ec2_(.*)  # match any of these ec2 discovered labels - e.g. __meta_ec2_ami="ami-abcdefgh123456"
      action: labelmap
      replacement: ec2_$1     # we will prepend it with `ec2` - e.g. ec2_ami="ami-abcdefgh123456"

Examples - metric_relabel_configs

  • takes place after the scrape is performed and has access to the scraped metrics (not just the labels)
  • configuration is identical to relabel_configs
  • example #1:
    - job_name: example
      metric_relabel_configs: # this will drop a metric http_errors_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: drop        # or keep, which will drop EVERY other metrics
  • example #2:
    - job_name: example
      metric_relabel_configs: # rename a metric name from http_errors_total to http_failures_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: replace
          target_label: __name__            # what's the new name of the label key
          replacement: http_failures_total  # replacement is the new name of the value / the name of the metric
  • example #3:
    - job_name: example
      metric_relabel_configs: # drop a label named code
        - regex: code
          action: labeldrop   # drop a label for a metric
  • example #4:
    - job_name: example
      metric_relabel_configs: # strips off the forward slash and renames {path=/cars} -> {endpoint=cars}. Keep in mind there will now be both a path and an endpoint label; use labeldrop to get rid of the original path label, which carries the same information.
        - source_labels: [path]
          regex: \/(.*)       # any text after the forward slash (wrapping it in parenthesis gives you access with $)
          action: replace
          target_label: endpoint
          replacement: $1     # match the original value

Push Gateway

  • By default, Pushgateway listens to port 9091
  • used when a (batch) process has already exited before the scrape occurs
  • middle man between batch job and Prometheus server
  • Prometheus will scrape metrics from the PG
  • installation:
    1. pushgateway-1.4.3.linux-amd64.tar.gz from the releases page, untar, run ./pushgateway
    2. create a new user sudo useradd --no-create-home --shell /bin/false pushgateway
    3. copy the binary to /usr/local/bin, change owner to pushgateway, configure service file (same as the Prometheus)
    4. systemctl daemon-reload, restart, enable
    5. Test curl localhost:9091/metrics
  • configure Prometheus to scrape the gateway: same as other targets, but it needs the additional honor_labels: true (so the pushed metrics keep their own job/instance labels instead of being overwritten by the Pushgateway target's labels) - a minimal example scrape config is sketched at the end of this list
  • for sending the metrics, you send via HTTP POST request: http://<pushgateway_addr>:<port>/metrics/job/<job_name>/<label1>/<value1>/<label2>/<value2>... where job_name will be the job label of the metrics pushed, labels/values paths used as a grouping key, allows for grouping metrics together to update/delete multiple metrics at once. When sending a POST request, only metrics with the same name as the newly pushed, are replaced (this only applies to metrics in the same group):
    1. see the original metrics:
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
    2. POST the processing_time_seconds{quality="hd"} 999
    3. result:
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
  • example: push metric example_metric 4421 with a job label of {job="db_backup"}:
    # ('@-' tells curl to read the binary data from stdin)
    echo "example_metric 4421 | curl --data-binary @-http://localhost:9091/metrics/job/db_backup
  • another example with sending multiple metrics at once:
    cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/video_processing/instance/mp4_node1
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    EOF
  • When using HTTP PUT request however, the behavior is different. All metrics within a specific group get replaced by the new metrics being pushed (deletes preexisting):
    1. start with:
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
    2. PUT the processing_time_seconds{quality="hd"} 666
    3. result:
    processing_time_seconds{quality="hd"} 666
    
  • HTTP DELETE request will delete all metrics within a group (not going to touch any metrics in the other groups): curl -X DELETE http://localhost:9091/metrics/job/archive/app/web will only delete all with {app="web"}
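
A minimal example of the scrape config mentioned above (the target address is illustrative):

    scrape_configs:
      - job_name: pushgateway
        honor_labels: true              # keep the job/instance labels that were pushed with the metrics
        static_configs:
          - targets: ["localhost:9091"]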

Client library

  • Python: from prometheus_client import CollectorRegistry, pushadd_to_gateway, then initialize registry = CollectorRegistry(). You can then push via pushadd_to_gateway('user2:9091', job='batch', registry=registry)
  • 3 functions within a library to push metrics:
    • push - same as HTTP PUT (any existing metrics for this job are removed and the pushed metrics added)
    • pushadd - same as HTTP POST (overrides existing metrics with the same names, but all other metrics in group remain unchanged)
    • delete - same as HTTP DELETE (all metrics for a group are removed)
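
A minimal sketch of a batch job pushing a metric via the client library (the gateway address, job name and metric name are illustrative):

    from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

    registry = CollectorRegistry()
    last_success = Gauge('db_backup_last_success_timestamp_seconds',
                         'Unix time of the last successful DB backup',
                         registry=registry)
    last_success.set_to_current_time()

    # pushadd behaves like HTTP POST: only metrics with the same name in this group are replaced
    pushadd_to_gateway('localhost:9091', job='db_backup', registry=registry)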

Alerting

  • lets you define conditions that, if met, trigger alerts
  • these are standard PromQL expressions (e.g. node_filesystem_avail_bytes < 1000 returns only the series below the threshold, such as a value of 547)
  • Prometheus is only responsible for triggering alerts
  • responsibility of sending notification is offloaded onto alertmanager -> Slack, email, SMS etc.
  • alerts are visible in the web gui under "alerts" and they are green if not alerting
  • alerting rules are similar to recording rules, in fact they are in the same location (rule_files in prometheus.yaml):
    groups:
      - name: node
        interval: 15s
        rules:
          - record: ...
            expr: ...
          - alert: LowMemory
            expr: node_memory_memFree_percent < 20
  • The for clause tells Prometheus that an expression must evaluate to true for a specific period of time before the alert fires:
    - alert: node down
      expr: up{job="node"} == 0
      for: 5m   # expects the node to be down for 5 minutes before firing an alert
  • 3 alert states:
    1. inactive - has not returned any results [green]
    2. pending - it hasn't been long enough to be considered firing (related to for) [orange]
    3. firing - active for more than the defined for clause [red]

Labels & Annotations

  • optional labels can be added to alerts to provide a mechanism to classify and match alerts
  • important, because they can be used when you set up rules in the alert manager so you can match on these and group them together
- alert: node down
  expr: ...
  labels:
    severity: warning
- alert: multiple nodes down
  expr: ...
  labels:
    severity: critical
  • annotations (use Go templating) can be used to provide additional/descriptive information (unlike labels they do not play a part in the alerts identity)
- alert: node_filesystem_free_percent
  expr: ...
  annotations:
    description: "Filesystem {{.Labels.device}} on {{.Labels.instance}} is low on space, current available space is {{.Value}}"

This is how the templating works:

  • {{.Labels}} to access alert labels
  • {{.Labels.instance}} to get instance label
  • {{.Value}} to get the firing sample value
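
A consolidated sketch of an alerting rule combining the for clause, labels and templated annotations (the metric, threshold and durations are illustrative):

    groups:
      - name: node
        rules:
          - alert: NodeFilesystemLowSpace
            expr: 100 * node_filesystem_avail_bytes / node_filesystem_size_bytes < 10
            for: 10m
            labels:
              severity: warning
            annotations:
              description: "Filesystem {{.Labels.device}} on {{.Labels.instance}} is low on space, only {{.Value}}% free"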

Alertmanager

  • By default, Alertmanager is running on port 9093
  • responsible for receiving alerts generated by Prometheus and converting them to notifications
  • supports multiple Prometheus servers via API
  • workflow:
    1. dispatcher picks up the alerts first,
    2. inhibition allows suppressing certain alerts if other alerts already exist,
    3. silencing mutes alerts (e.g. maintenance)
    4. routing is responsible for deciding which alert gets sent to which receiver
    5. notification integrates with all 3rd party tools (email, Slack, SMS, etc.)
  • installation:
    1. tarball (alertmanager-0.24.0.linux-amd64.tar.gz) contains alertmanager binary, alertmanager.yml config file, amtool command line utility and data folder where the notification states are stored
    2. The installation is the same as previous tools (add new user, create /etc/alertmanager, create /var/lib/alertmanager, copy executables to /usr/local/bin, change ownerships, create service file, daemon-reload, start, enable). ExecStart in systemd expects --config.file and --storage.path!
    3. starting it is simple: ./alertmanager; it listens on port 9093 (you can see the web interface at http://localhost:9093)
    4. restarting AM can be done via HTTP POST to /-/reload endpoint, systemctl restart alertmanager or killall -HUP alertmanager
  • configure Prometheus to use that alertmanager:
    global: ...
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 127.0.0.1:9093
                - alertmanager2:9093
  • there are 3 main sections of alertmanager.yml:
    1. global - applies across all sections which can be overwritten (e.g. smtp_smarthost)
    2. route - set of rules to determine what alerts get matched up (match_re, matchers) with what receiver
    • at the top level, there is a default route - any alerts that don't match any of the other routes will use this default, example route:
    route:
      routes:
        - match_re:               # regular expression
            job: (node|windows)
          receiver: infra-email
        - matchers:               # all alerts with job=kubernetes & severity=ticket labels will match this rule
            job: kubernetes
            severity: ticket
          receiver: k8s-slack     # they will be sent to this receiver
    • nested routes / subroutes are also supported:
    routes:
    - matchers:                   # parent route
        job: kubernetes           # 2. all other alerts with this label will match this main route (k8s-email)
      receiver: k8s-email
      routes:                     # sub-route for further route matching (logical AND)
        - matchers:
            severity: pager       # 1. if the alert also has the label severity=pager, it will be sent to k8s-pager
          receiver: k8s-pager
    • if you need an alert to match two routes, use continue:
    route:
      routes:
        - receiver: alert-logs    # all alerts to be sent to alert-logs
          continue: true
        - matchers:
            job: kubernetes       # AND then if it also has this label job=kubernetes, it will be also sent to k8s-email
          receiver: k8s-email
    • grouping allows you to split up your notifications by labels (otherwise all alerts result in one big notification):
    receiver: fallback-pager
    group_by: [team]
    routes:
      - matchers:
          team: infra
        group_by: [region,env]    # infra team has alerts grouped based on region and env labels
        receiver: infra-email
        # any child routes underneath here will inherit the grouping policy and group based on same 2 labels region, env
    3. receivers - one or more notifiers to forward alerts to users (e.g. slack_configs)
    • make use of global configurations so all of the receivers don't have to manually define the same key:
    global:
      victorops_api_key: XXX      # this will be automatically provided to all receivers below
    receivers:
      - name: infra-pager
        victorops_configs:
          - routing_key: some-route-here
    • you can customize the message by using Go templating:
      • GroupLabels (e.g. title: in slack_configs: {{.GroupLabels.severity}} alerts in region {{.GroupLabels.region}})
      • CommonLabels
      • CommonAnnotations
      • ExternalURL
      • Status
      • Receiver
      • Alerts (e.g. text: in slack_configs: {{.Alerts | len}} alerts:)
        • Labels
        • Annotations ({{range .Alerts}}{{.Annotations.description}}{{"\n"}}{{end}})
        • Status
        • StartsAt
        • EndsAt
  • Example alertmanager.yml config:
    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@prometheus-server.com'
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 2m
      repeat_interval: 1h
      receiver: 'general-email'
      routes:
        - matchers:
                - team=global-infra
          receiver: global-infra-email
        - matchers:
                - team=internal-infra-email
          receiver: internal-infra-email
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'
      - name: global-infra-email
        email_configs:
                - to: root@prometheus-server.com
                  require_tls: false
      - name: internal-infra-email
        email_configs:
                - to: admin@prometheus-server.com
                  require_tls: false
      - name: general-email
        email_configs:
                - to: admin@prometheus-server.com
                  require_tls: false

Silences

  • alerts can be silenced to prevent them from generating notifications for a period of time (e.g. maintenance windows)
  • via the "new silence" button you specify start, end/duration, matchers (list of labels), creator and comment
  • you can then view those in the "silence" tab
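
Silences can also be created from the command line with amtool (a sketch; the matchers and values are illustrative):

    amtool silence add alertname=LowMemory instance=node1:9100 \
      --alertmanager.url=http://localhost:9093 \
      --duration=2h --author=ops --comment="planned maintenance"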

Monitoring Kubernetes

  • for both applications & clusters (control plane components, kubelet/cAdvisor, kube-state-metrics, node-exporter)
  • deploy Prometheus as close to targets as possible
  • make use of preexisting Kube infrastructure
  • to get access to cluster level metrics, we need kube-state-metrics
  • node-exporter should run on every node (deployed as a DaemonSet)
  • make use of service discovery via Kube API

Installation via Helm chart

  1. source: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
  2. makes use of the Prometheus Operator (https://github.com/prometheus-operator/prometheus-operator)
  3. it ships a couple of custom resources (CRDs): Prometheus, PrometheusRule, AlertmanagerConfig, ServiceMonitor, PodMonitor
  4. Add Helm repo: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  5. Update Helm repo: helm repo update
  6. Export all possible values: helm show values prometheus-community/kube-prometheus-stack > values.yaml
  7. Install the chart: helm install prometheus prometheus-community/kube-prometheus-stack
  8. (Optional kubectl patch ds prometheus-prometheus-node-exporter --type "json" -p '[{"op": "remove", "path" : "/spec/template/spec/containers/0/volumeMounts/2/mountPropagation"}]' - might need this due to node-exporter bug)
  • What does it do?
    • installs 2 StatefulSets (AM, Prometheus), 3 Deployments (Grafana, kube-prometheus-operator, kube-state-metrics), 1 DaemonSet (node-exporter)
    • SD can discover node, service, pod, endpoint (discovers targets from listed endpoints of a service. For each endpoint address one target is discovered per port. If the endpoint is backed up by a pod, all additional container ports of the pod, not bound to an endpoint port, are discovered as targets as well)

Monitor K8s Application

  • once you have application deployed and listening on some port (i.e. 3000), you can change the Prometheus value additionalScrapeConfigs in the Helm chart and upgrade via helm upgrade prometheus prometheus-community/kube-prometheus-stack -f new-values.yaml (this is less ideal option, it is better to use service monitors to apply new scrapes more declaratively)
  • instead, look at CRDs: kubectl get crd, specifically prometheuses, servicemonitors (set of targets to monitor and scrape, they allow to avoid touching config directly and give you a declarative Kube syntax to define targets)
  • if you want to scrape e.g. service named api-service exposing metrics on /swagger-stats/metrics, use:
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: api-service-monitor
      labels:
        release: prometheus # default label that is used by serviceMonitorSelector - it dynamically discovers it
        app: prometheus
    spec:
      jobLabel: job       # look for label job in the Service and take the value
      endpoints:
        - interval: 30s   # equivalent of scrape_interval
          port: web       # matches up with the port 3000 in the Service definition
          path: /swagger-stats/metrics  # equivalent of metrics_path (path where the metrics are exposed)
      selector:
        matchLabels:
          app: service-api
  • but also look at kind: Prometheus and what is under serviceMonitorSelector (e.g. matchLabels: release: prometheus) - this label allows Prometheus to find service monitors in the cluster and register them so that it can start scraping the app the service monitor is pointing to (can be confirmed via Web UI - Status - Configuration)
  • to add rules, use CRD called PrometheusRule - e.g.:
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        release: prometheus   # similar to ServiceMonitor, to add the rule dynamically
      name: api-rules
    spec:
      groups:
        - name: api
          rules:
            - alert: down
              expr: up == 0
              for: 0m
              labels:
                severity: critical
              annotations:
                summary: Prometheus target missing {{$labels.instance}}
  • to add AM rules, use CRD called AlertmanagerConfig - e.g.:
    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: alert-config
      labels:
        resource: prometheus  # once again, must match alertmanagerConfigSelector - BUT Helm chart does not specify a label, so you need to update this value yourself!
    spec:
      route:
        groupBy: ["severity"]
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 12h
        receiver: "webhook"
      receivers:
        - name: "webhook"
          webhookConfigs:
            - url: "http://example.com/"
  • table - keep in mind the differences between a standard AM and K8s one:
| Standard                   | Kubernetes                              |
|----------------------------|-----------------------------------------|
| group_by                   | groupBy                                 |
| group_wait                 | groupWait                               |
| group_interval             | groupInterval                           |
| repeat_interval            | repeatInterval                          |
| matchers: job: kubernetes  | matchers: name: job, value: kubernetes  |

Conclusion

Default ports:

| Component     | Port number |
|---------------|-------------|
| prometheus    | 9090        |
| node-exporter | 9100        |
| push gateway  | 9091        |
| alertmanager  | 9093        |

Author: @luckylittle

Last update: Wed Jan 25 05:22:25 UTC 2023
