@luckylittle (last active March 30, 2024 11:39)
Prometheus Certified Associate (PCA)

Mock Exam 1

Q1. The metric node_cpu_temp_celsius reports the current temperature of a node's CPU in Celsius. What query will return the average temperature across all CPUs on a per-node basis? The query should return {instance="node1"} 23.5 //average temp across all CPUs on node1 {instance="node2"} 33.5 //average temp across all CPUs on node2.

node_cpu_temp_celsius{instance="node1", cpu="0"} 28
node_cpu_temp_celsius{instance="node1", cpu="1"} 19
node_cpu_temp_celsius{instance="node2", cpu="0"} 36
node_cpu_temp_celsius{instance="node2", cpu="1"} 31

A1: avg by(instance) (node_cpu_temp_celsius)

Q2: What method does Prometheus use to collect metrics from targets? A2: pull

Q3: An engineer forgot to address an alert. Based on the Alertmanager config below, how long will they need to wait to see the alert again?

route:
  receiver: pager
  group_by: [alertname]
  group_wait: 10s
  repeat_interval: 4h
  group_interval: 5m
  routes:
    - match:
        team: api
      receiver: api-pager
    - match:
        team: frontend
      receiver: frontend-pager

A3: 4h

Q4: Which query below will get all time series for metric node_disk_read_bytes_total for job=web, and job=node? A4: node_disk_read_bytes_total{job=~"web|node"}

Q5: What type of database does Prometheus use? A5: Time Series

Q6: Analyze the alertmanager configs below. For all the alerts that got generated, how many total notifications will be sent out?

route:
  receiver: general-email
  group_by: [alertname]
  routes:
    - receiver: frontend-email
      group_by: [env]
      matchers:
        - team: frontend

The following alerts get generated by Prometheus with the defined labels.
alert1
team: frontend
env: dev

alert2
team: frontend
env: dev

alert3
team: frontend
env: prod

alert4
team: frontend
env: prod

alert5
team: frontend
env: staging

A6: 3

Q7: What is the Prometheus client library used for? A7: Instrumenting applications to generate Prometheus metrics and to push metrics to the Pushgateway

Q8: Management has decided to offer a file upload service where the SLO states that 97% of all uploads should complete within 30s. A histogram metric is configured to track the upload time. Which of the following bucket configurations is recommended for the desired SLO? A8: 10, 25, 27, 30, 32, 35, 49, 50 [since histogram quantiles are approximations, to find out if an SLO has been met, make sure that a bucket is specified at the desired SLO value]
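Once a 30s bucket exists, meeting the SLO can be checked with a quantile query along these lines (the metric name and the 5m window here are only illustrative, not from the exam):

histogram_quantile(0.97, rate(upload_duration_seconds_bucket[5m])) <= 30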

Q9: Which of the following is not a valid method for reloading alertmanager configuration? A9: hit the reload config button in alertmanager web ui

Q10: What two labels are assigned to every metric by default? A10: instance, job

Q11: What configuration will make it so Prometheus doesn’t scrape targets with a label of team: frontend?

#Option A:
relabel_configs:
  - source_labels: [team]
    regex: frontend
    action: drop

#Option B:
relabel_configs:
  - source_labels: [frontend]
    regex: team
    action: drop

#Option C:
metric_relabel_configs:
  - source_labels: [team]
    regex: frontend
    action: drop

#Option D:
relabel_configs:
  - match: [team]
    regex: frontend
    action: drop

A11: Option A [relabel_configs is where you will define which targets Prometheus should scrape]

Q12: Where should alerting rules be defined?

A12: separate rules file

Q13: Which query below will give the 99% quantile of the metric http_requests_total? A13: histogram_quantile(0.99, http_requests_total_bucket)

Q14: What metric should be used to track the uptime of a server? A14: counter

Q15: Which component of the Prometheus architecture should be used to collect metrics of short-lived jobs? A15: push gateway

Q16: What is the purpose of Prometheus scrape_interval? A16: Defines how frequently to scrape a target

Q17: What does the following metric_relabel_config do?

scrape_configs:
  - job_name: example
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: database_errors_total
        action: replace
        target_label: __name__
        replacement: database_failures_total

A17: Renames the metric database_errors_total to database_failures_total

Q18: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster? A18: service discovery

Q19: For a histogram metric, what are the different submetrics? A19: _count [total number of observations], _bucket [number of observations for a specific bucket], _sum [sum of all observations]

Q20: What is the default web port of Prometheus? A20: 9090

Q21: Add an annotation to the alert called description that will print out a message that looks like this: Instance <instance> has low disk space on filesystem <mountpoint>, current free space is at <value>%

groups:
  - name: node
    rules:
      - alert: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10

## Examples of the two metrics used in the alert can be seen below.

# node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}

# node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}

# Choose the correct answer:
# Option A:
description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%

# Option B:
description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%

# Option C:
description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%

# Option D:
description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%

A21: Option B

Q22: What does the double underscore __ before a label name signify? A22: The label is a reserved label

Q23: The metric http_errors_total has 3 labels, path, method, error. Which of the following queries will give the total number of errors for a path of /auth, method of POST, and error code of 401? A23: http_errors_total{path="/auth", method="POST", code="401"}

Q24: What are the different states a Prometheus alert can be in? A24: inactive, pending, firing

Q25: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects? A25: exporters

Q26: Which of the following is not a valid time value to be used in a range selector? A26: 2mo

Q27: Analyze the example Alertmanager config below and determine which receiver the alert will be sent to when an alert with the labels team: api and severity: critical arrives at Alertmanager.

route:
  receiver: general-email
  routes:
    - receiver: frontend-email
      matchers:
        - team: frontend
      routes:
        - matchers:
            severity: critical
          receiver: frontend-pager
    - receiver: backend-email
      matchers:
        - team: backend
      routes:
        - matchers:
            severity: critical
          receiver: backend-pager
    - receiver: auth-email
      matchers:
        - team: auth
      routes:
        - matchers:
            severity: critical
          receiver: auth-pager

A27: general-email

Q28: A metric to track requests to an api http_requests_total is created. Which of the following would not be a good choice for a label? A28: email

Q29: Which query below will return a range vector? A29: node_boot_time_seconds[5m]

Q30: Based off the metrics below, which query will return the same result as the query database_write_timeouts / ignoring(error) database_error_total

database_write_timeouts{instance="db1", job="db", error="212", type="mysql"} 12
database_error_total{instance="db1", job="db", type="mysql"} 67

A30: database_write_timeouts / on(instance, job, type) database_error_total

Q31: What is the purpose of the for attribute in a Prometheus alert rule? A31: Determines how long a rule must be true before firing an alert

Q32: Which query will give the sum of the sizes of all filesystems on the machine? The metric node_filesystem_size_bytes lists out all of the filesystems and their total size.

node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", mountpoint="/boot/efi"} 536834048
node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="192.168.1.168:9100", mountpoint="/"} 13268975616
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run"} 727924736
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/lock"} 5242880
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/snapd/ns"} 727924736
node_filesystem_size_bytes{device="tmpfs", fstype="tmpfs", instance="192.168.1.168:9100", mountpoint="/run/user/1000"} 727920640

A32: sum(node_filesystem_size_bytes{instance="192.168.1.168:9100"})

Q33: What are the 3 components of the Prometheus server? A33: retrieval, TSDB (time-series database), HTTP server

Q34: What selector will match on time series whose mountpoint label doesn’t start with /run?

node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node1", mountpoint="/boot/efi"}
node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", instance="node2", mountpoint="/boot/efi"}
node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node1", mountpoint="/"}
node_filesystem_avail_bytes{device="/dev/sda3", fstype="ext4", instance="node2", mountpoint="/"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/lock"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/snapd/ns"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node1", mountpoint="/run/user/1000"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/lock"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/snapd/ns"}
node_filesystem_avail_bytes{device="tmpfs", fstype="tmpfs", instance="node2", mountpoint="/run/user/1000"}

A34: node_filesystem_avail_bytes{mountpoint!~"/run.*"}

Q35: Which statement is true about the rate/irate functions? A35: rate() calculates average rate over entire interval, irate() calculates the rate only between the last two datapoints in an interval

Q36: What is the default path Prometheus will scrape to collect metrics? A36: /metrics

Q37: The following PromQL expression tries to divide node_filesystem_avail_bytes by node_filesystem_size_bytes: node_filesystem_avail_bytes / node_filesystem_size_bytes. The expression does not return any results; fix it so that it successfully divides the two metrics. This is what the two metrics look like before the division operation:

node_filesystem_avail_bytes{device="/dev/sda2", fstype="vfat", class="SSD", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}

node_filesystem_size_bytes{device="/dev/sda2", fstype="vfat", instance="192.168.1.168:9100", job="test", mountpoint="/boot/efi"}

A37: node_filesystem_avail_bytes / ignoring(class) node_filesystem_size_bytes

Q38: What are the 3 components of observability? A38: logging, metrics, traces

Q39: Which of the following statements are true regarding Alert labels and annotations?

route:
  receiver: staff
  group_by: ['severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  routes:
    - matchers:
        job: kubernetes
      receiver: infra
      group_by: ['severity']

A39: Alert labels can be used as metadata so alertmanager can match on them and perform routing policies, whereas annotations should be used for cosmetic descriptions of the alerts

Q40: The metric http_errors_total{code="404"} tracks the number of 404 errors a web server has seen. Which query returns the average rate of 404s the server has seen over the past 2 hours? Use a 2m sample range and a query interval of 1m. A40: avg_over_time(rate(http_errors_total{code="404"}[2m]) [2h:1m]) [since we need the average for the past 2 hours, the first value in the subquery is 2h and the second number is the query interval]

Q41: Which query will return all time series for the metric node_network_transmit_drop_total that are greater than 20 and less than 100? A41: node_network_transmit_drop_total > 20 and node_network_transmit_drop_total < 100

Q42: What does the following metric_relabel_config do?

scrape_configs:
  - job_name: example
    metric_relabel_configs:
      - source_labels: [datacenter]
        regex: (.*)
        action: replace
        target_label: location
        replacement: dc-$1

A42: changes the datacenter label to location and prepends the value with dc-

Q43: What type of data should Prometheus monitor? A43: numeric

Q44: Which type of observability would be used to track a request/transaction as it traverses a system? A44: traces

Q45: Add an annotation to the alert called description that will print out a message that looks like this: Instance <instance> has low disk space on filesystem <mountpoint>, current free space is at <value>%

groups:
  - name: node
    rules:
      - alert: node_filesystem_free_percent
        expr: 100 * node_filesystem_free_bytes{job="node"} / node_filesystem_size_bytes{job="node"} < 10

# Examples of the two metrics used in the alert can be seen below
# node_filesystem_free_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}
# node_filesystem_size_bytes{device="/dev/sda3", fstype="ext4", instance="node1", job="web", mountpoint="/home"}

# Choose the correct option:

#Option A:
description: Instance << $Labels.instance >> has low disk space on filesystem << $Labels.mountpoint >>, current free space is at << .Value >>%

#Option B:
description: Instance {{ .Labels.instance }} has low disk space on filesystem {{ .Labels.mountpoint }}, current free space is at {{ .Value }}%

#Option C:
description: Instance {{ .Labels=instance }} has low disk space on filesystem {{ .Labels=mountpoint }}, current free space is at {{ .Value }}%

#Option D:
description: Instance {{ .instance }} has low disk space on filesystem {{ .mountpoint }}, current free space is at {{ .Value }}%

A45: Option B

Q46: Regarding histogram and summary metrics, which of the following are true? A46: histogram quantiles are calculated server side and summary quantiles are calculated client side [for histograms, quantiles must be calculated server side, thus they are less taxing on client libraries, whereas summary metrics are the opposite]

Q47: What is this an example of? 'Service provider guaranteed 99.999% uptime each month or else the customer will be awarded $10k' A47: SLA

Q48: Which of the following is Prometheus’ built in dashboarding/visualization feature? A48: Console templates

Q49: Which query below will give the active bytes on instance 10.1.1.1:9100 45m ago? A49: node_memory_Active_bytes{instance="10.1.1.1:9100"} offset 45m

Q50: What type of metric should be used for measuring internal temperature of a server? A50: gauge

Q51: What is the name of the cli utility that comes with Prometheus? A51: promtool

Q52: How can alertmanager prevent certain alerts from generating notification for a temporary period of time? A52: Configuring a silence

Q53: In the scrape configs for a Pushgateway, what is the purpose of honor_labels: true?

scrape_configs:
  - job_name: pushgateway
    honor_labels: true
    static_configs:
      - targets: ["192.168.1.168:9091"]

A53: Allows metrics to specify the instance and job labels instead of pulling them from scrape_configs
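For context, a short-lived job typically pushes its metrics to the Pushgateway with something like the following (the metric name and the job/instance values are only examples); honor_labels: true then keeps those pushed job/instance labels intact when Prometheus scrapes the gateway:

echo "backup_duration_seconds 37.2" | curl --data-binary @- http://192.168.1.168:9091/metrics/job/db_backup/instance/db1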

Q54: Analyze the example Alertmanager config below and determine which receiver the alert will be sent to when an alert with the labels team: backend and severity: critical arrives at Alertmanager.

route:
  receiver: general-email
  routes:
    - receiver: frontend-email
      matchers:
        - team: frontend
      routes:
        - matchers:
            severity: critical
          receiver: frontend-pager
    - receiver: backend-email
      matchers:
        - team: backend
      routes:
        - matchers:
            severity: critical
          receiver: backend-pager
    - receiver: auth-email
      matchers:
        - team: auth
      routes:
        - matchers:
            severity: critical
          receiver: auth-pager

A54: backend-pager

Q55: Which of the following would make for a poor SLI? A55: high disk utilization [things like CPU, memory, and disk utilization are poor SLIs as the user may not experience any degradation of service during these events]

Q56: Which of the following is not a valid way to reload Prometheus configuration? A56: promtool config reload

Q57: Which of the following is not something that is tracked in a span within a trace? A57: complexity

Q58: You are writing your own exporter for a Redis database. Which of the following would be the correct name for a metric to represent memory used by the Redis instance? A58: redis_mem_used_bytes [the first part should be the app name, then the metric name, then the unit]

Q59: Which cli command can be used to verify/validate prometheus configurations? A59: promtool check config

Q60: Which query will return targets who have more than 50 arp entries? A60: node_arp_entries{job="node"} > 50

Mock Exam 2

Q1: What data type do Prometheus metric values use? A1: 64bit floats

Q2: The metric node_fan_speed_rpm tracks the current fan speeds. The location label specifies where on the server the fan is located. Which query will return the fan speeds for all fans except the rear fan? A2: node_fan_speed_rpm{location!="rear"}

Q3: With the following alertmanager configs, after a notification has been sent out, a new alert comes in. How long will alertmanager wait before firing a new notification?

route:
  receiver: staff
  group_by: ['severity']
  group_wait: 60s
  group_interval: 15m
  repeat_interval: 12h
  routes:
    - matchers:
        job: kubernetes
      receiver: infra
      group_by: ['severity']

A3: 15m [the group_interval property determines how long Alertmanager will wait after sending a notification before it sends a new notification for a group]

Q4: What is the purpose of Prometheus scrape_interval? A4: defines how frequently to scrape a target

Q5: The metric http_requests tracks the total number of requests across each endpoint and method. What query will return the total number of requests for each path?

http_requests{method="get", path="/auth"} 3
http_requests{method="post", path="/auth"} 1
http_requests{method="get", path="/user"} 4
http_requests{method="post", path="/user"} 8
http_requests{method="post", path="/upload"} 2
http_requests{method="get", path="/tasks"} 4
http_requests{method="put", path="/tasks"} 6
http_requests{method="post", path="/tasks"} 1
http_requests{method="get", path="/admin"} 3
http_requests{method="post", path="/admin"} 9

A5: sum by(path) (http_requests)

Q6: An application is advertising metrics at the path /monitoring/stats. What property in the scrape configs needs to be modified? A6: metrics_path: "/monitoring/stats"

Q7: Analyze the Alertmanager config below. Based on the alert below, which receiver will send the notification for the alert? Alert labels: team: frontend

route:
  group_wait: 20s
  receiver: general
  group_by: ['alertname']
  routes:
    - match:
        org: kodekloud
      receiver: kodekloud-pager
    - match:
        org: apple
      receiver: apple

A7: general

Q8: What type of database does Prometheus use? A8: Time-series database

Q9: Which of the following is Prometheus’ built in dashboarding/visualization feature? A9: Console templates

Q10: What command should be used to verify that a Prometheus config is valid? A10: promtool check config prometheus.yml

Q11: What type of data should prometheus monitor? A11: numeric

Q12: What is the default port that Prometheus listens on? A12: 9090

Q13: A car reports the number of miles it has been driven with the metric car_total_miles. Which query returns the average rate of miles the car has driven over the past 2 hours? Use a 4m sample range and a query interval of 1m. A13: avg_over_time(rate(car_total_miles[4m]) [2h:1m])

Q14: Which of the following statements are true regarding alert labels and annotations? A14: Alert labels can be used as metadata so Alertmanager can match on them and perform routing policies; annotations should be used for cosmetic descriptions of the alerts

Q15: What method does Prometheus use to collect metrics from targets? A15: pull

Q16: Which of the following is not a form of observability? A16: streams

Q17: How is application instrumentation achieved? A17: Client libraries

Q18: Which query below will give the 95% quantile of the metric http_file_upload_bytes? A18: histogram_quantile(0.95, http_file_upload_bytes_bucket)

Q19: What is this an example of: 99% availability with a median latency less than 300ms? A19: SLO

Q20: What is the default path Prometheus will scrape to collect metrics? A20: /metrics

Q21: Where are alert rules defined? A21: In a separate rules file on the Prometheus server

Q22: The kafka_topic_partition_replicas metric tracks the number of replicas for each topic/partition. Which query will get its values for the past 2 hours? The result should return a range vector. A22: kafka_topic_partition_replicas[2h]

Q23: The metric http_errors_total has 3 labels: path, method, error. Which of the following queries will give the total number of errors for a path of /auth, method of POST, and error code of 401? A23: http_errors_total{path="/auth", method="POST", code="401"}

Q24: What update needs to occur to add an annotation called description that prints out the message redis server <insert instance name> is down! A24: description: "redis server {{.Labels.instance}} is down!"

Q25: Which statement is true regarding Prometheus rules? A25: Groups are run in parallel, and rules within a group are run sequentially

Q26: What does the following config do?

scrape_configs:
  - job_name: "demo"
    metric_relabel_configs:
      - regex: fstype
        action: labeldrop

A26: The label fstype will be dropped for all metrics

Q27: The metric node_filesystem_avail_bytes reports the available bytes for each filesystem on a node. Which query will return all filesystems that have either less than 1000 available bytes or greater than 50000 available bytes? A27: node_filesystem_avail_bytes < 1000 or node_filesystem_avail_bytes > 50000

Q28: For metric_relabel_configs and relabel_configs, when matching on multiple source labels, what is the default delimiter A28: ;

Q29: Which of the following is not a valid method for reloading alertmanager configuration? A29: hit the reload config button in alertmanager web ui

Q30: Which of the following components is responsible for receiving metrics from short lived jobs? A30: pushgateway

Q31: For a histogram metric, what are the different submetrics? A31: _count, _bucket, _sum

Q32: Which query will return whether or not a target is currently able to be scraped? A32: up

Q33: What does the double underscore __ before a label name signify? A33: The label is a reserved label

Q34: Which configuration in alertmanager will wait 2 minutes before firing off an alert to prevent unnecessary notifications getting sent? A34: group_wait: 2m [when an alert arrives at Alertmanager, it will wait for the amount of time specified in group_wait for other alerts to arrive before firing off a notification]

Q35: Which of the following is not a component of the Prometheus solution? A35: influxdb

Q36: Which component of the Prometheus architecture should be used to automatically discover all nodes in a Kubernetes cluster? A36: service discovery

Q37: The metric mealplanner_consumed_calories tracks the number of calories that have been consumed by the user. What query will return the amount of calories that had been consumed 4 days ago? A37: mealplanner_consumed_calories offset 4d

Q38: Which of the following would make for a good SLI? A38: request failures [for good SLIs, use metrics that impact the user's experience. Disk utilization, memory utilization, fan speed, and server temperature are not things that impact the user, whereas request failures certainly will]

Q39: What does the following config do?

scrape_configs:
  - job_name: "demo"
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: docker_container_crash_total
        action: replace
        target_label: __name__
        replacement: docker_container_restart_total

A39: Renames the metric docker_container_crash_total to docker_container_restart_total

Q40: What type of metric should be used to track the number of miles a car has driven? A40: counter

Q41: What type of metric should be used for measuring a user's heart rate? A41: gauge

Q42: What is the purpose of repeat_interval in alertmanager? A42: How long to wait before sending a notification again if it has already been sent successfully for an alert

Q43: Which of the following components is responsible for collecting metrics from an instance and exposing them in a format Prometheus expects? A43: exporters

Q44: What are the two attributes that metrics can have? A44: TYPE, HELP

Q45: What query will return all the instances whose active memory bytes is less than 10000? A45: node_memory_Active_bytes < 10000

Q46: How many labels does the following time series have node_fan_speed{instance="node8", job="server", fan="2"}? A46: 3

Q47: In the prometheus configuration, what is the purpose of the scheme field? A47: Determines if Prometheus will use HTTP or HTTPS

Q48: The metric health_consumed_calories tracks how many calories a user has eaten and health_burned_calories tracks the number of calories burned while exercising. To calculate net calories for the day subtract health_burned_calories from health_consumed_calories. Based on the time series below, which expression successfully calculates net calories.

health_consumed_calories{job="health", meal="dinner"} 800
health_burned_calories{job="health", activity="cardio"} 200

A48: health_consumed_calories - ignoring(meal, activity) health_burned_calories

Q49: What does the following config do?

scrape_configs:
 - job_name: example
   relabel_configs:
    - source_labels: [env, team]
      regex: dev;marketing
      action: drop

A49: Drops all targets whose env label is set to dev and team label is set to marketing

Q50: What is the name of the Prometheus query language? A50: PromQL

Q51: You are writing an exporter for RabbitMQ and are creating a metric to track the size of the message queue. Which of the following would be an appropriate name for the metric? A51: rabbitmq_message_bytes

Q52: What are the different states a Prometheus alert can be in? A52: inactive, pending, firing

Q53: Which statement is true about the rate/irate functions? A53: rate() calculates average rate over entire interval, irate() calculates the rate only between the last two datapoints in an interval

Q54: What does the following config do?

scrape_configs:
  - job_name: "example"
    metric_relabel_configs:
      - source_labels: [team]
        regex: (.*)
        action: replace
        target_label: organization
        replacement: org-$1

A54: renames the team label to organization and the value of the label will get prepended with org-

Q55: Analyze the Alertmanager config below. Based on the following alert, which receiver will receive the notification? alertname: node_filesystem_full, labels: team: frontend, notification: pager

route:
  receiver: general-email
  group_by: [alertname]
  routes:
    - receiver: frontend-email
      matchers:
        - team: frontend
      routes:
        - matchers:
            notification: pager
          receiver: frontend-pager
    - receiver: backend-email
      matchers:
        - team: backend
    - receiver: auth-email
      matchers:
        - team: auth

A55: frontend-pager

Q56: A database backup service has an SLO that states that 97% of all backup jobs will be completed within 60s. A histogram metric is configured to track the backup process time, which of the following bucket configurations is recommended for the desired SLO? A56: 35, 45, 55, 60, 65, 75, 100 [Since histogram quantiles are approximations, to find out if a SLO has been met, make sure that a bucket is specified at the desired SLO value of 60s. The exact number (60s) must be present in the list.]

Q57: Which of the following is not a valid time value to be used in a range selector? A57: 3hr

Q58: What type of data does Prometheus collect? A58: numeric

Q59: The node_cpu_seconds_total metric tracks the number of seconds cpu has spent in a specific mode. The metric will break it down per cpu using the cpu label. Which query will return the total time all cpus on an instance spent in a mode that is not idle. Make sure to group the result on a per instance basis.

node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="idle"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="iowait"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="irq"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="nice"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="softirq"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="steal"}
node_cpu_seconds_total{cpu="0", instance="192.168.1.168:9100", job="test", mode="system"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="iowait"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="irq"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="nice"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="softirq"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="steal"}
node_cpu_seconds_total{cpu="1", instance="192.168.1.168:9100", job="test", mode="system"}

A59: sum by(instance) (node_cpu_seconds_total{mode!="idle"})

Q60: The following time series return values with a lot of decimal places. What query will return values rounded down to the closest integer? node_cpu_seconds_total {cpu="0", mode="idle"} 115.12 {cpu="0", mode="irq"} 87.4482 {cpu="0", mode="steal"} 44.245 A60: floor(node_cpu_seconds_total)

Prometheus Certified Associate (PCA)

Curriculum

  1. 28% PromQL
  • Selecting Data
  • Rates and Derivatives
  • Aggregating over time
  • Aggregating over dimensions
  • Binary operators
  • Histograms
  • Timestamp Metrics
  2. 20% Prometheus Fundamentals
  • System Architecture
  • Configuration and Scraping
  • Understanding Prometheus Limitations
  • Data Model and Labels
  • Exposition Format
  3. 18% Observability Concepts
  • Metrics
  • Understand logs and events
  • Tracing and Spans
  • Push vs Pull
  • Service Discovery
  • Basics of SLOs, SLAs, and SLIs
  4. 18% Alerting & Dashboarding
  • Dashboarding basics
  • Configuring Alerting rules
  • Understand and Use Alertmanager
  • Alerting basics (when, what, and why)
  5. 16% Instrumentation & Exporters
  • Client Libraries
  • Instrumentation
  • Exporters
  • Structuring and naming metrics

Observability Fundamentals

Observability

  • the ability to understand and measure the state of a system based on data generated by the system
  • allows you to generate actionable outputs from unexpected scenarios
  • to better understand the internals of your system
  • greater need for observability in distributed systems & microservices
  • troubleshooting - e.g. why are error rates high?
  • 3 pillars of observability:
    1. Logs - records of events that have occurred and encapsulate info about the specific event
    2. Metrics - numerical value/information about the state, data can be aggregated over time, contains name, value, timestamp, dimensions
    3. Traces - follow operations (trace-id) as they travel through different hops, spans are events forming a trace
  • Prometheus only handles metrics, not logs or traces!

SLO/SLA/SLI

a. SLI (service level indicators) = quantitative measure of some aspect of the level of service provided (availability, latency, error rate etc.)

  • not all metrics make for good SLIs, you want to find metrics that accurately measure a user's experience
  • high CPU, high memory are poor SLIs as they don't necessarily affect user's experience

b. SLO (service level objectives) = target value or range for an SLI

  • examples:
    • SLI = Latency
    • SLO = Latency < 100ms
    • SLI = Availability
    • SLO = 99.99% uptime
  • should be directly related to the customer experience
  • purpose is to quantify reliability of a product to a customer
  • may be tempted to set unnecessarily aggressive values
  • goal is not to achieve perfection, but make customers happy

c. SLA (service level agreement) = contract between a vendor and a user that guarantees SLO

Prometheus Fundamentals

  • open source monitoring tool that collects metrics data and provides tools to visualize the data
  • use cases:
    • collect metrics from different locations (e.g. like West DC, central DC, East DC, AWS etc.)
    • detect high memory on the host running the MySQL db and notify the operations team via email
    • find out at which uploaded video length the application starts to degrade
  • allows you to generate alerts when thresholds are reached
  • collects data by scraping targets that expose metrics through an HTTP endpoint
  • stored in time series db and can be queried with built-in PromQL (Prometheus Query Language)
  • what can it monitor:
    • CPU/memory
    • disk space
    • service uptime
    • app specific data - number of exceptions, latency, pending requests
    • networking devices, databases etc.
  • exclusively monitor numeric time-series data!
  • does not monitor events, system logs, traces!
  • originally sponsored by SoundCloud
  • written in Go

Prometheus Architecture

  • 3 core components:
    • Retrieval (scrapes metric data)
    • TSDB (time-series database stores metric data)
    • HTTP server (accepts PromQL query)
  • lots of other components making up the whole solution:
    • exporters (mini-processes running on the targets) that the retrieval component pulls the metrics from
    • pushgateway (short-lived jobs push their data to it and Prometheus then retrieves it from there)
    • service discovery is all about providing a list of targets so you don't have to hardcode those values
    • alertmanager handles all of the emails, SMS, Slack etc. after alerts are pushed to it
    • Prometheus Web UI or Grafana etc.
  • collects by sending HTTP request to /metrics endpoint of each target, path can be changed via metrics_path
  • several native exporters:
    • node-exporters (Linux)
    • Windows
    • MySQL
    • Apache
    • HAProxy
    • client libraries to monitor application metrics (# of errors/exceptions, latency, job execution duration) for Go, Java, Python, Ruby, Rust
  • Pull based approach is better, because:
    • easier to tell if the target is down
    • does not DDoS the metrics server
    • definitive list of targets to monitor (central source of truth)
  • By default, the Prometheus server will use port 9090.

Prometheus Installation

  1. Download *.tar from http://prometheus.io/download
  2. untarred folder contains console_libraries, consoles, prometheus (binary), prometheus.yml (config) and promtool (CLI utility) + docs
  3. Run ./prometheus - does it work?
  4. Open http://localhost:9090 - does it work?
  5. Execute the query up in the console to see the one target (itself) - should work OK, so we can turn it into a systemd service now:
    1. Create a new/separate user: sudo useradd --no-create-home --shell /bin/false prometheus
    2. Create a config folder: sudo mkdir /etc/prometheus
    3. Create folder /var/lib/prometheus for the data
    4. Move executables: sudo cp prometheus /usr/local/bin ; sudo cp promtool /usr/local/bin
    5. Move config file: sudo cp prometheus.yml /etc/prometheus/
    6. Copy the consoles folder: sudo cp -r consoles /etc/prometheus/ ; sudo cp -r console_libraries /etc/prometheus/
    7. Change owner for these folders & executables: sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
    8. The command (ExecStart) in the service file will then look like this (the unit runs it as User=prometheus): /usr/local/bin/prometheus --config.file /etc/prometheus/prometheus.yml --storage.tsdb.path /var/lib/prometheus --web.console.templates /etc/prometheus/consoles --web.console.libraries /etc/prometheus/console_libraries
    9. Create a service file with this information at /etc/systemd/system/prometheus.service (a minimal sketch of the unit file follows after these steps) and reload: sudo systemctl daemon-reload
    10. Start the daemon sudo systemctl start prometheus ; sudo systemctl enable prometheus
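  A minimal sketch of such a systemd unit file (paths match the steps above; adjust as needed):

    # /etc/systemd/system/prometheus.service
    [Unit]
    Description=Prometheus
    Wants=network-online.target
    After=network-online.target

    [Service]
    User=prometheus
    Group=prometheus
    Type=simple
    ExecStart=/usr/local/bin/prometheus \
      --config.file /etc/prometheus/prometheus.yml \
      --storage.tsdb.path /var/lib/prometheus \
      --web.console.templates /etc/prometheus/consoles \
      --web.console.libraries /etc/prometheus/console_libraries

    [Install]
    WantedBy=multi-user.target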

Node exporter

  • Download *.tar from http://prometheus.io/download
  • untarred folder contains basically just the binary node_exporter
  • The node_exporter listens on HTTP port 9100 by default
  • Run the ./node_exporter and then curl localhost:9100/metrics
  • Run in the background & start on boot using the systemd, very similar to Prometheus installation:
    sudo cp node_exporter /usr/local/bin
    sudo useradd --no-create-home --shell /bin/false node_exporter
    sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
    sudo vi /etc/systemd/system/node_exporter.service
    sudo systemctl daemon-reload
    sudo systemctl start node_exporter ; sudo systemctl enable node_exporter

Prometheus configuration

  • Sections:
    1. global - Default parameters, it can be overridden by the same variables in sub-sections
    2. scrape_configs - Define targets and job_name, which is a collection of instances that need to be scraped
    3. alerting - Alerting specifies settings related to the Alertmanager
    4. rule_files - specifies a list of globs; rules and alerts are read from all matching files
    5. remote_read & remote_write - Settings related to the remote read/write feature
    6. storage - Storage related settings that are runtime reloadable
  • Example config:
    scrape_configs:
      - job_name: 'nodes'               # call it whatever
        scrape_interval: 30s            # from the target every X seconds
        scrape_timeout: 3s              # timeouts after X seconds
        scheme: https                   # http or https
        metrics_path: /stats/metrics    # non-default path that you send requests to
        static_configs:
          - targets: ['10.231.1.2:9090', '192.168.43.9:9090'] # two IPs
        # basic_auth                    # this is the next section
  • To reload the config: sudo systemctl restart prometheus
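  • Alternatively, a running Prometheus can re-read its config without a full restart (note: the HTTP endpoint only works if Prometheus was started with --web.enable-lifecycle):
    sudo killall -HUP prometheus                  # send SIGHUP to the process
    curl -X POST http://localhost:9090/-/reload   # lifecycle API endpoint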

Encryption & Authentication

  • between the Prometheus server and the targets

Encryption

  1. On the targets, you need to generate the key & crt pair first - e.g.:
    • sudo openssl req -new -newkey rsa:2048 -days 465 -nodes -x509 -keyout node_exporter.key -out node_exporter.crt -subj "..." -addext "subjectAltName = DNS:localhost"
    • then target config will have to be customized after that:
      # /etc/node_exporter/config.yml
      tls_server_config:
        # Certificate and key files for server to use to authenticate to client
        cert_file: node_exporter.crt
        key_file: node_exporter.key
    • The exporter supports TLS via a new web configuration file: ./node_exporter --web.config=config.yml
    • Test with: curl -k https://localhost:9100/metrics
  2. On the server, you need:
    • copy the node_exporter.crt from the target to the Prometheus server
    • update the scheme to https in the prometheus.yml and add tls_config with ca_file (e.g. /etc/prometheus/node_exporter.crt that we copied in the previous step) and insecure_skip_verify if self-signed:
      # /etc/prometheus/prometheus.yaml
      scrape_configs:
        - job_name: "node"
          scheme: https
          tls_config:
            # Certificate and key files for client cert authentication to the server
            ca_file: /etc/prometheus/node_exporter.crt
            insecure_skip_verify: true
    • restart prometheus service

Authentication

  • Authentication is done via generated hash (sudo apt install apache2-utils or httpd-tools etc.) and then: htpasswd -nBC 12 "" | tr -d ':\n' (will prompt for password and spits out the hash)
  • add the basic_auth_users and username + generated hash underneath it:
    # /etc/node_exporter/config.yml
    basic_auth_users:
      prometheus: $2y$12$daXru320983rnofkwehj4039F
  • restart node_exporter service
  • update Prometheus server's config with the same auth and restart Prometheus:
    - job_name: "node"
      basic_auth:
        username: prometheus
        password: <PLAIN TEXT PASSWORD!>

Metrics

  • 3 properties:
    • name - general feature of a system to be measured, may contain ASCII letters, digits, underscores and colons ([a-zA-Z_:][a-zA-Z0-9_:]*); colons are reserved for recording rules. Metric names cannot start with a number. The name is technically a label (e.g. __name__=node_cpu_seconds_total)
    • {labels (key/value pairs)} - allow splitting up a metric by a specified criteria (e.g. multiple CPUs, specific HTTP methods, API endpoints etc.); metrics can have more than 1 label, and label names may contain ASCII letters, digits and underscores ([a-zA-Z0-9_]*). Labels surrounded by __ are considered internal to Prometheus. Every metric is assigned 2 labels by default (instance and job).
    • value of the metric
  • Example = node_cpu_seconds_total{cpu="0",mode="idle"} 258277.86: labels provide us information on which CPU this metric is for (cpu number zero)
  • when Prometheus scrapes a target and retrieves metrics, it also stores the time at which the metric was scraped
  • Example = 1668215300 (unix epoch timestamp, since Jan 1st 1970 UTC)
  • time series = stream of timestamped values sharing the same metric and set of labels
  • metrics have a TYPE (counter, gauge, histogram, summary) and HELP (description of what the metric is) attributes
  • explanation of each types:
    • counter can only go up, e.g. how many times did X happened?
    • gauge can go up or down, e.g. what is the current value of X?
    • histogram tells how long or how big something is, groups observations into configurable bucket sizes (e.g. cumulative response time buckets <1s, <0.5s, <0.2s)
      • e.g. request_latency_seconds_bucket{le="0.05"} 50 - buckets are cumulative (i.e. the le=0.05 bucket includes all requests that took less than 0.05s, which includes all requests that fall into the buckets below it (e.g. 0.03, 0.02, 0.01...))
      • e.g. to calculate the histogram's quantiles, we use histogram_quantile, an approximation of the value of a specific quantile: 75% of all requests have what latency? histogram_quantile(0.75, request_latency_seconds_bucket). To get an accurate value, make sure there is a bucket at the specific value that needs to be met. Every bucket you add slows Prometheus down a little!
    • summary is similar to histogram and tells us how many observations fell below X; buckets do not have to be defined ahead of time (similar to histogram, but percentages: response time 20% = <0.3s, 50% = <0.8s, 80% = <1s). Similarly to histogram, there will be _count and _sum metrics as well as quantiles like 0.7, 0.8, 0.9 (instead of buckets).
  • table - difference:
    histogram                                  | summary
    bucket sizes can be picked                 | quantiles must be defined ahead of time
    less taxing on client libraries            | more taxing on client libraries
    any quantile can be selected               | only quantiles predefined in the client can be used
    Prometheus server must calculate quantiles | very minimal server-side cost

Quiz:

Q1: How many total unique time series are there in this output?

node_arp_entries{instance="node1", job="node"} 200
node_arp_entries{instance="node2", job="node"} 150
node_cpu_seconds_total{cpu="0", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="iowait"}
node_cpu_seconds_total{cpu="0", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node1", mode="idle"}
node_cpu_seconds_total{cpu="1", instance="node2", mode="idle"}
node_memory_Active_bytes{instance="node1", job="node"} 419124
node_memory_Active_bytes{instance="node2", job="node"} 55589

A1: 9

Q2: What metric should be used to report the current memory utilization?

A2: gauge

Q3: What metric should be used to report the amount of time a process has been running?

A3: counter

Q4: Which of these is NOT a valid metric?

A4: 404_error_count

Q5: How many labels does the following time series have? http_errors_total{instance="1.1.1.1:80", job="api", code="400", endpoint="/user", method="post"} 55234

A5: 5

Q6: A web app is being built that allows users to upload pictures, management would like to be able to track the size of uploaded pictures and report back the number of photos that were less than 10Mb, 50Mb, 100MB, 500MB, and 1Gb. What metric would be best for this?

A6: histogram

Q7: What are the two labels every metric is assigned by default?

A7: instance, job

Q8: What are the 4 types of Prometheus metrics?

A8: counter, gauge, histogram, summary

Q9: What are the two attributes provided by a metric?

A9: Help, Type

Q10: For the metric http_requests_total{path="/auth", instance="node1", job="api"} 7782; What is the metric name?

A10: http_request_total

Q11: For the http_request_total metric, what is the query/metric name that would be used to get the count of total requests on node node01:3000?

A11: http_request_total_count{instance="node01:3000"}

Q12: Construct a query to return the total number of requests for the /events route with a latency of less than 0.4s across all nodes.

A12: http_request_total_bucket{route="/events",le="0.4"}

Q13: Construct a query to find out how many requests took somewhere between 0.08s and 0.1s on node node02:3000.

A13: ?
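A possible query (a sketch; assumes the http_request_total histogram from Q11/Q12 and that 0.08 and 0.1 are real bucket boundaries; buckets are cumulative, so subtract the lower bucket from the higher one):

http_request_total_bucket{instance="node02:3000", le="0.1"} - ignoring(le) http_request_total_bucket{instance="node02:3000", le="0.08"}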

Q14: Construct a query to calculate the rate of http requests that took less than 0.08s. Use a time window of 1m across all nodes.

A14: ?
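A possible answer, assuming the same http_request_total histogram and a 0.08 bucket boundary:

rate(http_request_total_bucket{le="0.08"}[1m])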

Q15: Construct a query to calculate the average latency of a request over the past 4 minutes. Use the formula below to calculate average latency of request: rate of sum-of-all-requests / rate of count-of-all-requests

A15: ?
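A sketch that applies the formula from the question to the _sum and _count submetrics of the assumed http_request_total histogram:

rate(http_request_total_sum[4m]) / rate(http_request_total_count[4m])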

Q16: Management would like to know what is the 95th percentile for the latency of requests going to node node01:3000. Construct a query to calculate the 95th percentile. A16: ?
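A sketch, again assuming the http_request_total histogram (a rate() over a range is often wrapped around the bucket selector as well, but no window is specified here):

histogram_quantile(0.95, http_request_total_bucket{instance="node01:3000"})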

Q17: The company is now offering customers an SLO stating that, 95% of all requests will be under 0.15s. What bucket size will need to be added to guarantee that the histogram_quantile function can accurately report whether or not that SLO has been met?

A17: 0.15

Q18: A summary metric http_upload_bytes has been added to track the amount of bytes uploaded per request. What are percentiles being reported by this metric?

  1. 0.02, 0.05, 0.08, 0.1, 0.13, 0.18, 0.21, 0.24, 0.3, 0.35, 0.4
  2. 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99
  3. events, tickets
  4. 200, 201, 400, 404

A18: ?

Expression browser

  • Web UI for Prometheus server to query data
  • up - returns which targets are in up state (you can see an instance and job and value on the right - 0 and 1)

Prometheus on Docker

  • Pull image prom/prometheus
  • Configure prometheus.yml
  • Expose ports, bind mounts
  • Run: docker run -d -v /path-to/prometheus.yml:/etc/prometheus/prometheus.yml -p 9090:9090 prom/prometheus

PromTools

  • check & validate configuration before applying (e.g before production)
  • prevent downtime while config issues are being identified
  • validate metrics passed to it are correctly formatted
  • can perform queries on a Prom server
  • debugging & profiling a Prom server
  • perform unit tests against Recording/Alerting rules
  • To check/validate config, run: promtool check config /etc/prometheus/prometheus.yml
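  • A few example invocations covering the points above (file names are only illustrative):
    promtool check rules /etc/prometheus/rules.yml            # validate a recording/alerting rules file
    promtool test rules tests.yml                             # unit test recording/alerting rules
    curl -s localhost:9100/metrics | promtool check metrics   # validate exposition format of metrics
    promtool query instant http://localhost:9090 'up'         # ad-hoc query against a running server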

Container metrics

  • metrics can be scraped from containerized envs

Docker engine metrics (how much CPU does Docker use etc. not metrics specific to a container!)

  • vi /etc/docker/daemon.json:
    {
      "metrics-addr": "127.0.0.1:9323",
      "experimental": true
    }
  • sudo systemctl restart docker
  • curl localhost:9323/metrics
  • Prometheus job update:
    scrape_configs:
      - job_name: "docker"
        static_configs:
          - targets: ["12.1.13.4:9323"]

cAdvisor (how much memory does each container use? container uptime? etc.)

  • vi docker-compose.yml to pull gcr.io/cadvisor/cadvisor
  • docker-compose up or docker compose up
  • curl localhost:8080/metrics
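  • A minimal docker-compose.yml sketch for the above (the volume mounts shown are the ones cAdvisor typically needs; adjust to your environment):
    version: "3"
    services:
      cadvisor:
        image: gcr.io/cadvisor/cadvisor
        ports:
          - 8080:8080
        volumes:
          - /:/rootfs:ro
          - /var/run:/var/run:ro
          - /sys:/sys:ro
          - /var/lib/docker/:/var/lib/docker:ro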

PromQL

  • short for Prometheus Query Language
  • data returned can be visualized in dashboards
  • used to build alerting rules to notify about thresholds

Data Types

  1. String (currently unused)
  2. Scalar - numeric floating point value (e.g. 54.743)
  3. Instant vector - set of time series containing a single sample for each time series sharing the same timestamp (e.g. node_cpu_seconds_total finds all unique labels and a value for each, and they will all be at a single point in time)
  4. Range vector - set of time series containing a range of data points over time for each time series (e.g. node_cpu_seconds_total[3m] finds all unique labels, but all values and timestamps from the past 3 minutes)

Selectors

  • if we only want to return a subset of time series for a metric = label matchers:
    • exact match = (e.g. node_filesystem_avail_bytes{instance="node1"})
    • negative equality != (e.g. node_filesystem_avail_bytes{device!="tmpfs"})
    • regular expression =~ (e.g. starts with /dev/sda - node_filesystem_avail_bytes{device=~"/dev/sda.*"})
    • negative regular expression !~ (e.g. mountpoint does not start with /boot - node_filesystem_avail_bytes{mountpoint!~"/boot.*"})
  • we can combine multiple selectors with comma ,: (e.g. node_filesystem_avail_bytes{instance="node1",device!="tmpfs"})

Modifiers

  • to get historic data, use an offset modifier after the label matching (e.g. get value 5 minutes ago - node_memory_active_bytes{instance="node1"} offset 5m)
  • to get to the exact point in time (e.g. get value on September 15 - node_memory_active_bytes{instance="node1"} @1663265188)
  • you can use both modifiers and order does not matter (e.g. @1663265188 offset 5m = offset 5m @1663265188)
  • you can also add range vectors (e.g. get 2 minutes worth of data 10 minutes before September 15 [2m] @1663265188 offset 5m)

Operators

  • between instant vectors and scalars
  • types:
    1. Arithmetic +, -, *, /, %, ^ (e.g. node_memory_Active_bytes / 1024 - but it drops the metric name in the output as it is no longer the original metric!)
    2. Comparison ==, !=, >, <, >=, <=, bool (e.g. node_network_flags > 100, node_network_receive_packets_total >= 220, node_filesystem_avail_bytes < bool 1000 returns 0 or 1, mostly for generating alerts)
    3. Logical OR, AND, UNLESS (e.g. node_filesystem_avail_bytes > 1000 and node_filesystem_avail_bytes < 3000). Unless operator results in a vector consisting of elements on the left side for which there are no elements on the right side (e.g. return all vectors greater than 1000 unless they are greater than 30000 node_filesystem_avail_bytes > 1000 unless node_filesystem_avail_bytes > 30000)
    4. more than one operator follows the order of precedence from highest to lowest, while operators on the same precedence level are performed from the left (e.g. 2 * 3 % 2 = (2 * 3) % 2); power, however, is performed from the right (e.g. 2 ^ 3 ^ 2 = 2 ^ (3 ^ 2)):
      high:  ^
             *, /, %, atan2
             +, -
             ==, !=, <=, <, >=, >
             and, unless
      low:   or

Quiz

Q1: Construct a query to return all filesystems that have over 1000 bytes available on all instances under web job.

A1: node_filesystem_avail_bytes{job="web"} > 1000

Q2: Which of the following queries you will use for loadbalancer:9100 host to return all the interfaces that have received less than or equal to 10000 bytes of traffic?

A2: node_network_receive_bytes_total{instance="loadbalancer:9100"} <= 10000

Q3: node_filesystem_files tracks the filesystem's total file nodes. Construct a query that only returns time series greater than 500000 and less than 10000000 across all jobs

A3: node_filesystem_files > 500000 and node_filesystem_files < 10000000

Q4: The metric node_filesystem_avail_bytes lists the available bytes for all filesystems, and the metric node_filesystem_size_bytes lists the total size of all filesystems. Run each metric and see their outputs. There are three properties/labels these will return: device, fstype, and mountpoint. Which of the following queries will show the percentage of free disk space for all filesystems on all the targets under web job whose device label does not match tmpfs?

A4: node_filesystem_avail_bytes{job="web", device!="tmpfs"}*100 / node_filesystem_size_bytes{job="web", device!="tmpfs"}

Vector matching

  • between 2 instant vectors (e.g. to get the percentage of free space node_filesystem_avail_bytes / node_filesystem_size_bytes * 100 )
  • samples with exactly the same labels get matched together (e.g. instance and job and mountpoint must be the same to get a match) - every element in the vector on the left tries to find a single matching element on the right
  • to perform operation on 2 vectors with differing labels like http_errors code="500", code="501", code="404", method="put" etc. use the ignoring keyword (e.g. http_errors{code="500"} / ignoring(code) http_requests)
  • if the entries with e.g. methods put and del have no match in both metrics http_errors and http_requests, they will not show up in the results!
  • to get results on all labels to match on, we use the on keyword (e.g. http_errors{code="500"} / on(method) http_requests)
  • table - matching:
    vector1             + vector2            = resulting vector
    {cpu=0,mode=idle}     {cpu=1,mode=steal}   {cpu=0}
    {cpu=1,mode=iowait}   {cpu=2,mode=user}    {cpu=1}
    {cpu=2,mode=user}     {cpu=0,mode=idle}    {cpu=2}
  • Resulting vector will have matching elements with all labels listed in on() or all labels not listed in ignoring(): e.g. vector1{} + on(cpu) vector2{} or vector1{} + ignoring(mode) vector2{}
  • Another example is: http_errors_total / ignoring(error) http_requests_total = http_errors_total / on(instance, job, path) http_requests_total

Quiz

Q1: Which of the following queries can be used to track the total number of seconds cpu has spent in user + system mode for instance loadbalancer:9100?

A1: node_cpu_seconds_total{instance="loadbalancer:9100", mode="user"} + ignoring(mode) node_cpu_seconds_total{instance="loadbalancer:9100", mode="system"}

Q2: Construct a query that will find out what percentage of time each cpu on each instance was spent in mode user. To calculate the percentage in mode user, get the total seconds spent in mode user and divide that by the sum of the time spent across all modes. Further, multiply that result by 100 to get a percentage.

A2: node_cpu_seconds_total{mode="user"}*100 / ignoring(mode, job) sum by(instance, cpu) (node_cpu_seconds_total)

Many-to-one vector matching

  • when you get error executing the query multiple matches for labels: many-to-one matching must be explicit (group_left/group_right)
  • it is where each vector elements on the one side can match with multiple elements on the many side (e.g. http_errors + on(path) group_left http_requests) - group_left tells PromQL that elements from the right side are now matched with multiple elements from the left (group_right is the opposite of that - depending on which side is the many and which side is one)
| many                      | one            | many + one (result)       |
|---------------------------|----------------|---------------------------|
| {error=400,path=/cats} 2  | {path=/cats} 2 | {error=400,path=/cats} 4  |
| {error=500,path=/cats} 5  | {path=/cats} 2 | {error=500,path=/cats} 7  |
| {error=400,path=/dogs} 1  | {path=/dogs} 7 | {error=400,path=/dogs} 8  |
| {error=500,path=/dogs} 7  | {path=/dogs} 7 | {error=500,path=/dogs} 14 |

Quiz

Q1: The api job collects metrics on an API used for uploading files. The API has 3 endpoints /images, /videos and /songs, which are used to upload respective file types. The API provides 2 metrics to track: http_uploaded_bytes_total - tracks the number of uploaded bytes and http_upload_failed_bytes_total - tracks the number of bytes failed to upload. Construct a query to calculate the percentage of bytes that failed for each endpoint. The formula for the same is http_upload_failed_bytes_total*100 / http_uploaded_bytes_total.

A1: http_upload_failed_bytes_total*100 / ignoring(error) group_left http_uploaded_bytes_total

Aggregation operators

  • allow you to take an instant vector and aggregate its elements, resulting in a new instant vector with fewer elements
  • sum, min, max, avg, group, stddev, stdvar, count, count_values, bottomk, topk, quantile
  • for example sum(http_requests), max(http_requests)
  • by keyword allows you to choose which labels to aggregate along (e.g. sum by(path) (http_requests), sum by(method) (http_requests), sum by(instance) (http_requests), sum by(instance, method) (http_requests))
  • without keyword does the opposite of by and tells the query which labels not to include in aggregation (e.g. sum without(cpu, mode) (node_cpu_seconds_total))

Quiz

Q1: On loadbalancer:9100 instance, calculate the sum of the size of all filesystems. The metric to get filesystem size is node_filesystem_size_bytes

A1: sum(node_filesystem_size_bytes{instance="loadbalancer:9100"})

Q2: Construct a query to find how many CPUs instance loadbalancer:9100 have. You can use the node_cpu_seconds_total metric to find out the same.

A2: count(sum by (cpu) (node_cpu_seconds_total{instance="loadbalancer:9100"}))

Q3: Construct a query that will show the number of CPUs on each instance across all jobs.

A3: count by(instance) (sum by(instance, cpu) (node_cpu_seconds_total))

Q4: Use the node_network_receive_bytes_total metric to calculate the sum of the total received bytes across all interfaces on per instance basis

A4: sum by(instance)(node_network_receive_bytes_total)

Q5: Which of the following queries will be used to calculate the average packet size for each instance?

A5: sum by(instance)(node_network_receive_bytes_total) / sum by(instance)(node_network_receive_packets_total)

Functions

  • sorting, math, label transformations, metric manipulation
  • use the round function to round the query's result to the nearest integer value
  • round up to the nearest integer: ceil(node_cpu_seconds_total)
  • round down: floor(node_cpu_seconds_total)
  • absolute value for negative numbers: abs(1-node_cpu_seconds_total)
  • date & time: time(), minute() etc.
  • vector function takes a scalar value and converts it into an instant vector: vector(4)
  • scalar function returns the value of the single element as a scalar (otherwise returns NaN if the input vector does not have exactly one element): scalar(process_start_time_seconds)
  • sorting: sort (ascending) and sort_desc (descending)
  • rate at which a counter metric increases: rate and irate (e.g. group data points into 60s windows, take the last value minus the first value in each window and divide by 60: rate(http_errors[1m]); irate is similar, but only uses the last and second-to-last data points in the window: irate(http_errors[1m]))
  • table - difference:
| rate                                                    | irate                                                       |
|---------------------------------------------------------|-------------------------------------------------------------|
| looks at the first and last data points within a range  | looks at the last two data points within a range            |
| effectively an average rate over the range              | instant rate                                                |
| best for slow-moving counters and alerting rules        | should be used for graphing volatile, fast-moving counters  |

Notes:

  • make sure there are at least 4 samples within the time range (e.g. a 15s scrape interval with a 60s window gives 4 samples)
  • when combining rate with an aggregation operator, always take rate() first, then aggregate (so it can detect counter resets) - see the example after this list
  • to get the rate of increase of the sum of latency across all requests: rate(requests_latency_seconds_sum[1m])
  • to calculate the average latency of a request over the past 5m: rate(requests_latency_seconds_sum[5m]) / rate(requests_latency_seconds_count[5m])
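
A quick PromQL illustration of the rate-then-aggregate rule above (the metric and window are chosen for illustration only):

    # take rate() per time series first, then aggregate the per-series rates
    sum by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))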

Quiz

Q1: Management wants to keep track of the rate of bytes received by each instance. Each instance has two interfaces, so the rate of traffic being received on them must be summed up. Calculate the rate of received node_network_receive_bytes_total using 2-minute window, sum the rates across all interfaces, and group the results by instance.

A1: sum by(instance) (rate(node_network_receive_bytes_total[2m]))

Subquery

  • Syntax: <instant_query> [<range>:<resolution>] [offset <duration>]
  • Example: rate(http_requests_total[1m]) [5m:30s] - where the sample range is 1m, the query range covers data from the last 5m, and the query step for the subquery is 30s (the gap between evaluation points)
  • maximum value over a 10min of a gauge metric (max_over_time(node_filesystem_avail_bytes[10m]))
  • for counter metrics, we need to find the max value of the rate over the past 5min (e.g. the maximum rate of requests over the last 5 minutes with a 30s query step and a sample range of 1m: max_over_time(rate(http_requests_total[1m]) [5m:30s]))

Quiz

Q1: There were reports of a small outage of an application in the past few minutes, and some alerts pointed to potential high iowait on the CPUs. We need to calculate when the iowait rate was the highest over the past 10 minutes. [Construct a subquery that will calculate the rate at which all cpus spent in iowait mode using a 1 minute time window for the rate function. Find the max value of this result over the past 10 minutes using a 30s query step for the subquery.]

A1: max_over_time(sum(rate(node_cpu_seconds_total{mode="iowait"}[1m]))[10m:30s]) (sum across CPUs; drop the sum() if a per-CPU maximum is wanted)

Q2: Construct a query to calculate the average over time (avg_over_time) rate of http_requests_total over the past 20m using 1m query step.

A2: avg_over_time(rate(http_requests_total[1m])[20m:1m]) (a 1m rate window is assumed, since the question only specifies the 20m range and 1m step)

Recording rules

  • allow Prometheus to periodically evaluate PromQL expressions and store the resulting time series generated by them
  • speeding up your dashboards
  • provide aggregated results for use elsewhere
  • recording rules go in a separate file called a rule file:
    global: ...
    rule_files:
      - rules.yml # globs can be used here, like /etc/prometheus/rule_files.d/*.yml
    scrape_configs: ...
  • Prometheus must be restarted or its configuration reloaded (e.g. kill -HUP <pid>, or an HTTP POST to /-/reload if --web.enable-lifecycle is enabled) for this change to take effect
  • syntax of the rules.yml file:
    groups: # groups running in parallel
      - name: <group_name_1>
        interval: <evaluation interval, global by default>
        rules: # however, rules evaluated sequentially
          - record: <rule_name_1>
            expr: <promql_expression_1>
            labels:
              <label_name>: <label_value>
          - record: <rule_name_2> # you can also reference previous rule(s)
            expr: <promql_expression_1>
            labels:
      - name: <group_name_2>
        ...
  • example of the rules.yml file:
    groups:
      - name: example1 # it will show up in the WebGui under "status" - "rules"
        interval: 15s
        rules:
          - record: node_memory_memFree_percent
            expr: 100 * node_memory_MemFree_bytes / node_memory_MemTotal_bytes
          - record: node_filesystem_free_percent
            expr: 100 * node_filesystem_free_bytes / node_filesystem_size_bytes
  • best practices for rule naming: aggregation_level:metric_name:operations, e.g. we have a http_errors counter with two instrumentation labels "method" and "path". All the rules for a specific job should be contained in a single group. It will look like:
    - record: job_method_path:http_errors:rate5m
      expr: sum without(instance) (rate(http_errors{job="api"}[5m]))

HTTP API

  • execute queries, gather information on alerts, rules, service discovery related configs
  • send the POST request to http://<prometheus_ip>/api/v1/query
  • example: curl http://<prometheus_ip>:9090/api/v1/query --data 'query=node_arp_entries{instance="192.168.1.168:9100"}'
  • query at a specific time, just add another --data 'time=169386192'
  • response back as JSON
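
For an instant query, the JSON response has the following general shape (the values shown here are illustrative):

    {
      "status": "success",
      "data": {
        "resultType": "vector",
        "result": [
          {
            "metric": { "__name__": "node_arp_entries", "instance": "192.168.1.168:9100", "job": "node" },
            "value": [ 1693861920, "5" ]
          }
        ]
      }
    }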

Dashboarding & Visualization

  • several different ways:
    • expression browser with graph tab (built-in)
    • console templates (built-in)
    • 3rd party like Grafana
  • expression browser has limited functionality, only for ad-hoc queries and quick debugging, cannot create custom dashboards, not good for day-to-day monitoring, but can at least have multiple panels and compare graphs

Console Templates

  • allow to create custom HTML pages using Go templating language (typically between {{ and }})
  • Prometheus metrics, queries and charts can be embedded in the templates
  • ls /etc/prometheus/consoles to see the example *.html templates (to view one, go to http://localhost:9090/consoles/index.html.example)
  • boilerplate will typically contain:
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Memory details</h1>
    active memory: {{ template "prom_query_drilldown" (args "node_memory_Active_bytes") }}
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
  • an example of inserting a chart:
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Memory details</h1>
    active memory: {{ template "prom_query_drilldown" (args "node_memory_Active_bytes") }}
    <div id="graph"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#graph"),
    expr: "rate(node_memory_Active_bytes[2m])"
    })
    </script>
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}
  • another example with memory/cpu graphs:
    {{ template "head" . }}
    {{ template "prom_content_head" . }}
    <h1>Node Stats</h1>
    <h3>Memory</h3>
    <strong>Memory utilization:</strong> {{ template "prom_query_drilldown" (args "100- (node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes*100)") }}
    <br/>
    <strong>Memory Size:</strong> {{ template "prom_query_drilldown" (args "node_memory_MemTotal_bytes/1000000" "Mb") }}
    <h3>CPU</h3>
    <strong>CPU Count:</strong> {{ template "prom_query_drilldown" (args "count(node_cpu_seconds_total{mode='idle'})") }}
    <br/>
    <strong>CPU Utilization:</strong> {{ template "prom_query_drilldown" (args "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/56") }}
    <!--
    Expression explanation: the expression takes the current rate of all CPU modes except idle (idle means the CPU isn't being used), sums them up and multiplies by 100 to get a percentage. The final number is divided by 56 (if this server/node has 56 CPUs; adjust this divisor to the node's CPU count to get per-CPU utilization).
    -->
    <div id="cpu"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#cpu"),
    expr: "sum(rate(node_cpu_seconds_total{mode!='idle'}[2m]))*100/2",
    })
    </script>
    <h3>Network</h3>
    <div id="network"></div>
    <script>
    new PromConsole.Graph({
    node: document.querySelector("#network"),
    expr: "rate(node_network_receive_bytes_total[2m])",
    })
    </script>
    {{ template "prom_content_tail" . }}
    {{ template "tail" . }}

Application Instrumentation

  • the Prometheus client libraries provide an easy way to add instrumentation to your code in order to track and expose metrics for Prometheus
  • they do 2 things:
    • Track metrics in the Prometheus expected format
    • Expose metrics via /metrics path so they can be scraped
  • official and unofficial libraries
  • Example for Python:
    • You have an existing API in Flask, run pip install prometheus_client
    • In your code, import it: from prometheus_client import Counter
    • Initialize counter object: REQUESTS = Counter('http_requests_total', 'Total number of requests')
    • When do we want to increment this? Within all of the @app.get("/path") like this: REQUESTS.inc()
    • We can also get total requests per path using different counter objects, but that is not recommended. Instead we can use labels:
      • REQUESTS = Counter('http_requests_total', 'Total number of requests', labelnames=['path'])
      • REQUESTS.labels('/cars').inc()
    • Then you can do the same approach for different HTTP method: labelnames=['path', 'method'] and REQUESTS.labels('/cars', 'post').inc()
    • How to expose to /metrics endpoint though?
      from prometheus_client import Counter, start_http_server
      if __name__ == '__main__':
        start_http_server(8000) # start the metrics server on port
        app.run(port='5001')    # this is the Flask app
    • curl 127.0.0.1:8000 will show the metrics
    • however, you can also expose the metrics from a Flask route and have the Flask app on http://localhost:5001 and the metrics on http://localhost:5001/metrics, e.g. app.wsgi_app = DispatcherMiddleware(app.wsgi_app, { '/metrics': make_wsgi_app() }) - see the sketch after the complete example below
  • complete working example:
    from flask import Flask
    from prometheus_client import Counter, start_http_server, Gauge
    
    REQUESTS = Counter('http_requests_total', 'Total number of requests', labelnames=['path', 'method'])
    
    ERRORS = Counter('http_errors_total',
                    'Total number of errors', labelnames=['code'])
    
    IN_PROGRESS = Gauge('inprogress_requests',
                        'Total number of requests in progress')
    
    def before_request():
        IN_PROGRESS.inc()
    
    def after_request(response):
        IN_PROGRESS.dec()
        return response
    
    app = Flask(__name__)
    app.before_request(before_request)   # register the hooks so the in-progress gauge is actually updated
    app.after_request(after_request)
    
    @app.get("/products")
    def get_products():
        REQUESTS.labels('products', 'get').inc()
        return "product"
    
    @app.post("/products")
    def create_product():
        REQUESTS.labels('products', 'post').inc()
        return "created product", 201
    
    @app.get("/cart")
    def get_cart():
        REQUESTS.labels('cart', 'get').inc()
        return "cart"
    
    @app.post("/cart")
    def create_cart():
        REQUESTS.labels('cart', 'post').inc()
        return "created cart", 201
    
    @app.errorhandler(404)
    def page_not_found(e):
        ERRORS.labels('404').inc()
        return "page not found", 404
    
    if __name__ == '__main__':
        start_http_server(8000)
        app.run(debug=False, host="0.0.0.0", port=6000)
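
A minimal sketch of the DispatcherMiddleware approach mentioned earlier, serving /metrics on the same port as the app (the route and port are illustrative):

    from flask import Flask
    from prometheus_client import make_wsgi_app
    from werkzeug.middleware.dispatcher import DispatcherMiddleware

    app = Flask(__name__)

    @app.get("/")
    def index():
        return "hello"

    # mount the Prometheus WSGI app under /metrics of the Flask app
    # (metrics will be served on http://localhost:5001/metrics)
    app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {'/metrics': make_wsgi_app()})

    if __name__ == '__main__':
        app.run(port=5001)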

Implementing histogram & summary in your Python code (example)

# add a histogram metric to track latency/response time for each request
# (requires: from prometheus_client import Histogram)
LATENCY = Histogram('request_latency_seconds', 'Request Latency', labelnames=['path', 'method'])
# in a before_request hook, record the start time: request.start_time = time.time()
# in an after_request hook, calculate request_latency = time.time() - request.start_time and pass it to:
LATENCY.labels(request.path, request.method).observe(request_latency)
  • client libraries can let you specify bucket sizes (e.g. buckets=[0.01, 0.02, 0.1])
  • to configure a summary it is exactly the same, just use LATENCY = Summary(...) instead of Histogram(...)
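
A minimal runnable sketch tying the histogram to Flask request hooks (the bucket values, ports and the /products route are illustrative):

    import time
    from flask import Flask, request
    from prometheus_client import Histogram, start_http_server

    LATENCY = Histogram('request_latency_seconds', 'Request Latency',
                        labelnames=['path', 'method'],
                        buckets=[0.01, 0.02, 0.1, 0.5, 1, 5])

    app = Flask(__name__)

    @app.before_request
    def start_timer():
        request.start_time = time.time()

    @app.after_request
    def record_latency(response):
        LATENCY.labels(request.path, request.method).observe(time.time() - request.start_time)
        return response

    @app.get("/products")
    def get_products():
        return "product"

    if __name__ == '__main__':
        start_http_server(8000)   # metrics exposed on http://localhost:8000/metrics
        app.run(port=6000)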

Implementing gauge metric in your Python code (example)

# track the number of active requests getting processed at the moment
# (requires: from prometheus_client import Gauge)
IN_PROGRESS = Gauge('inprogress_requests', 'Total number of requests in progress')
# a before_request hook then increments it: IN_PROGRESS.inc()
# an after_request hook decrements it once the request is done: IN_PROGRESS.dec()
# (if labelnames=['path', 'method'] were declared, you would call IN_PROGRESS.labels(path, method).inc() instead)

Best practices

  • use snake_case naming, all lowercase, e.g. library_name_unit_suffix
  • first word should be app/library name it is used for
  • next add what is it used for
  • add unit (_bytes) at the end, use unprefixed base units (not microseconds or kilobytes)
  • avoid _count, _sum, _bucket suffixes
  • good examples: process_cpu_seconds, http_requests_total, redis_connection_errors, node_disk_read_bytes_total
  • bad examples: container_docker_restarts, http_requests_sum, nginx_disk_free_kilobytes, dotnet_queue_waiting_time
  • three types of services/apps:
    • online - immediate response is expected (tracking queries, errors, latency etc)
    • offline - no one is actively waiting for response (amount of queue, wip, processing rate, errors etc)
    • batch - similar to offline but regular, needs push gw (time processing, overall runtime, last completion time)

Service Discovery

  • allows Prometheus to dynamically update/populate/remove a list of endpoints to scrape
  • several built-ins: file, ec2, azure, gce, consul, nomad, k8s...
  • in the Web ui: "status" - "service discovery"

File SD

  • list of jobs/targets can be imported from a json/yaml file(s)
  • example #1:
    scrape_configs:
      - job_name: file-example
        file_sd_configs:
          - files:
            - file-sd.json
            - '*.json'
  • then the file-sd.json would look like e.g.:
    [
      {
        "targets": [ "node1:9100", "node2:9100" ],
        "labels": {
          "team": "dev",
          "job": "node"
        }
      }
    ]

AWS

  • just need to configure EC2 discovery in the config:
    scrape_configs:
      - job_name: ec2
        ec2_sd_configs: # IAM with at least AmazonEC2ReadOnly policy
          - region: <region>
            access_key: <access key>
            secret_key: <secret key>
  • automatically extracts metadata for each EC2 instance
  • defaults to using private IPs

Re-labeling

  • classify Prometheus targets & metrics by rewriting their label set
  • e.g. rename instance from node1:9100 to just node1, drop metrics, drop labels etc
  • 2 options:
    • relabel_configs (in Prometheus.yml) which occurs before scrape and only has access to labels added by SD mechanism
    • metric_relabel_configs (in Prometheus.yml) which occurs after the scrape

Examples - relabel_configs

  • example #1: __meta_ec2_tag_env = dev | prod
    - job_name: aws
      relabel_configs:
        - source_labels: [__meta_ec2_tag_env] # array of labels to match on
          regex: prod                         # to match on specific value of that label
          action: keep|drop|replace           # keep = only targets matching the regex keep being scraped (everything else is implicitly dropped), drop = matching targets are no longer scraped
  • example #2: when there are more than 1 source labels (array) they will be joined by a ;
    relabel_configs:
    - source_labels: [env, team]  # if the target has {env=dev} and {team=marketing}, we will keep it
      regex: dev;marketing
      action: keep                # everything else will be dropped
      # separator: "-"            # optional: the separator property changes the delimiter used to join the label values (default is ;)
  • target labels = labels that are added to every time series returned from a scrape of that target (i.e. they are assigned to every metric from that specific target). Discovered labels (those starting with __) are dropped after the initial relabeling process and do not become target labels.
  • example #3 of saving __address__=192.168.1.1:80 label in target label, but need to transform into {ip=192.168.1.1}
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*):.*    # assign everything before the `:` into a group referenced with `$1` below
        target_label: ip  # name of the new label
        action: replace
        replacement: $1
  • example #4 of combining labels env="dev" & team="web" will turn into info="web-dev"
    relabel_configs:
      - source_labels: [team, env]
        regex: (.*);(.*)  # parentheses create capture groups you can reference as $1, $2 below
        action: replace
        target_label: info
        replacement: $1-$2
  • example #5 Re-label so the label team name changes to the organization and the value gets prepended with org-
    relabel_configs:
    - source_labels: [team]
      regex: (.*)
      action: replace
      target_label: organization
      replacement: org-$1
  • to drop the label, use action: labeldrop based on the regex:
    - regex: size
      action: labeldrop
  • the opposite of labeldrop is labelkeep - but keep in mind ALL other labels will be dropped!
    - regex: instance|job
      action: labelkeep
  • to modify the label name (not the value), use labelmap like this:
    - regex: __meta_ec2_(.*)  # match any of these ec2 discovered labels - e.g. __meta_ec2_ami="ami-abcdefgh123456"
      action: labelmap
      replacement: ec2_$1     # we will prepend it with `ec2` - e.g. ec2_ami="ami-abcdefgh123456"

Examples - metric_relabel_configs

  • takes place after the scrape is performed and has access to the scraped metrics (not just the labels)
  • configuration is identical to relabel_configs
  • example #1:
    - job_name: example
      metric_relabel_configs: # this will drop a metric http_errors_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: drop        # or keep, which will drop EVERY other metrics
  • example #2:
    - job_name: example
      metric_relabel_configs: # rename a metric name from http_errors_total to http_failures_total
        - source_labels: [__name__]
          regex: http_errors_total
          action: replace
          target_label: __name__            # what's the new name of the label key
          replacement: http_failures_total  # replacement is the new name of the value / the name of the metric
  • example #3:
    - job_name: example
      metric_relabel_configs: # drop a label named code
        - regex: code
          action: labeldrop   # drop a label for a metric
  • example #4:
    - job_name: example
      metric_relabel_configs: # strips off the forward slash and renames {path=/cars} -> {endpoint=cars}. Keep in mind there will now be both a path and an endpoint label; use labeldrop to get rid of the original path label, which carries the same information.
        - source_labels: [path]
          regex: \/(.*)       # any text after the forward slash (wrapping it in parenthesis gives you access with $)
          action: replace
          target_label: endpoint
          replacement: $1     # match the original value

Push Gateway

  • By default, Pushgateway listens to port 9091
  • used when a (batch) process has already exited before the scrape occurs
  • middle man between batch job and Prometheus server
  • Prometheus will scrape metrics from the PG
  • installation:
    1. pushgateway-1.4.3.linux-amd64.tar.gz from the releases page, untar, run ./pushgateway
    2. create a new user sudo useradd --no-create-home --shell /bin/false pushgateway
    3. copy the binary to /usr/local/bin, change owner to pushgateway, configure service file (same as the Prometheus)
    4. systemctl daemon-reload, restart, enable
    5. Test curl localhost:9091/metrics
  • configure Prometheus to scrape the gateway: same as other targets, but it needs the additional honor_labels: true (so the pushed metrics keep their own job/instance labels instead of being overwritten by the Pushgateway target's labels) - a minimal example scrape config is sketched at the end of this list
  • for sending the metrics, you send via HTTP POST request: http://<pushgateway_addr>:<port>/metrics/job/<job_name>/<label1>/<value1>/<label2>/<value2>... where job_name will be the job label of the metrics pushed, labels/values paths used as a grouping key, allows for grouping metrics together to update/delete multiple metrics at once. When sending a POST request, only metrics with the same name as the newly pushed, are replaced (this only applies to metrics in the same group):
    1. see the original metrics:
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
    2. POST the processing_time_seconds{quality="hd"} 999
    3. result:
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
  • example: push metric example_metric 4421 with a job label of {job="db_backup"}:
    # ('@-' tells curl to read the binary data from stdin)
    echo "example_metric 4421 | curl --data-binary @-http://localhost:9091/metrics/job/db_backup
  • another example with sending multiple metrics at once:
    cat <<EOF | curl --data-binary @- http://localhost:9091/metrics/job/video_processing/instance/mp4_node1
    processing_time_seconds{quality="hd"} 120
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    EOF
  • When using HTTP PUT request however, the behavior is different. All metrics within a specific group get replaced by the new metrics being pushed (deletes preexisting):
    1. start with:
    processing_time_seconds{quality="hd"} 999
    processed_videos_total{quality="hd"} 10
    processed_bytes_total{quality="hd"} 4400
    
    2. PUT the processing_time_seconds{quality="hd"} 666
    3. result:
    processing_time_seconds{quality="hd"} 666
    
  • HTTP DELETE request will delete all metrics within a group (not going to touch any metrics in the other groups): curl -X DELETE http://localhost:9091/metrics/job/archive/app/web will only delete all with {app="web"}
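
A minimal example of the scrape config mentioned above (the target address is illustrative):

    scrape_configs:
      - job_name: pushgateway
        honor_labels: true              # keep the job/instance labels that were pushed with the metrics
        static_configs:
          - targets: ["localhost:9091"]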

Client library

  • Python: from prometheus_client import CollectorRegistry, pushadd_to_gateway, then initialize registry = CollectorRegistry(). You can then push via pushadd_to_gateway('user2:9091', job='batch', registry=registry)
  • 3 functions within a library to push metrics:
    • push - same as HTTP PUT (any existing metrics for this job are removed and the pushed metrics added)
    • pushadd - same as HTTP POST (overrides existing metrics with the same names, but all other metrics in group remain unchanged)
    • delete - same as HTTP DELETE (all metrics for a group are removed)
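
A minimal sketch of a batch job pushing a metric via the client library (the gateway address, job name and metric name are illustrative):

    from prometheus_client import CollectorRegistry, Gauge, pushadd_to_gateway

    registry = CollectorRegistry()
    last_success = Gauge('db_backup_last_success_timestamp_seconds',
                         'Unix time of the last successful DB backup',
                         registry=registry)
    last_success.set_to_current_time()

    # pushadd behaves like HTTP POST: only metrics with the same name in this group are replaced
    pushadd_to_gateway('localhost:9091', job='db_backup', registry=registry)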

Alerting

  • lets you define conditions that, if met, trigger alerts
  • these are standard PromQL expressions (e.g. node_filesystem_avail_bytes < 1000 returns only the series below the threshold, such as a value of 547)
  • Prometheus is only responsible for triggering alerts
  • responsibility of sending notification is offloaded onto alertmanager -> Slack, email, SMS etc.
  • alerts are visible in the web gui under "alerts" and they are green if not alerting
  • alerting rules are similar to recording rules, in fact they are in the same location (rule_files in prometheus.yaml):
    groups:
      - name: node
        interval: 15s
        rules:
          - record: ...
            expr: ...
          - alert: LowMemory
            expr: node_memory_memFree_percent < 20
  • The for clause tells Prometheus that an expression must evaluate to true for a specific period of time before the alert fires:
    - alert: node down
      expr: up{job="node"} == 0
      for: 5m   # expects the node to be down for 5 minutes before firing an alert
  • 3 alert states:
    1. inactive - has not returned any results [green]
    2. pending - it hasn't been long enough to be considered firing (related to for) [orange]
    3. firing - active for more than the defined for clause [red]

Labels & Annotations

  • optional labels can be added to alerts to provide a mechanism to classify and match alerts
  • important, because they can be used when you set up rules in the alert manager so you can match on these and group them together
- alert: node down
  expr: ...
  labels:
    severity: warning
- alert: multiple nodes down
  expr: ...
  labels:
    severity: critical
  • annotations (use Go templating) can be used to provide additional/descriptive information (unlike labels they do not play a part in the alerts identity)
- alert: node_filesystem_free_percent
  expr: ...
  annotations:
    description: "Filesystem {{.Labels.device}} on {{.Labels.instance}} is low on space, current available space is {{.Value}}"

This is how the templating works:

  • {{.Labels}} to access alert labels
  • {{.Labels.instance}} to get instance label
  • {{.Value}} to get the firing sample value
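
A consolidated sketch of an alerting rule combining the for clause, labels and templated annotations (the metric, threshold and durations are illustrative):

    groups:
      - name: node
        rules:
          - alert: NodeFilesystemLowSpace
            expr: 100 * node_filesystem_avail_bytes / node_filesystem_size_bytes < 10
            for: 10m
            labels:
              severity: warning
            annotations:
              description: "Filesystem {{.Labels.device}} on {{.Labels.instance}} is low on space, only {{.Value}}% free"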

Alertmanager

  • By default, Alertmanager is running on port 9093
  • responsible for receiving alerts generated by Prometheus and converting them to notifications
  • supports multiple Prometheus servers via API
  • workflow:
    1. dispatcher picks up the alerts first,
    2. inhibition allows suppressing certain alerts if other alerts already exist,
    3. silencing mutes alerts (e.g. maintenance)
    4. routing is responsible for deciding which alert gets sent to which receiver
    5. notification integrates with all 3rd party tools (email, Slack, SMS, etc.)
  • installation:
    1. tarball (alertmanager-0.24.0.linux-amd64.tar.gz) contains alertmanager binary, alertmanager.yml config file, amtool command line utility and data folder where the notification states are stored
    2. The installation is the same as previous tools (add new user, create /etc/alertmanager, create /var/lib/alertmanager, copy executables to /usr/local/bin, change ownerships, create service file, daemon-reload, start, enable). ExecStart in systemd expects --config.file and --storage.path!
    3. starting it is simple: ./alertmanager; it listens on port 9093 (you can see the web interface at http://localhost:9093)
    4. restarting AM can be done via HTTP POST to /-/reload endpoint, systemctl restart alertmanager or killall -HUP alertmanager
  • configure Prometheus to use that alertmanager:
    global: ...
    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - 127.0.0.1:9093
                - alertmanager2:9093
  • there are 3 main sections of alertmanager.yml:
    1. global - applies across all sections which can be overwritten (e.g. smtp_smarthost)
    2. route - set of rules to determine what alerts get matched up (match_re, matchers) with what receiver
    • at the top level, there is a default route - any alerts that don't match any of the other routes will use this default, example route:
    route:
      routes:
        - match_re:               # regular expression
            job: (node|windows)
          receiver: infra-email
        - matchers:               # all alerts with job=kubernetes & severity=ticket labels will match this rule
            job: kubernetes
            severity: ticket
          receiver: k8s-slack     # they will be sent to this receiver
    • nested routes / subroutes are also supported:
    routes:
    - matchers:                   # parent route
        job: kubernetes           # 2. all other alerts with this label will match this main route (k8s-email)
      receiver: k8s-email
      routes:                     # sub-route for further route matching (logical AND)
        - matchers:
            severity: pager       # 1. if the alert also has the label severity=pager, it will be sent to k8s-pager
          receiver: k8s-pager
    • if you need an alert to match two routes, use continue:
    route:
      routes:
        - receiver: alert-logs    # all alerts to be sent to alert-logs
          continue: true
        - matchers:
            job: kubernetes       # AND then if it also has this label job=kubernetes, it will be also sent to k8s-email
          receiver: k8s-email
    • grouping allows you to split up your notifications by labels (otherwise all alerts result in one big notification):
    receiver: fallback-pager
    group_by: [team]
    routes:
      - matchers:
          team: infra
        group_by: [region,env]    # infra team has alerts grouped based on region and env labels
        receiver: infra-email
        # any child routes underneath here will inherit the grouping policy and group based on same 2 labels region, env
    3. receivers - one or more notifiers to forward alerts to users (e.g. slack_configs)
    • make use of global configurations so all of the receivers don't have to manually define the same key:
    global:
      victorops_api_key: XXX      # this will be automatically provided to all receivers below
    receivers:
      - name: infra-pager
        victorops_configs:
          - routing_key: some-route-here
    • you can customize the message by using Go templating:
      • GroupLabels (e.g. title: in slack_configs: {{.GroupLabels.severity}} alerts in region {{.GroupLabels.region}})
      • CommonLabels
      • CommonAnnotations
      • ExternalURL
      • Status
      • Receiver
      • Alerts (e.g. text: in slack_configs: {{.Alerts | len}} alerts:)
        • Labels
        • Annotations ({{range .Alerts}}{{.Annotations.description}}{{"\n"}}{{end}})
        • Status
        • StartsAt
        • EndsAt
  • Example alertmanager.yml config:
    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@prometheus-server.com'
    route:
      group_by: ['alertname']
      group_wait: 10s
      group_interval: 2m
      repeat_interval: 1h
      receiver: 'general-email'
      routes:
        - matchers:
                - team=global-infra
          receiver: global-infra-email
        - matchers:
                - team=internal-infra-email
          receiver: internal-infra-email
    receivers:
      - name: 'web.hook'
        webhook_configs:
          - url: 'http://127.0.0.1:5001/'
      - name: global-infra-email
        email_configs:
                - to: root@prometheus-server.com
                  require_tls: false
      - name: internal-infra-email
        email_configs:
                - to: admin@prometheus-server.com
                  require_tls: false
      - name: general-email
        email_configs:
                - to: admin@prometheus-server.com
                  require_tls: false

Silences

  • alerts can be silenced to prevent them from generating notifications for a period of time (e.g. maintenance windows)
  • via the "new silence" button you specify start, end/duration, matchers (list of labels), creator and comment
  • you can then view those in the "silence" tab
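
Silences can also be created from the command line with amtool (a sketch; the matchers and values are illustrative):

    amtool silence add alertname=LowMemory instance=node1:9100 \
      --alertmanager.url=http://localhost:9093 \
      --duration=2h --author=ops --comment="planned maintenance"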

Monitoring Kubernetes

  • for both applications & clusters (control plane components, kubelet/cAdvisor, kube-state-metrics, node-exporter)
  • deploy Prometheus as close to targets as possible
  • make use of preexisting Kube infrastructure
  • to get access to cluster level metrics, we need kube-state-metrics
  • node-exporter should run on every node (deployed as a DaemonSet)
  • make use of service discovery via Kube API

Installation via Helm chart

  1. source: https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack
  2. makes use of the Prometheus Operator (https://github.com/prometheus-operator/prometheus-operator)
  3. it ships a couple of custom resources (CRDs): Prometheus, PrometheusRule, AlertmanagerConfig, ServiceMonitor, PodMonitor
  4. Add Helm repo: helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  5. Update Helm repo: helm repo update
  6. Export all possible values: helm show values prometheus-community/kube-prometheus-stack > values.yaml
  7. Install the chart: helm install prometheus prometheus-community/kube-prometheus-stack
  8. (Optional kubectl patch ds prometheus-prometheus-node-exporter --type "json" -p '[{"op": "remove", "path" : "/spec/template/spec/containers/0/volumeMounts/2/mountPropagation"}]' - might need this due to node-exporter bug)
  • What does it do?
    • installs 2 StatefulSets (AM, Prometheus), 3 Deployments (Grafana, kube-prometheus-operator, kube-state-metrics), 1 DaemonSet (node-exporter)
    • SD can discover node, service, pod, endpoint (discovers targets from listed endpoints of a service. For each endpoint address one target is discovered per port. If the endpoint is backed up by a pod, all additional container ports of the pod, not bound to an endpoint port, are discovered as targets as well)

Monitor K8s Application

  • once you have application deployed and listening on some port (i.e. 3000), you can change the Prometheus value additionalScrapeConfigs in the Helm chart and upgrade via helm upgrade prometheus prometheus-community/kube-prometheus-stack -f new-values.yaml (this is less ideal option, it is better to use service monitors to apply new scrapes more declaratively)
  • instead, look at CRDs: kubectl get crd, specifically prometheuses, servicemonitors (set of targets to monitor and scrape, they allow to avoid touching config directly and give you a declarative Kube syntax to define targets)
  • if you want to scrape e.g. service named api-service exposing metrics on /swagger-stats/metrics, use:
    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: api-service-monitor
      labels:
        release: prometheus # default label that is used by serviceMonitorSelector - it dynamically discovers it
        app: prometheus
    spec:
      jobLabel: job       # look for label job in the Service and take the value
      endpoints:
        - interval: 30s   # equivalent of scrape_interval
          port: web       # matches up with the port 3000 in the Service definition
          path: /swagger-stats/metrics  # equivalent of metrics_path (path where the metrics are exposed)
      selector:
        matchLabels:
          app: service-api
  • but also look at kind: Prometheus and what is under serviceMonitorSelector (e.g. matchLabels: release: prometheus) - this label allows Prometheus to find service monitors in the cluster and register them so that it can start scraping the app the service monitor is pointing to (can be confirmed via Web UI - Status - Configuration)
  • to add rules, use CRD called PrometheusRule - e.g.:
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        release: prometheus   # similar to ServiceMonitor, to add the rule dynamically
      name: api-rules
    spec:
      groups:
        - name: api
          rules:
            - alert: down
              expr: up == 0
              for: 0m
              labels:
                severity: critical
              annotations:
                summary: Prometheus target missing {{$labels.instance}}
  • to add AM rules, use CRD called AlertmanagerConfig - e.g.:
    apiVersion: monitoring.coreos.com/v1alpha1
    kind: AlertmanagerConfig
    metadata:
      name: alert-config
      labels:
        resource: prometheus  # once again, must match alertmanagerConfigSelector - BUT Helm chart does not specify a label, so you need to update this value yourself!
    spec:
      route:
        groupBy: ["severity"]
        groupWait: 30s
        groupInterval: 5m
        repeatInterval: 12h
        receiver: "webhook"
      receivers:
        - name: "webhook"
          webhookConfigs:
            - url: "http://example.com/"
  • table - keep in mind the differences between a standard AM and K8s one:
| Standard                   | Kubernetes                              |
|----------------------------|-----------------------------------------|
| group_by                   | groupBy                                 |
| group_wait                 | groupWait                               |
| group_interval             | groupInterval                           |
| repeat_interval            | repeatInterval                          |
| matchers: job: kubernetes  | matchers: name: job, value: kubernetes  |

Conclusion

Default ports:

| Component     | Port number |
|---------------|-------------|
| prometheus    | 9090        |
| node-exporter | 9100        |
| push gateway  | 9091        |
| alertmanager  | 9093        |

Author: @luckylittle

Last update: Wed Jan 25 05:22:25 UTC 2023
