@fstab
Last active October 5, 2021 14:24
Prometheus Workshop Notes

These are notes for my Prometheus workshop. The follow-up workshop on Prometheus/Kubernetes can be found here.

Overview

  • Technology: Time Series Database
  • Approach: Black Box vs. White Box
  • Scope: Time Series (Prometheus) vs. Logfiles (ELK) vs. Tracing (Zipkin)

node_exporter

  • Download, extract, run ./node_exporter
  • Show example metrics (node_cpu, node_network_receive_bytes, node_filesystem_avail)
  • What is a time series, what are labels?
  • Run second instance using ./node_exporter -web.listen-address=":9101"
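
A minimal Python sketch for the "what is a time series, what are labels?" point: it splits one line of the /metrics text output into metric name, labels, and value. The sample lines and their values are made up for illustration, and the parser is deliberately simplified (no comment lines, no timestamps, label values are not unescaped).

```python
import re

# One sample line of the text exposition format looks like:
#   node_cpu{cpu="cpu0",mode="idle"} 1234.5
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition line into (metric name, labels dict, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a sample line: %r" % line)
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def series_id(name, labels):
    """A time series is identified by its name plus its complete label set."""
    return (name, tuple(sorted(labels.items())))


# Hypothetical sample lines: same metric name, different label sets.
idle = parse_sample('node_cpu{cpu="cpu0",mode="idle"} 1234.5')
user = parse_sample('node_cpu{cpu="cpu0",mode="user"} 56.7')
print(series_id(*idle[:2]) != series_id(*user[:2]))  # two distinct time series
```

Each distinct combination of name and labels is its own series; the value scraped at each interval becomes one point in that series.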

Prometheus Server

  • Download Prometheus 2.0.0-beta.2, extract, edit prometheus.yml and add node1 and node2 jobs to scrape_configs, run ./prometheus
  • Show Status -> Targets (scrape interval is 15s, so each node will be "up" after 15s)
  • Push vs Pull, HA Prometheus
  • Example Queries:
    • node_network_receive_bytes -> Mention instance label, which is added by the Prometheus Server.
    • node_network_receive_bytes{device="lo0"} -> Bonus Question: Why are the values for node1 and node2 different?
    • sum(node_network_receive_bytes)
    • sum(node_network_receive_bytes) by (instance)
    • sum without(instance) (node_network_receive_bytes)
    • node_network_receive_bytes[5m]
    • rate(node_network_receive_bytes[5m])
    • rate(node_network_receive_bytes[5m]) / 1024
    • sum(rate(node_network_receive_bytes[5m]) / 1024)
    • sum(rate(node_network_receive_bytes[5m]) / 1024) by (instance)
    • sum without (device) (rate(node_network_receive_bytes[5m]) / 1024)
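
To illustrate the step from node_network_receive_bytes[5m] to rate(...), here is a rough Python sketch of what rate() computes for a single series. It is a simplification, not the Prometheus implementation: it handles counter resets but skips the extrapolation to the window boundaries that the real function performs. The sample values are invented, spaced at the 15s scrape interval.

```python
def simple_rate(samples):
    """Per-second increase over a window of (timestamp, value) counter samples.

    Simplified sketch: counter resets are handled, but there is no
    extrapolation to the edges of the window.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur < prev:
            # The counter went down, so the process restarted and the counter
            # began again at zero: everything up to `cur` is new.
            increase += cur
        else:
            increase += cur - prev
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    return increase / elapsed


# Hypothetical window of (seconds, bytes) samples for one device.
window = [(0, 1000), (15, 2500), (30, 4000), (45, 5500), (60, 7000)]
print(simple_rate(window))         # bytes per second
print(simple_rate(window) / 1024)  # KiB per second, like rate(...) / 1024
```

The range selector [5m] only picks the samples; rate() turns them into a single per-second number per series, which is why it can then be divided or summed.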

Grafana

  • There no longer seems to be a binary download for Mac, so run Grafana in Docker: docker run --rm -t -i -p 3000:3000 grafana/grafana.
  • Login as admin/admin
  • Add data source: Name prometheus, Type Prometheus, URL http://public-ip:9090, Access proxy. (Can't use localhost in the URL: Grafana runs inside the Docker container and the Prometheus server runs outside of it, so use the public IP address instead.)
  • Import and show example dashboard
  • Some things don't display correctly in the example dashboard, because the dashboard is for Prometheus 1.x, and we run 2.x beta. Example Fix: Edit Uptime, replace process_start_time_seconds with prometheus_config_last_reload_success_timestamp_seconds.
  • New metric with example query: sum without (device, job) (rate(node_network_receive_bytes[5m]))
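
The query above first computes a rate per series and then aggregates with sum without (device, job). A small Python sketch of that aggregation step, assuming the per-series rates are already known: series that become identical once the listed labels are dropped are added together. The label values and rates below are invented for the two local node_exporter instances.

```python
from collections import defaultdict


def sum_without(series, drop):
    """Sum the values of all series that share the same labels after
    removing the labels named in `drop`."""
    groups = defaultdict(float)
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop))
        groups[key] += value
    return dict(groups)


# Hypothetical per-series receive rates in bytes per second.
rates = [
    ({"instance": "localhost:9100", "job": "node1", "device": "lo0"}, 100.0),
    ({"instance": "localhost:9100", "job": "node1", "device": "en0"}, 400.0),
    ({"instance": "localhost:9101", "job": "node2", "device": "lo0"}, 100.0),
    ({"instance": "localhost:9101", "job": "node2", "device": "en0"}, 250.0),
]

for key, total in sum_without(rates, {"device", "job"}).items():
    print(dict(key), total)
```

With this data only the instance label survives, so the result has one series per instance, the same grouping that sum(...) by (instance) would give here.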

Alerts in Prometheus

  • Show metric rate(http_requests_total{job="node1"}[1m]) in Prometheus test UI (including graph) -> average number of requests per second during the last minute
  • Create file alerting.rules in old format:
ALERT MuchTraffic
    IF rate(http_requests_total{job="node1"}[1m]) > 5
    FOR 1m
    ANNOTATIONS {
        summary = "High request rate on {{ $labels.instance }}",
        description = "{{ $labels.instance }} has a request rate above 5 requests / second (current value: {{ $value }} requests / second)",
    }
  • Use promtool to convert to new format: ./promtool update rules alerting.rules (will create new file alerting.rules.yml)
  • Show alerting.rules.yml and verify with ./promtool check rules alerting.rules.yml
  • Add to rule_files section in prometheus.yml: - "alerting.rules.yml", then restart Prometheus
  • Go to "Alerts" tab to see that the alert is not active.
  • Run watch -n 0.1 wget -O- http://localhost:9100/metrics to make the alert active (after 1 minute).
  • During that minute, explain the watch command and mention that rules are used not only for alerting, but also for recording.
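
To make the FOR 1m clause concrete, here is a simplified Python sketch of how an alert moves from inactive to pending to firing. It is an illustration under assumptions, not Prometheus' rule engine: the function name, the 15s evaluation interval, and the request rates are all made up for the demo.

```python
def alert_states(values, threshold, for_seconds, interval):
    """Return the alert state after each rule evaluation.

    values: result of the alert expression at each evaluation,
            one entry every `interval` seconds (assumed, see lead-in).
    """
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval
        if value > threshold:
            if active_since is None:
                active_since = now
            # The condition must hold for the whole FOR duration before firing.
            if now - active_since >= for_seconds:
                states.append("firing")
            else:
                states.append("pending")
        else:
            # Condition no longer true: the FOR timer starts over.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical request rates (requests / second) at successive evaluations.
request_rates = [2, 6, 7, 8, 9, 9, 3, 6]
for state in alert_states(request_rates, threshold=5, for_seconds=60, interval=15):
    print(state)
```

This is why the alert shows up as pending in the "Alerts" tab right after the watch command starts and only becomes active a minute later.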

Alertmanager

  • Download, extract, run ./alertmanager -log.level debug -config.file simple.yml
  • Add the following to prometheus.yml:
    alerting:
      alertmanagers:
        - static_configs:
          - targets: ['localhost:9093']
    
  • Show alertmanager's simple.yml to explain some config options.

Advanced Topics

Question: How to model HTTP Server Response Times?

  • Example Histogram:
    http_request_duration_seconds_bucket{le="0.005"}  0
    http_request_duration_seconds_bucket{le="0.01"}   0
    http_request_duration_seconds_bucket{le="0.025"}  3
    http_request_duration_seconds_bucket{le="0.05"}  10
    http_request_duration_seconds_bucket{le="0.1"}   22
    http_request_duration_seconds_bucket{le="0.25"}  40
    http_request_duration_seconds_bucket{le="0.5"}   52
    http_request_duration_seconds_bucket{le="1.0"}   59
    http_request_duration_seconds_bucket{le="2.5"}   59
    http_request_duration_seconds_bucket{le="5"}     60
    http_request_duration_seconds_bucket{le="10"}    60
    http_request_duration_seconds_bucket{le="+Inf"}  60
    http_request_duration_seconds_count 60
    
  • Expose example Histogram:
    • Save to file example-histogram.txt
    • Run python -m SimpleHTTPServer 9301 (Python 2; with Python 3 use python3 -m http.server 9301)
    • Add to prometheus.yml:
      - job_name: 'example'
        metrics_path: '/example-histogram.txt'
        static_configs:
        - targets: ['localhost:9301']
      
    • Restart Prometheus
  • Example 1: Overall fraction of requests served within 250ms is 2/3 (= 40/60): sum(http_request_duration_seconds_bucket{le="0.25"}) by (job) / sum (http_request_duration_seconds_count) by (job)
  • Example 2: The same fraction in the last 5m window: sum(rate(http_request_duration_seconds_bucket{le="0.25"}[5m])) by (job) / sum(rate(http_request_duration_seconds_count[5m])) by (job). Does not work with the static example, because the numbers did not increase in the last 5m. Edit example-histogram.txt and increase some numbers for the demo.
  • Advanced question: Why sum(rate(...)) and not rate(sum(...))?
  • Example Summary:
    http_request_duration_seconds{quantile="0.5"} 0.25
    http_request_duration_seconds{quantile="0.9"} 0.3
    http_request_duration_seconds{quantile="0.99"} 2.7
    http_request_duration_seconds_sum 30.0
    http_request_duration_seconds_count 100.0
    
  • Explain histogram_quantile() function.
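
As a starting point for explaining histogram_quantile(), here is a simplified Python sketch of the estimate it makes from cumulative buckets, applied to the example histogram above. It assumes observations are spread evenly within each bucket (linear interpolation) and that the lowest bucket starts at 0. It works on the raw bucket counts for brevity, whereas in practice the function is usually applied to rate() of the buckets, and it is not a full reimplementation of the real function.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper
             bound and ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # No upper limit to interpolate towards: fall back to the
                # highest finite bucket boundary.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Buckets from the example histogram (le, cumulative count).
example = [
    (0.005, 0), (0.01, 0), (0.025, 3), (0.05, 10), (0.1, 22), (0.25, 40),
    (0.5, 52), (1.0, 59), (2.5, 59), (5.0, 60), (10.0, 60), (math.inf, 60),
]

print(histogram_quantile(0.5, example))  # median estimate, about 0.167s
print(histogram_quantile(0.9, example))  # 90th percentile estimate
```

The estimate can only be as precise as the bucket boundaries: the median is known to lie between 0.1s and 0.25s, and the interpolation picks a point inside that bucket. A summary, in contrast, reports quantiles that were precomputed by the client.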