@kivanio, forked from nicosingh/alert-manager.md (created April 4, 2022)
A small and local configuration for Prometheus + AlertManager + Slack notifications + Unsee

TL;DR

Clone this repo:

git clone https://gist.github.com/08be6d6e7605a43fe52d1f201c2b47d8.git
cd 08be6d6e7605a43fe52d1f201c2b47d8

Start the docker stack:

docker-compose up -d

And visit http://localhost:8080 to check current alerts.

To simulate a downtime case, you can stop the monitored service (rabbitmq):

docker-compose scale rabbitmq=0

Wait a few seconds and a new alert should show up in the Unsee UI.
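
If you also publish AlertManager's port 9093 to the host (an addition, not part of the original compose file; see the override example later in this document), you can confirm the alert through AlertManager's v1 API and then bring the broker back so the alert resolves:

# list the alerts AlertManager currently knows about (assumes 9093 is published to the host)
curl -s http://localhost:9093/api/v1/alerts

# bring RabbitMQ back; the alert should resolve shortly afterwards
docker-compose scale rabbitmq=1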


What is this?

This is a small experiment to get familiar with alerting tools for Prometheus. AlertManager lets us turn Prometheus metric values into alerts and route them to several destinations, such as Slack, email, PagerDuty, OpsGenie, or custom webhooks.

How does it work?

Each of our apps is monitored by Prometheus, which collects relevant metrics about the health of each service (uptime, CPU usage, number of requests, and so on). These metrics can be displayed in a fancy Grafana dashboard, and they are also evaluated against alerting rules; when a rule matches, Prometheus hands the resulting alert to AlertManager, which decides where to send it and when to resolve it. These alerts can trigger email and Slack notifications, and can be concentrated in a single UI to track our incidents in real time.

Show me the code

The first file we want to look at is docker-compose.yml. It contains a small set of services matching the components described above:

I chose RabbitMQ as the monitored service for no particular reason; it is simply easy to run and easy to monitor with Prometheus.

As we can see in the first two services, Prometheus needs an exporter container to fetch metrics from, in this case called rabbitmq-exporter. This container reads metrics from RabbitMQ (using the RABBIT_URL variable) and publishes them in a format Prometheus can scrape:

# docker-compose.yml
rabbitmq:
  image:             rabbitmq:3.7.8-management-alpine
  restart:           always
rabbitmq-exporter:
  image:             kbudde/rabbitmq-exporter:v0.29.0
  restart:           always
  environment:
    RABBIT_URL:      "http://rabbitmq:15672"
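
If you want to inspect the exporter's output by hand, you can publish its port to the host (an addition, not in the original compose file) and curl the /metrics endpoint; the rabbitmq_up gauge used later by the alert rule should appear there:

# assumes the rabbitmq-exporter service also declares, e.g., ports: ["9419:9090"]
# (9419 is just an arbitrary free host port)
curl -s http://localhost:9419/metrics | grep rabbitmq_up
# expected output while RabbitMQ is running: rabbitmq_up 1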

Then, we have a prometheus container that reads the metrics exposed by rabbitmq-exporter:

# docker-compose.yml
prometheus:
  image:             prom/prometheus:v2.6.0
  restart:           always
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml

The link between prometheus and rabbitmq-exporter is configured in the prometheus.yml file, which is mounted into the Prometheus container as a volume:

# prometheus.yml
scrape_configs:
  - job_name: 'rabbitmq-test'
    scrape_interval: 1s
    metrics_path: /metrics
    static_configs:
      - targets: ['rabbitmq-exporter:9090']

This tells Prometheus to scrape the RabbitMQ exporter's /metrics endpoint every second.
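
To double-check that Prometheus is really scraping the exporter, you can run an instant query against its HTTP API. This assumes Prometheus's port 9090 is published to the host, which the original compose file does not do (see the override example later in this document):

# instant query for the metric the alert rule watches
curl -s 'http://localhost:9090/api/v1/query?query=rabbitmq_up'
# a healthy broker returns a vector whose value ends in "1"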

The next step is to configure the alert rules that will feed AlertManager. These are also part of the Prometheus configuration: they are defined in the alerting_rules.yml file, which is likewise mounted into the Prometheus container as a volume:

# alerting_rules.yml
groups:
  - name: alerting_rules
    interval: 1s
    rules:
      - alert: rabbitmqDown
        expr: rabbitmq_up == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ is Down"
          description: "RabbitMQ is so dead"

This file defines a rule that checks whether the rabbitmq_up metric (exposed by rabbitmq-exporter) has stayed at 0 for 10 seconds. If so, a new critical alert called rabbitmqDown is fired, carrying the summary and description defined in the annotations section.
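
The same pattern extends to other metrics published by the exporter. As a sketch (not part of the original gist, and assuming the exporter exposes a per-queue rabbitmq_queue_messages gauge), a rule warning about a growing backlog could be added to the same group:

# hypothetical extra rule, placed next to rabbitmqDown in alerting_rules.yml
- alert: rabbitmqQueueBacklog
  expr: rabbitmq_queue_messages > 1000
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "RabbitMQ queue backlog"
    description: "Queue {{ $labels.queue }} has held more than 1000 messages for a minute"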

Now that we have Prometheus generating Alerts, it is time for AlertManager to come in:

# docker-compose.yml
alertmanager:
  image: prom/alertmanager:v0.15.3
  restart: always
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

Using the official AlertManager Docker image, we can quickly configure a container that receives alerts from Prometheus and triggers Slack notifications:

# alertmanager.yml
route:
  group_by: ['instance', 'severity']
  routes:
  - match:
      alertname: rabbitmqDown
  receiver: 'tranque-slack-hook'

receivers:
- name: 'tranque-slack-hook'
  slack_configs:
  - api_url: "https://hooks.slack.com/services/your-slack-hook"
    title: "{{ .CommonAnnotations.summary }}"
    title_link: ""
    text: "RabbitMQ server got down for 10 seconds"

Given the alert called rabbitmqDown (defined in the alerting_rules.yml file), we can trigger a Slack notification through the receiver declared as tranque-slack-hook. That receiver configuration needs a Slack webhook URL, plus the title and text of the message to send to our Slack channel; here the title reuses the alert's summary annotation.
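
To test the Slack route without actually stopping RabbitMQ, you can push a synthetic alert straight into AlertManager's v1 API (again assuming its port 9093 is published to the host):

# fire a fake rabbitmqDown alert; AlertManager should route it to the tranque-slack-hook receiver
curl -s -XPOST -H 'Content-Type: application/json' \
  http://localhost:9093/api/v1/alerts \
  -d '[{"labels": {"alertname": "rabbitmqDown", "severity": "critical", "instance": "manual-test"},
       "annotations": {"summary": "RabbitMQ is Down", "description": "manual test alert"}}]'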

Finally, we can set up an Unsee container as a real-time alert UI that aggregates all the alerts currently held by AlertManager:

unsee:
  image: cloudflare/unsee:latest
  restart: always
  environment:
    ALERTMANAGER_URI: http://alertmanager:9093
  ports:
    - 8080:8080

Since unsee's port 8080 is published to the host, we can open http://localhost:8080 and see an interface listing all active alerts with their corresponding descriptions and labels.
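
By default the stack only publishes Unsee's port 8080. If you also want to reach Prometheus and AlertManager from the host, as in the curl examples above, a docker-compose.override.yml like the following would do it (a local debugging aid, not part of the original gist; docker-compose merges it automatically):

# docker-compose.override.yml (hypothetical, local debugging only)
version: "3.4"
services:
  prometheus:
    ports:
      - 9090:9090
  alertmanager:
    ports:
      - 9093:9093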

Complete files

For reference, here are the full configuration files from this gist:

# alerting_rules.yml
# alert rules (to be used by AlertManager)
groups:
  - name: alerting_rules
    # rules evaluation period
    interval: 1s
    rules:
      # rabbitmq metrics simple rule
      - alert: rabbitmqDown
        # rabbitmq is not running
        expr: rabbitmq_up == 0
        # during 10 seconds
        for: 10s
        labels:
          # sounds important
          severity: critical
        annotations:
          summary: "RabbitMQ is Down"
          description: "RabbitMQ is so dead"

# alertmanager.yml
global:
  resolve_timeout: 1m
route:
  group_by: ['instance', 'severity']
  # wait to send notification
  group_wait: 1s
  # wait to resend notification
  repeat_interval: 1h
  routes:
  - match:
      # alert name defined in alerting_rules.yml
      alertname: rabbitmqDown
  receiver: 'tranque-slack-hook'
receivers:
# slack hook configuration
- name: 'tranque-slack-hook'
  slack_configs:
  - api_url: "https://hooks.slack.com/services/your-slack-hook"
    title: "{{ .CommonAnnotations.summary }}"
    title_link: ""
    text: "RabbitMQ server got down for 10 seconds"
version: "3.4"
services:
# set up any service to be monitored using prometheus. I chose rabbitmq with no particular reason
rabbitmq:
image: rabbitmq:3.7.8-management-alpine
restart: always
# export rabbitmq metrics to prometheus
rabbitmq-exporter:
image: kbudde/rabbitmq-exporter:v0.29.0
restart: always
environment:
RABBIT_URL: "http://rabbitmq:15672"
# prometheus server
prometheus:
image: prom/prometheus:v2.6.0
restart: always
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
# alertmanager server
alertmanager:
image: prom/alertmanager:v0.15.3
restart: always
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
# alertmanager UI
unsee:
image: cloudflare/unsee:latest
restart: always
environment:
ALERTMANAGER_URI: http://alertmanager:9093
ALERTMANAGER_PROXY: true
ALERTMANAGER_INTERVAL: 5s
ports:
- 8080:8080

# prometheus.yml
# set a label on this prometheus instance
global:
  external_labels:
    monitor: 'tranque-monitor'
# connect to AlertManager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
# include alert rules (to be used by AlertManager)
rule_files:
  - "/etc/prometheus/alerting_rules.yml"
# read metrics from several sources
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'rabbitmq-test'
    scrape_interval: 1s
    metrics_path: /metrics
    static_configs:
      - targets: ['rabbitmq-exporter:9090']