@kivanio, forked from nicosingh/alert-manager.md (created April 4, 2022)
A small and local configuration for Prometheus + AlertManager + Slack notifications + Unsee

TL;DR

Clone this repo:

git clone https://gist.github.com/08be6d6e7605a43fe52d1f201c2b47d8.git
cd 08be6d6e7605a43fe52d1f201c2b47d8

Start the docker stack:

docker-compose up -d

And visit http://localhost:8080 to check current alerts.

To simulate a downtime case, you can stop the monitored service (rabbitmq):

docker-compose scale rabbitmq=0

Wait a few seconds and a new alert should show up in the Unsee UI.
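
If you also publish AlertManager's port 9093 to the host (an addition, not part of the original compose file; see the override example later in this document), you can confirm the alert through AlertManager's v1 API and then bring the broker back so the alert resolves:

# list the alerts AlertManager currently knows about (assumes 9093 is published to the host)
curl -s http://localhost:9093/api/v1/alerts

# bring RabbitMQ back; the alert should resolve shortly afterwards
docker-compose scale rabbitmq=1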


What is this?

This is a small experiment to get familiar with alerting tools for Prometheus. AlertManager lets us turn Prometheus metric values into alerts and route them to several destinations, such as Slack, email, PagerDuty, OpsGenie, or custom webhooks.

How does it work?

Each of our apps is monitored by Prometheus, which collects relevant metrics about the health of each service (uptime, CPU usage, number of requests, and so on). These metrics can be displayed in a fancy Grafana dashboard, and they are also evaluated against alerting rules; when a rule matches, Prometheus hands the resulting alert to AlertManager, which decides where to send it and when to resolve it. These alerts can trigger email and Slack notifications, and can be concentrated in a single UI to track our incidents in real time.

Show me the code

The first file we want to look at is docker-compose.yml. It contains a small set of services matching the components described above:

I chose RabbitMQ as the monitored service for no particular reason; it is simply easy to run and easy to monitor with Prometheus.

As we can see in the first two services, Prometheus needs an exporter container to fetch metrics from, in this case called rabbitmq-exporter. This container reads metrics from RabbitMQ (using the RABBIT_URL variable) and publishes them in a format Prometheus can scrape:

# docker-compose.yml
rabbitmq:
  image:             rabbitmq:3.7.8-management-alpine
  restart:           always
rabbitmq-exporter:
  image:             kbudde/rabbitmq-exporter:v0.29.0
  restart:           always
  environment:
    RABBIT_URL:      "http://rabbitmq:15672"
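
If you want to inspect the exporter's output by hand, you can publish its port to the host (an addition, not in the original compose file) and curl the /metrics endpoint; the rabbitmq_up gauge used later by the alert rule should appear there:

# assumes the rabbitmq-exporter service also declares, e.g., ports: ["9419:9090"]
# (9419 is just an arbitrary free host port)
curl -s http://localhost:9419/metrics | grep rabbitmq_up
# expected output while RabbitMQ is running: rabbitmq_up 1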

Then, we have a prometheus container that reads the metrics exposed by rabbitmq-exporter:

# docker-compose.yml
prometheus:
  image:             prom/prometheus:v2.6.0
  restart:           always
  volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml
    - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml

The link between prometheus and rabbitmq-exporter is configured in the prometheus.yml file, which is mounted into the Prometheus container as a volume:

# prometheus.yml
scrape_configs:
  - job_name: 'rabbitmq-test'
    scrape_interval: 1s
    metrics_path: /metrics
    static_configs:
      - targets: ['rabbitmq-exporter:9090']

This tells Prometheus to scrape the RabbitMQ exporter's /metrics endpoint every second.
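
To double-check that Prometheus is really scraping the exporter, you can run an instant query against its HTTP API. This assumes Prometheus's port 9090 is published to the host, which the original compose file does not do (see the override example later in this document):

# instant query for the metric the alert rule watches
curl -s 'http://localhost:9090/api/v1/query?query=rabbitmq_up'
# a healthy broker returns a vector whose value ends in "1"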

The next step is to configure the alert rules that will feed AlertManager. These are also part of the Prometheus configuration: they are defined in the alerting_rules.yml file, which is likewise mounted into the Prometheus container as a volume:

# alerting_rules.yml
groups:
  - name: alerting_rules
    interval: 1s
    rules:
      - alert: rabbitmqDown
        expr: rabbitmq_up == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ is Down"
          description: "RabbitMQ is so dead"

This file defines a rule that checks whether the rabbitmq_up metric (exposed by rabbitmq-exporter) has stayed at 0 for 10 seconds. If so, a new critical alert called rabbitmqDown is fired, carrying the summary and description defined in the annotations section.
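
The same pattern extends to other metrics published by the exporter. As a sketch (not part of the original gist, and assuming the exporter exposes a per-queue rabbitmq_queue_messages gauge), a rule warning about a growing backlog could be added to the same group:

# hypothetical extra rule, placed next to rabbitmqDown in alerting_rules.yml
- alert: rabbitmqQueueBacklog
  expr: rabbitmq_queue_messages > 1000
  for: 1m
  labels:
    severity: warning
  annotations:
    summary: "RabbitMQ queue backlog"
    description: "Queue {{ $labels.queue }} has held more than 1000 messages for a minute"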

Now that we have Prometheus generating Alerts, it is time for AlertManager to come in:

# docker-compose.yml
alertmanager:
  image: prom/alertmanager:v0.15.3
  restart: always
  volumes:
    - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

Using the official AlertManager Docker image, we can quickly configure a container that receives alerts from Prometheus and triggers Slack notifications:

# alertmanager.yml
route:
  group_by: ['instance', 'severity']
  routes:
  - match:
      alertname: rabbitmqDown
  receiver: 'tranque-slack-hook'

receivers:
- name: 'tranque-slack-hook'
  slack_configs:
  - api_url: "https://hooks.slack.com/services/your-slack-hook"
    title: "{{ .CommonAnnotations.summary }}"
    title_link: ""
    text: "RabbitMQ server got down for 10 seconds"

Given the alert called rabbitmqDown (defined in the alerting_rules.yml file), we can trigger a Slack notification through the receiver declared as tranque-slack-hook. That receiver configuration needs a Slack webhook URL, plus the title and text of the message to send to our Slack channel; here the title reuses the alert's summary annotation.
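
To test the Slack route without actually stopping RabbitMQ, you can push a synthetic alert straight into AlertManager's v1 API (again assuming its port 9093 is published to the host):

# fire a fake rabbitmqDown alert; AlertManager should route it to the tranque-slack-hook receiver
curl -s -XPOST -H 'Content-Type: application/json' \
  http://localhost:9093/api/v1/alerts \
  -d '[{"labels": {"alertname": "rabbitmqDown", "severity": "critical", "instance": "manual-test"},
       "annotations": {"summary": "RabbitMQ is Down", "description": "manual test alert"}}]'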

Finally, we can set up an Unsee container as a real-time alert UI that aggregates all the alerts currently held by AlertManager:

unsee:
  image: cloudflare/unsee:latest
  restart: always
  environment:
    ALERTMANAGER_URI: http://alertmanager:9093
  ports:
    - 8080:8080

Since unsee's port 8080 is published to the host, we can open http://localhost:8080 and see an interface listing all active alerts with their corresponding descriptions and labels.
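
By default the stack only publishes Unsee's port 8080. If you also want to reach Prometheus and AlertManager from the host, as in the curl examples above, a docker-compose.override.yml like the following would do it (a local debugging aid, not part of the original gist; docker-compose merges it automatically):

# docker-compose.override.yml (hypothetical, local debugging only)
version: "3.4"
services:
  prometheus:
    ports:
      - 9090:9090
  alertmanager:
    ports:
      - 9093:9093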

Complete files

For reference, here are the full configuration files from this gist:

# alerting_rules.yml
# alert rules (to be used by AlertManager)
groups:
  - name: alerting_rules
    # rules evaluation period
    interval: 1s
    rules:
      # rabbitmq metrics simple rule
      - alert: rabbitmqDown
        # rabbitmq is not running
        expr: rabbitmq_up == 0
        # during 10 seconds
        for: 10s
        labels:
          # sounds important
          severity: critical
        annotations:
          summary: "RabbitMQ is Down"
          description: "RabbitMQ is so dead"

# alertmanager.yml
global:
  resolve_timeout: 1m
route:
  group_by: ['instance', 'severity']
  # wait to send notification
  group_wait: 1s
  # wait to resend notification
  repeat_interval: 1h
  routes:
  - match:
      # alert name defined in alerting_rules.yml
      alertname: rabbitmqDown
  receiver: 'tranque-slack-hook'
receivers:
# slack hook configuration
- name: 'tranque-slack-hook'
  slack_configs:
  - api_url: "https://hooks.slack.com/services/your-slack-hook"
    title: "{{ .CommonAnnotations.summary }}"
    title_link: ""
    text: "RabbitMQ server got down for 10 seconds"
version: "3.4"
services:
# set up any service to be monitored using prometheus. I chose rabbitmq with no particular reason
rabbitmq:
image: rabbitmq:3.7.8-management-alpine
restart: always
# export rabbitmq metrics to prometheus
rabbitmq-exporter:
image: kbudde/rabbitmq-exporter:v0.29.0
restart: always
environment:
RABBIT_URL: "http://rabbitmq:15672"
# prometheus server
prometheus:
image: prom/prometheus:v2.6.0
restart: always
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
# alertmanager server
alertmanager:
image: prom/alertmanager:v0.15.3
restart: always
volumes:
- ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
# alertmanager UI
unsee:
image: cloudflare/unsee:latest
restart: always
environment:
ALERTMANAGER_URI: http://alertmanager:9093
ALERTMANAGER_PROXY: true
ALERTMANAGER_INTERVAL: 5s
ports:
- 8080:8080

# prometheus.yml
# set a label on this prometheus instance
global:
  external_labels:
    monitor: 'tranque-monitor'
# connect to AlertManager
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 'alertmanager:9093'
# include alert rules (to be used by AlertManager)
rule_files:
  - "/etc/prometheus/alerting_rules.yml"
# read metrics from several sources
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'rabbitmq-test'
    scrape_interval: 1s
    metrics_path: /metrics
    static_configs:
      - targets: ['rabbitmq-exporter:9090']