
@nodox nodox/0_monitoring.md
Last active Oct 2, 2018

Observability - How to monitor your applications

Monitoring applications with Prometheus

Today you are going to dive deep into observability and build a monitoring solution for your container environment using Prometheus and Grafana.

What is Observability?

You know what really grinds my gears? Incomplete software tutorials. Everyone wants to teach the world about that shiny new tool or that brand-new JavaScript framework to build the application of your dreams. Developers will often write these elaborate tutorials to teach you how to write feature X or implement functionality Y. The tutorial then ends with a disclaimer that you wouldn't want to run this in production. Or, my favorite, they tell you how to deploy the application but not how to maintain the new piece of software you just released into the wild. I mean, how can you be so sure the new piece of software you released won't take down us-east-1, AWS's Northern Virginia region, AKA the cloud? That's right: you're not sure. Well, today that's all going to change.

In this tutorial we're going to look at things from the post-production world. We're going to build a monitoring solution that lets us see what our applications are doing and what resources they consume once they are deployed to production. Seeing what a running system is doing from the outside is where the term observability comes from.

Observability is a measure of how well the internal states of a system can be inferred from knowledge of its external outputs. The more information we get about the internal states of the system, the more visibility we have into it, and the more observable the system becomes. There are 4 pillars of observability:

  • Metrics aggregation
  • Log aggregation
  • Tracing
  • Visualization and Alerting

When we measure them, we get a better picture of what our systems are doing in production. You might be asking: why should I care? Well, with these principles you can do a couple of interesting things when your system is in an undesired state. You can:

  • visualize the state of your systems and alert when they go into an undesired state
  • improve your disaster recovery efforts during on-call incidents
  • measure how reliable your systems are, and determine with great confidence when you can take your system down for planned downtime

You're not limited to these benefits, but you can surely see how knowing more about what your system is doing while it's live can give you the confidence to take better risks.

Now, to increase visibility you need the proper tools for each of those pillars. Today we're going to focus on the metrics collection and visualization components of observability. Let's get started with Prometheus.

Working with Prometheus

Prometheus is an open-source time-series database, written in Go, used for metric collection. Prometheus has become highly popular thanks to its multi-dimensional data model and its powerful query language (PromQL), among other things. Its collection model differs from most metric collection systems because it relies on pulling: Prometheus scrapes metrics from targets over HTTP at a configured interval, instead of waiting for applications to push them. Prometheus also integrates with other services to complete its monitoring infrastructure.
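A minimal sketch of how that pull model is configured, assuming the `prometheus.yml` that the compose file below mounts into `/etc/prometheus/`. The 15-second intervals are illustrative choices, not values taken from the original setup:

```yaml
# ./prometheus/prometheus.yml (sketch; mounted at /etc/prometheus/prometheus.yml)
global:
  scrape_interval: 15s      # how often Prometheus pulls /metrics from every target
  evaluation_interval: 15s  # how often recording and alerting rules are evaluated

scrape_configs:
  # Prometheus exposes its own metrics, so the simplest target is itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
```

Every `job_name` becomes a `job` label on the scraped series and every entry in `targets` becomes an `instance` label, which is part of what makes the data model multi-dimensional.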

Prometheus components include:

  • the main Prometheus server which scrapes and stores time series data
  • client libraries for instrumenting application code
  • a push gateway for supporting short-lived jobs
  • special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
  • an alertmanager to handle alerts
  • various support tools

Monitoring with Prometheus in this setup means every component runs as a container, you get alerting and visualization services, and you get a large ecosystem of exporters to collect metrics from commonly used applications and databases (PostgreSQL, MongoDB, NGINX, AWS, and more). Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It stores all scraped samples locally and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API consumers can be used to visualize the collected data.
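Here is a minimal sketch of a recording rule, the "aggregate and record new time series" part. The file name `rules.yml` and the rule name are hypothetical; the expression assumes node exporter v0.16 or later (the version in the compose file below), which reports CPU time as `node_cpu_seconds_total`:

```yaml
# ./prometheus/rules.yml (hypothetical file name)
# Load it from prometheus.yml (paths are relative to that file's directory):
#   rule_files:
#     - 'rules.yml'
groups:
  - name: host_recording_rules
    rules:
      # Precompute per-instance CPU utilisation (%) as a new time series,
      # so dashboards can query it cheaply instead of recomputing it.
      - record: instance:node_cpu_utilisation:percent
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
```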

Outline.

  • Talk about Prometheus and its architecture
  • Show the docker-compose file
  • Walk through each section piece by piece

======

  • It's never about release strategies or what a production environment for the tutorial's product looks like
  • And I know for a fact we don't even begin to talk about how to monitor the tutorial app we release into Northern Virginia, i.e. the internet, or the cloud
    • SpongeBob imagination meme pic with the caption "the cloud"
  • Well, I'm here to take you down the road less traveled. We're going to talk about how to monitor the bad pieces of software we release into the world. 'Cause c'mon, you know every line of code you wrote was shitty, especially if you're working with JavaScript, where, for better or worse, the ecosystem changes every second
  • We are going to build a kickass operational monitoring stack that will give you visibility into what your app is doing and how it's behaving. No more push-and-pray deployments.
  • We're going to deploy a Grafana and Prometheus stack to collect metrics and visualize the data we get from our apps and servers. Let's get started

Let’s collect metrics with Prometheus

  • OK, time to talk about the M word. No, not money, you greedy bastard. I'm talking metrics. To monitor an app or an API, you need to query that thing for metrics and then store them for later analysis. For this use case we will use Prometheus.

Let’s monitor some things using metrics - exporters

  • Machine/Host Metrics - Node exporter
    • OK, so first and foremost we need to collect metrics from the computer nodes in our closet
    • This tells us how exhausted our machines' resources (CPU, memory, disk, network) are
  • Container metrics - cAdvisor exporter
    • Now we can collect container metrics: cAdvisor analyzes and exposes resource usage and performance data from running containers
  • GitHub repo metrics - GitHub exporter
    • For added fun, let's monitor something more interesting, like metrics from a GitHub repo
    • This exporter lets us monitor, for example, repo traffic, stars, issues, and pull requests
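A sketch of the scrape jobs for these three exporters, assuming they run as the `nodeexporter`, `cadvisor`, and `github-exporter` services from the compose file below. Containers on the shared `monitor-net` network can reach each other by service name, and the ports are the ones each service `expose`s. Whether the GitHub exporter serves its metrics on the default `/metrics` path is an assumption:

```yaml
# scrape_configs entries in prometheus.yml (sketch)
scrape_configs:
  - job_name: 'nodeexporter'       # host metrics
    static_configs:
      - targets: ['nodeexporter:9100']

  - job_name: 'cadvisor'           # container metrics
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'github-exporter'    # repo metrics
    # assumption: metrics are served on the default path, /metrics
    static_configs:
      - targets: ['github-exporter:5001']
```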

Is Production still running? - Alerting

  • It's not enough to collect metrics; you have to be alerted when something goes wrong. Did the last deploy kill your site? You should know about it immediately
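As a minimal sketch, assuming a rules file loaded through `rule_files` in `prometheus.yml`, an alert on the built-in `up` metric catches any target Prometheus can no longer scrape. The alert name, the one-minute `for` window, and the `severity` label are illustrative choices:

```yaml
# Alerting rule (sketch); lives in a rules file listed under rule_files
groups:
  - name: availability
    rules:
      - alert: TargetDown
        expr: up == 0            # the most recent scrape of this target failed
        for: 1m                  # only fire if it stays down for a full minute
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"

# prometheus.yml also has to know where Alertmanager lives
# (the alertmanager service in the compose file below, port 9093):
# alerting:
#   alertmanagers:
#     - static_configs:
#         - targets: ['alertmanager:9093']
```

Prometheus only decides that an alert is firing; Alertmanager then deduplicates, groups, and routes the notification (Slack, email, PagerDuty, and so on) according to its own `config.yml`.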

What about short lived jobs - Pushgateway

  • an intermediary service which allows you to push metrics from jobs which cannot be scraped.
  • You might not need this right now, but let's say you have short-lived jobs, like a Lambda function that fires every time a user buys a product.
  • You still want those metrics even though the job doesn't have a server endpoint to query.
  • That's where we use the Pushgateway. Instead of relying on Prometheus's normal pull architecture, the job pushes its metrics to the Pushgateway as it finishes; the Pushgateway holds on to them, and Prometheus scrapes the Pushgateway
  • Trust me, you'll thank me later. Plus, it's no extra work with Docker Compose
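On the Prometheus side, the Pushgateway is just one more scrape target. A minimal sketch, assuming the `pushgateway` service from the compose file: `honor_labels: true` keeps the `job` and `instance` labels the pushing job attached, instead of renaming them to `exported_job`/`exported_instance` and applying the Pushgateway's own target labels. The `curl` line in the comment follows the Pushgateway's push URL scheme; the metric and job names in it are hypothetical:

```yaml
# scrape_configs entry in prometheus.yml (sketch)
scrape_configs:
  - job_name: 'pushgateway'
    honor_labels: true   # keep labels pushed by the jobs rather than overwriting them
    static_configs:
      - targets: ['pushgateway:9091']

# A short-lived job on monitor-net could push a sample like this
# (hypothetical metric and job names):
#   echo "purchases_processed 1" | curl --data-binary @- http://pushgateway:9091/metrics/job/checkout
```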
version: '3.7'

networks:
  monitor-net:
    driver: bridge

volumes:
  prometheus_data: {}
  grafana_data: {}

services:

  # Prometheus server: scrapes the targets and stores the samples
  prometheus:
    image: prom/prometheus:v2.3.2
    container_name: prometheus
    volumes:
      - ./prometheus/:/etc/prometheus/
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    restart: unless-stopped
    expose:
      - 9090
    ports:
      - "9090:9090"
    networks:
      - monitor-net
    labels:
      org.label-schema.group: "monitoring"
    depends_on:
      - influxdb

  # Alertmanager: receives alerts from Prometheus and routes notifications
  alertmanager:
    image: prom/alertmanager:v0.15.1
    container_name: alertmanager
    volumes:
      - ./alertmanager/:/etc/alertmanager/
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
    restart: unless-stopped
    expose:
      - 9093
    networks:
      - monitor-net
    labels:
      org.label-schema.group: "monitoring"

  # Node exporter: host-level metrics (CPU, memory, disk, network)
  nodeexporter:
    image: prom/node-exporter:v0.16.0
    container_name: nodeexporter
    user: root
    privileged: true
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped
    expose:
      - 9100
    networks:
      - monitor-net
    labels:
      org.label-schema.group: "monitoring"

  # GitHub exporter: metrics for the repository named in REPO
  github-exporter:
    image: gatsbymanor/github-exporter:latest
    container_name: github-exporter
    tty: true
    stdin_open: true
    restart: unless-stopped
    expose:
      - 5001
    networks:
      - monitor-net
    environment:
      - REPO=gatsbyjs/gatsby
      # read the token from your shell or a .env file; never commit a real token
      - GITHUB_TOKEN=${GITHUB_TOKEN}
    labels:
      org.label-schema.group: "monitoring"

  # cAdvisor: per-container resource usage and performance metrics
  cadvisor:
    image: google/cadvisor:v0.28.3
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      #- /cgroup:/cgroup:ro  # Linux only; doesn't work on macOS
    restart: unless-stopped
    expose:
      - 8080
    networks:
      - monitor-net
    labels:
      org.label-schema.group: "monitoring"

  # Grafana: dashboards on top of the collected data
  grafana:
    image: grafana/grafana:5.2.2
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/datasources:/etc/grafana/datasources
      - ./grafana/dashboards:/etc/grafana/dashboards
      - ./grafana/setup.sh:/setup.sh
    entrypoint: /setup.sh
    environment:
      - GF_SECURITY_ADMIN_USER=${ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${ADMIN_PASSWORD:-admin}
      - GF_USERS_ALLOW_SIGN_UP=false
    restart: unless-stopped
    expose:
      - 3000
    ports:
      - "3000:3000"
    networks:
      - monitor-net
    labels:
      org.label-schema.group: "monitoring"

  # Pushgateway: keeps the last metrics pushed by short-lived jobs so Prometheus can scrape them
  pushgateway:
    image: prom/pushgateway
    container_name: pushgateway
    restart: unless-stopped
    expose:
      - 9091
    networks:
      - monitor-net
    labels:
      org.label-schema.group: "monitoring"

  # InfluxDB 1.6 with a database named "prometheus"
  influxdb:
    image: influxdb:1.6
    volumes:
      - ./influxdb/:/etc/influxdb/
    environment:
      - INFLUXDB_DB=prometheus
      - INFLUXDB_HTTP_AUTH_ENABLED=false
    ports:
      - "8086:8086"
      - "8083:8083"
    tty: true
    networks:
      - monitor-net
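The compose file makes Prometheus depend on an InfluxDB 1.6 container that creates a database named `prometheus`, but the text doesn't say how the two are wired together. One plausible wiring, and it is an assumption, is Prometheus's remote storage integration: InfluxDB 1.x (from 1.4 onward) exposes Prometheus remote write and remote read endpoints, so a sketch of the matching `prometheus.yml` section would be:

```yaml
# prometheus.yml (sketch): keep local TSDB storage and also ship samples to InfluxDB
remote_write:
  - url: 'http://influxdb:8086/api/v1/prom/write?db=prometheus'
remote_read:
  - url: 'http://influxdb:8086/api/v1/prom/read?db=prometheus'
```

No credentials are needed here only because the compose file sets `INFLUXDB_HTTP_AUTH_ENABLED=false`.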