@eheydrick
Last active August 8, 2018 17:34
Monitoring talk

Monitoring Overview

Why monitoring

  • Distributed systems are complex, and things fail in unexpected ways
  • Monitoring gives you visibility into the system
  • Monitoring tells you stuff is broken before the customer notices

Types of monitoring

  • Blackbox monitoring - monitoring from the outside in: the customer's view of the system. e.g. is the user service up and reachable from the Internet (Monitis), can customers log in (Test Service)

  • Whitebox monitoring - monitoring from inside the system using data provided by the system. e.g. livestats, CPU load, memory usage, disk IOPS

Monitoring vs alerting vs notifying

  • Monitor everything that could break
  • Alert on things that will break or are broken but low impact
  • Notify (page) on things that are broken and have a customer impact or will break very soon.
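The three rules above can be read as a small decision function. This is only a sketch of that reading: the `Action` enum, the `triage` function, and its boolean parameters are hypothetical names introduced for illustration, not part of any tool mentioned in this talk.

```python
from enum import Enum


class Action(Enum):
    MONITOR = "monitor"  # just keep collecting data
    ALERT = "alert"      # tell the team, no page
    NOTIFY = "notify"    # page whoever is on call


def triage(broken: bool = False, customer_impact: bool = False,
           will_break: bool = False, imminent: bool = False) -> Action:
    """Map the state of a check to monitor / alert / notify (hypothetical helper)."""
    # Page: broken with customer impact, or about to break very soon.
    if (broken and customer_impact) or (will_break and imminent):
        return Action.NOTIFY
    # Alert: will break eventually, or broken but low impact.
    if broken or will_break:
        return Action.ALERT
    # Everything else is still monitored, just silently.
    return Action.MONITOR
```

For example, `triage(broken=True)` yields `Action.ALERT`, while `triage(broken=True, customer_impact=True)` yields `Action.NOTIFY`.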

What to monitor

  • Things that expire: domains, SSL certs
  • Things that can be slow or fail: latency, increase in 500 errors, exceptions
  • Things that can grow: queues, disk space
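As an illustration of the "things that expire" case, here is a minimal sketch of an expiry check for a domain or SSL cert. The function names and the 30-day / 7-day thresholds are assumptions chosen for the example, and the expiry date is passed in rather than fetched from a real certificate.

```python
from datetime import datetime, timezone


def days_until(expiry: datetime, now: datetime) -> int:
    """Whole days left before expiry (negative once it has passed)."""
    return (expiry - now).days


def expiry_status(expiry: datetime, now: datetime,
                  alert_days: int = 30, notify_days: int = 7) -> str:
    """Return 'ok', 'alert', or 'notify'; thresholds are assumed values."""
    remaining = days_until(expiry, now)
    if remaining <= notify_days:
        return "notify"  # about to expire (or already expired): page
    if remaining <= alert_days:
        return "alert"   # will be a problem soon: tell the team
    return "ok"


now = datetime(2018, 8, 8, tzinfo=timezone.utc)
cert_expiry = datetime(2018, 8, 20, tzinfo=timezone.utc)
status = expiry_status(cert_expiry, now)  # 12 days left, inside the alert window
```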

What to alert on

  • things that could be a problem or will be a problem e.g. queue is growing
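One simple way to decide that a queue "is growing" is to look at the average change between evenly spaced depth samples. The sketch below assumes that approach; `growth_per_sample`, `is_growing`, and the `min_growth` threshold are hypothetical and would need tuning for a real queue.

```python
from typing import Sequence


def growth_per_sample(depths: Sequence[float]) -> float:
    """Average change in queue depth between consecutive samples."""
    if len(depths) < 2:
        return 0.0
    return (depths[-1] - depths[0]) / (len(depths) - 1)


def is_growing(depths: Sequence[float], min_growth: float = 10.0) -> bool:
    """True when the queue grows faster than min_growth per sample (assumed threshold)."""
    return growth_per_sample(depths) > min_growth


recent_depths = [100, 120, 150, 190]  # e.g. one sample per minute
should_alert = is_growing(recent_depths)
```

Using the first and last sample keeps the check cheap and ignores short spikes in the middle of the window; a fitted slope would be a more robust alternative.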

What to notify on

  • things that are currently a problem that could affect customers e.g. queue is really big, events aren't getting ingested, customers can't log in
  • what not to notify on: a service is down on a single host, anything that isn't directly customer impacting (sleep is good), CPU, memory, network utilization (usually)
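To avoid paging when a service is down on a single host, one option is to page only once a meaningful share of hosts is failing. This is a sketch under that assumption: `should_page` and the 50% default are illustrative, not how any particular tool decides.

```python
from typing import Mapping


def should_page(host_ok: Mapping[str, bool], max_down_fraction: float = 0.5) -> bool:
    """Page only when at least max_down_fraction of hosts fail the check."""
    if not host_ok:
        return False  # no data: leave it to a separate "no hosts reporting" check
    down = sum(1 for ok in host_ok.values() if not ok)
    return down / len(host_ok) >= max_down_fraction


# One host out of four is down: worth an alert, not a page.
page = should_page({"web1": True, "web2": False, "web3": True, "web4": True})
```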

What metrics to collect

  • anything that moves or could move in the future

How we do monitoring + metrics

  • Whitebox monitoring: Sensu
  • Blackbox monitoring: Monitis

Components

  • Sensu clients + servers
  • Uchiwa (Sensu UI)
  • Grafana (Dashboards)
  • Monitis (External service checks)
  • Telegraf (Metrics agent)
  • OpsGenie (On-call paging)

Metrics

  • Metrics collected with Sensu and Telegraf
  • Stored in InfluxDB
  • Accessed with Grafana
  • Do some alerting based on data in InfluxDB, e.g. timing
  • Have CloudWatch metrics in Grafana e.g. RDS metrics, ALB metrics
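For a sense of what a metric looks like on its way into InfluxDB, the sketch below formats a single point in InfluxDB line protocol (`measurement,tags fields timestamp`). The helper names, the `queue_depth` measurement, and the tag values are made up for the example; in practice an agent such as Telegraf produces these lines for you.

```python
from typing import Mapping, Union

FieldValue = Union[bool, int, float]


def _escape(text: str, chars: str = ", =") -> str:
    """Backslash-escape the characters line protocol treats as separators."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value: FieldValue) -> str:
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integers carry an 'i' suffix
    if isinstance(value, float):
        return repr(value)
    raise TypeError(f"unsupported field type: {type(value).__name__}")


def to_line(measurement: str, tags: Mapping[str, str],
            fields: Mapping[str, FieldValue], timestamp_ns: int) -> str:
    """Format one point as a line-protocol string (numeric and bool fields only)."""
    if not fields:
        raise ValueError("at least one field is required")
    tag_part = "".join(f",{_escape(k)}={_escape(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={_format_field(v)}" for k, v in fields.items())
    # Measurement names only need commas and spaces escaped.
    return f"{_escape(measurement, ', ')}{tag_part} {field_part} {timestamp_ns}"


line = to_line("queue_depth", {"host": "web1", "queue": "ingest"},
               {"value": 42}, 1533749640000000000)
```

Here `line` is `queue_depth,host=web1,queue=ingest value=42i 1533749640000000000`. String fields are left out to keep the sketch short.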