eheydrick/monitoring.md

## monitoring.md

      
Monitoring Overview

Why monitoring


Distributed systems are complex, and things fail in unexpected ways
Monitoring gives you visibility into the system
Monitoring tells you stuff is broken before the customer notices

Types of monitoring


Blackbox monitoring - monitoring from the outside in: the customer's view of the system. e.g. is the user service up and reachable from the Internet (Monitis), can customers log in (Test Service)


Whitebox monitoring - monitoring from inside the system using data provided by the system. e.g. livestats, CPU load, memory usage, disk IOPS
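
To make the distinction concrete, here is a minimal Python sketch of both styles. It is an illustration only, not how Monitis or Sensu implement their checks: `blackbox_check` and `whitebox_snapshot` are hypothetical names, and the URL in the usage note is a placeholder.

```python
import http.client
import os
import shutil
import urllib.error
import urllib.request


def blackbox_check(url, timeout=5.0):
    """Probe an endpoint from the outside, the way a customer would.

    Returns True only if the request completes with a 2xx/3xx status.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, HTTP 4xx/5xx, or a malformed URL.
        return False


def whitebox_snapshot(path="/"):
    """Read numbers the host itself exposes: load average and disk usage."""
    usage = shutil.disk_usage(path)
    load_1m, _, _ = os.getloadavg()  # Unix only
    used_pct = 100.0 * usage.used / usage.total if usage.total else 0.0
    return {"load_1m": load_1m, "disk_used_pct": used_pct}
```

For example, `blackbox_check("https://example.com/login")` would run from a machine outside the production network, while `whitebox_snapshot()` runs on the host being monitored.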


Monitoring vs alerting vs notifying


Monitor everything that could break
Alert on things that will break, or that are broken but low impact
Notify (page) on things that are broken and have a customer impact or will break very soon.
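
The three tiers above can be sketched as a small decision function. This is one possible reading of the rules, not a description of any tool's behavior; the `Action` enum, the `decide` function, and its flag names are all hypothetical.

```python
from enum import Enum


class Action(Enum):
    MONITOR = "monitor"  # record it, nobody is told
    ALERT = "alert"      # surface it without paging anyone
    NOTIFY = "notify"    # page the on-call engineer


def decide(broken, will_break, customer_impact, imminent=False):
    """Map the state of a monitored thing onto one of the three tiers."""
    # Page: broken with customer impact, or about to break very soon.
    if (broken and customer_impact) or (will_break and imminent):
        return Action.NOTIFY
    # Alert: will break eventually, or is broken but low impact.
    if broken or will_break:
        return Action.ALERT
    # Everything else is still monitored, just quietly.
    return Action.MONITOR
```

A failed service on a single host with no customer impact, `decide(broken=True, will_break=False, customer_impact=False)`, comes out as `Action.ALERT` rather than a page.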

What to monitor


Things that expire: domains, SSL certs
Things that can be slow or error out: latency, increase in 500 errors, exceptions
Things that can grow: queues, disk space
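
As an example of monitoring something that expires, the sketch below computes how many days remain on an SSL certificate using only the Python standard library. The function names are hypothetical, and `fetch_not_after` needs network access to the host being checked.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time.

    not_after uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400.0


def fetch_not_after(host, port=443, timeout=5.0):
    """Fetch the notAfter field from a live server's certificate.

    An already expired certificate fails verification and raises
    ssl.SSLCertVerificationError instead of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A check built on this might alert when `days_until_expiry(fetch_not_after(host))` drops below some threshold such as 30 days; the threshold is an assumption, not something the notes specify.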

What to alert on


things that could be a problem or will be a problem, e.g. a queue is growing

What to notify on


things that are currently a problem that could affect customers, e.g. queue is really big, events aren't getting ingested, customers can't log in
what not to notify on: a service down on a single host; anything that isn't directly customer impacting (sleep is good); CPU, memory, or network utilization (usually)
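
The queue example can be sketched as follows: a growing queue is worth an alert, while a really big one is worth a page. The function name and the `page_depth` threshold are hypothetical; real values depend on how fast the queue normally drains.

```python
def queue_action(depths, page_depth=10_000):
    """Decide what to do about a queue from recent depth samples.

    depths: queue depth samples, oldest first.
    page_depth: depth at which the queue counts as "really big" (assumed value).
    """
    if not depths:
        return "monitor"
    current = depths[-1]
    if current >= page_depth:
        return "notify"  # currently a problem that could affect customers
    if len(depths) >= 2 and current > depths[0]:
        return "alert"   # net growth over the window: could become a problem
    return "monitor"
```

With these assumed thresholds, `queue_action([100, 250, 600])` returns `"alert"` and `queue_action([9000, 12000])` returns `"notify"`.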

What metrics to collect


anything that moves or could move in the future

How we do monitoring + metrics


Whitebox monitoring: Sensu
Blackbox monitoring: Monitis

Components


Sensu clients + servers
Uchiwa (Sensu UI)
Grafana (Dashboards)
Monitis (External service checks)
Telegraf (Metrics agent)
OpsGenie (On-call paging)
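
Sensu checks are ordinary executables that report their result through the exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, and anything else for UNKNOWN. A minimal disk usage check in that style might look like the sketch below; the function names and the 80/90 percent thresholds are assumptions, not values taken from this setup.

```python
import shutil

# Exit codes understood by Sensu (the same convention as Nagios plugins).
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_disk(used_pct, warn=80.0, crit=90.0):
    """Turn a disk usage percentage into a (status, message) pair."""
    if used_pct >= crit:
        return CRITICAL, f"CRITICAL: disk {used_pct:.1f}% used"
    if used_pct >= warn:
        return WARNING, f"WARNING: disk {used_pct:.1f}% used"
    return OK, f"OK: disk {used_pct:.1f}% used"


def run_check(path="/"):
    """Measure disk usage for path, print one status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
        used_pct = 100.0 * usage.used / usage.total
    except (OSError, ZeroDivisionError) as exc:
        print(f"UNKNOWN: could not read disk usage for {path}: {exc}")
        return UNKNOWN
    status, message = evaluate_disk(used_pct)
    print(message)
    return status
```

Saved as a script, it would end with `sys.exit(run_check())` so that the Sensu client sees the status as the process exit code.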

Metrics


Metrics collected with Sensu and Telegraf
Stored in InfluxDB
Accessed with Grafana
Do some alerting based on data in InfluxDB, e.g. timing
Have CloudWatch metrics in Grafana, e.g. RDS metrics, ALB metrics
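
For a sense of what ends up in InfluxDB, the sketch below formats a single data point in InfluxDB line protocol (`measurement,tags fields timestamp`). Telegraf normally produces this itself, so the code is purely illustrative; the helper names and the measurement, tag, and field names in the usage note are hypothetical.

```python
def _escape(value, chars):
    """Backslash-escape each character in chars."""
    for ch in chars:
        value = value.replace(ch, "\\" + ch)
    return value


def _format_field(value):
    """Render a field value using line protocol type rules."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"       # integers carry an "i" suffix
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'        # strings are double-quoted


def to_line(measurement, tags, fields, timestamp_ns):
    """Build one line protocol record; timestamp_ns is nanoseconds since the epoch."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    head = _escape(measurement, ", ")
    for key in sorted(tags):     # sorted tag keys are recommended for write performance
        head += "," + _escape(key, ",= ") + "=" + _escape(str(tags[key]), ",= ")
    body = ",".join(
        f"{_escape(key, ',= ')}={_format_field(value)}" for key, value in fields.items()
    )
    return f"{head} {body} {timestamp_ns}"
```

For instance, `to_line("queue_depth", {"host": "web1", "queue": "ingest"}, {"value": 42}, 1500000000000000000)` yields `queue_depth,host=web1,queue=ingest value=42i 1500000000000000000`.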