coldclimate/monitoring-alerting-mvp.markdown

Monitoring and Alerting Minimum Viable Product

A checklist for those attempting to get out of bed only when it's important, and to be able to debug critical and non-critical issues.
Those emphasised should probably get you out of bed when they're too high/low/gone.
This is super opinionated but I welcome feedback.  It's biased towards retrofitting/cleaning up/brownfield type work because that's what I know best.
HTTP(S) Services

What you're serving, how much of it and how fast

Inbound traffic volume ("requests") (too low)


Group by function (subdomain/high level URL)


Group by source


Ideally both of the above


Outbound traffic volume ("responses")


Group by function (subdomain/high level URL) if possible


Group by HTTP status code groups


Good traffic: 2xx (too low)


Specific traffic 3xx, 4xx


Bad traffic 5xx (too high)


Response times


Group by function (subdomain/high level URL) if possible


HTTP/HTTPS split ratio
External synthetic user journey test, e.g. from outside of your infrastructure, test your service like a user will use it (too slow, too broken)
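
A minimal sketch of the grouping described above: count responses per function and per status code group for one time window, then flag "traffic too low" and "5xx too high". The helper names (`function_of`, `summarise`), the input shape (pairs of URL and status) and the threshold values are all assumptions made for illustration, not something this checklist prescribes.

```python
from collections import Counter
from urllib.parse import urlsplit


def status_class(status: int) -> str:
    """Map an HTTP status code to its group, e.g. 503 -> '5xx'."""
    return f"{status // 100}xx"


def function_of(url: str) -> str:
    """Use the host plus the first path segment as a rough 'function' label."""
    parts = urlsplit(url)
    first_segment = parts.path.strip("/").split("/")[0]
    return f"{parts.netloc}/{first_segment}"


def summarise(window, min_requests=100, max_5xx_ratio=0.05):
    """Count responses per (function, status group) and list alert conditions.

    `window` is an iterable of (url, status) pairs seen in one time window.
    The default thresholds are placeholders, not recommendations.
    """
    window = list(window)
    counts = Counter(
        (function_of(url), status_class(status)) for url, status in window
    )
    total = len(window)
    bad = sum(n for (_, group), n in counts.items() if group == "5xx")

    alerts = []
    if total < min_requests:
        alerts.append("inbound traffic too low")
    if total and bad / total > max_5xx_ratio:
        alerts.append("5xx rate too high")
    return counts, alerts
```

In practice the same grouping would usually be done inside your metrics system rather than in application code; the point is only that function and status group are the two labels worth having on every request counter.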

Web Servers

Apache


Total connections
Worker statuses


Idle


Reading


Sending


Waiting
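
One way to get the worker statuses above is to parse the `Scoreboard:` line from the machine-readable output of Apache's mod_status. The sketch below assumes you have already fetched that text somehow; the function name and the state labels are mine. How the checklist's "Idle" and "Waiting" map onto the scoreboard characters is also an assumption: mod_status uses `_` for a worker waiting for a connection and `.` for an open slot with no process.

```python
from collections import Counter

# Scoreboard characters as listed in the mod_status key.
SCOREBOARD_KEY = {
    "_": "waiting",       # waiting for a connection
    "S": "starting",      # starting up
    "R": "reading",       # reading the request
    "W": "sending",       # sending the reply
    "K": "keepalive",     # keepalive (read)
    "D": "dns_lookup",
    "C": "closing",       # closing the connection
    "L": "logging",
    "G": "finishing",     # gracefully finishing
    "I": "idle_cleanup",  # idle cleanup of a worker
    ".": "open_slot",     # open slot with no current process
}


def worker_statuses(status_text: str) -> Counter:
    """Count workers per state from mod_status machine-readable text."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            return Counter(SCOREBOARD_KEY.get(ch, "unknown") for ch in board)
    raise ValueError("no Scoreboard line found")
```

Graphing these counts over time shows whether workers are mostly stuck reading slow clients or sending slow responses, which is usually more useful than the total alone.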


Nginx


Specific metrics
That should be monitored
When using
Nginx

Load Balancers

What's coming in and how well we're handing it off.  You might also monitor your overall HTTP(S) Services metrics from your load balancer, but in addition to those...

Status of backend pool members
Response rates from pool members
Traffic levels to each pool member
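
As a rough illustration of the three pool metrics above, the sketch below takes one snapshot per pool member and derives health, traffic share and error rate. The `Member` fields, the `pool_report` name and the 50% threshold are hypothetical, and reading "response rates" as an error rate per member is my interpretation; where the numbers come from depends on your load balancer.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    healthy: bool   # result of the balancer's health check
    requests: int   # requests handed to this member in the window
    errors: int     # responses from this member counted as failures


def pool_report(members, min_healthy_fraction=0.5):
    """Return per-member stats and whether the pool looks degraded."""
    total = sum(m.requests for m in members)
    report = {}
    for m in members:
        report[m.name] = {
            "healthy": m.healthy,
            # Share of all pool traffic that went to this member.
            "traffic_share": m.requests / total if total else 0.0,
            # Failed responses as a fraction of this member's requests.
            "error_rate": m.errors / m.requests if m.requests else 0.0,
        }
    healthy = sum(1 for m in members if m.healthy)
    # An empty pool counts as degraded.
    degraded = not members or healthy / len(members) < min_healthy_fraction
    return report, degraded
```

A member with a traffic share far from its peers, or an error rate that stands out, is often the first sign of trouble, before the health check marks it down.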

HAProxy


Specific metrics
That should be monitored
When using
HAProxy

Nginx


Specific metrics
That should be monitored
When using
Nginx as a load balancer

Databases


Query volume
Query response times


Grouped by query pattern


Top queries


By frequency


By returned volume


By execution length (slow queries)


Replication statistics
Table Statistics


Read volumes


Write volumes
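
"Grouped by query pattern" usually means stripping the literals out of each query so that `id = 1` and `id = 2` count as the same thing. A minimal sketch of that idea follows, with the top-queries rankings built on top. The regexes are deliberately naive, the `(sql, seconds, rows_returned)` input shape is an assumption, and "execution length" is ranked here by total time per pattern rather than by slowest single query.

```python
import re
from collections import defaultdict

_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")   # single-quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")   # standalone numeric literals


def query_pattern(sql: str) -> str:
    """Replace literals with '?', collapse whitespace and lowercase."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return " ".join(sql.split()).lower()


def top_queries(samples, limit=3):
    """Rank query patterns from (sql, seconds, rows_returned) samples."""
    stats = defaultdict(lambda: {"count": 0, "seconds": 0.0, "rows": 0})
    for sql, seconds, rows in samples:
        entry = stats[query_pattern(sql)]
        entry["count"] += 1
        entry["seconds"] += seconds
        entry["rows"] += rows

    def ranked(key):
        # Patterns ordered from highest to lowest value of `key`.
        return sorted(stats, key=lambda p: stats[p][key], reverse=True)[:limit]

    return {
        "by_frequency": ranked("count"),
        "by_returned_volume": ranked("rows"),
        "by_execution_time": ranked("seconds"),
    }
```

Most databases or their tooling can produce this kind of digest for you; the sketch is only meant to show what the grouping does.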


MySQL


Specific metrics
That should be monitored
When using
MySQL

Queues

RabbitMQ


Specific metrics
That should be monitored
When using
RabbitMQ