The CTL Surveillance State

(for servers and applications)

Metrics and Monitoring at CTL

Once upon a time, we had a couple of Dell servers living in the Mezzanine. All of our applications just ran on them.

We had a wiki page, the "Master Server Grid", to keep track of which applications were running on which server, and so on:

[!01_master_server_grid.png]

Then came virtual machines. One or more virtual machines run on top of a physical server, and one or more applications run on each virtual machine. So the Master Server Grid had to keep track of those as well.

[!02_master_server_grid.png]

Then LITO was running some of our VMs for us on their hardware, so we needed to keep track of those.

Then LITO didn't want to give us more VMs, so we started running some on Linode. We needed to keep track of those.

The wiki page required a lot of manual cross-referencing for common tasks, and became difficult to update, hence increasingly out of date and unreliable.

So I wrote Plexus as a database to keep track of our servers, applications, and aliases.

[!03_plexus_servers.png] [!04_plexus_aliases.png] [!05_plexus_applications.png]
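
Under the hood there isn't much to it. Assuming a Django implementation like our other apps (the field names below are guesses for illustration, not the actual Plexus schema), the data model is roughly a handful of models like this:

```python
# A rough sketch of the kind of Django models behind Plexus.
# Field names are illustrative guesses, not the actual schema.
from django.db import models


class Server(models.Model):
    name = models.CharField(max_length=256)
    ip_address = models.GenericIPAddressField()
    notes = models.TextField(blank=True)


class Application(models.Model):
    name = models.CharField(max_length=256)
    server = models.ForeignKey(Server, on_delete=models.CASCADE)


class Alias(models.Model):
    # e.g. a public hostname pointing at an application
    hostname = models.CharField(max_length=256)
    application = models.ForeignKey(Application, on_delete=models.CASCADE)
```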

If you looked at a single server, you could see its basic info and a list of the aliases/applications associated with it:

[!06_plexus_server.png]

While we were at it, now that we had a proper application there, we set it up so that adding a new alias would send the email to hostmaster for us with all of the correct details in place (no more copy-and-paste errors).

[!07_plexus_new_alias.png]

And we now had a place to add notes as we made changes to servers:

[!08_plexus_notes.png]

Super useful later on when debugging an issue or recreating a server elsewhere.

Around this time, LITO set up an instance of Graphite and gave us access to it. Graphite is a simple time-series database that stores metrics for you.
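
Getting a metric into Graphite is about as simple as it gets: its carbon listener accepts plaintext "name value timestamp" lines over a TCP socket (port 2003 by default). A minimal sketch, with a made-up hostname and metric path:

```python
import socket
import time

# Carbon, Graphite's ingestion daemon, accepts plaintext
# "metric_path value timestamp\n" lines, by default on TCP port 2003.
# The hostname and metric path here are made up for illustration.
CARBON_HOST = "graphite.example.com"
CARBON_PORT = 2003


def send_metric(path, value, timestamp=None):
    timestamp = timestamp or int(time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
    sock.sendall(line.encode("utf-8"))
    sock.close()


send_metric("servers.web1.loadavg", 0.42)
```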

We had been using Munin for tracking metrics on our servers. It made graphs that looked like this:

[!09_munin.png]

Useful for seeing whether systems were currently healthy, or slowly filling up over time, etc. But Munin basically only makes graphs. Graphite makes the same graphs (even better ones, actually), and it also exposes the underlying data in a way that other systems can consume.
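
That last part is the important bit. Graphite's render endpoint will hand back either a rendered PNG or the raw datapoints as JSON, so anything that can make an HTTP request can consume a metric. A quick sketch (the host and metric name are placeholders):

```python
import requests

# Graphite's render API returns raw datapoints as JSON when you ask
# for format=json; the same URL with format=png returns a graph image.
resp = requests.get(
    "http://graphite.example.com/render",
    params={
        "target": "servers.web1.loadavg",
        "from": "-24hours",
        "format": "json",
    },
)
for series in resp.json():
    # Each series looks like:
    # {"target": "...", "datapoints": [[value, timestamp], ...]}
    print(series["target"], series["datapoints"][-5:])
```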

So, first of all, that meant that we could have the graphs right there in Plexus, which really helps for investigating issues:

[!10_plexus_graphs.png]

And we can automatically generate some basic dashboards showing the status of all the servers/applications in one place:

[!11_plexus_dashboard_servers.png]

[!12_plexus_dashboard_traffic.png]

We could even put a small dashboard of basic stats directly into each of our Django apps:

[!17_pmt_stats.png]

(If you go to /stats/ on just about any of our Django apps, you can see something similar, though they are somewhat neglected at this point.)

More importantly, it meant that we could set up alerts based on the metrics. For this, I wrote a simple tool called Hound. Hound just takes a list of metrics and thresholds and sends you an email if a Graphite metric crosses its threshold. Since it alerts off a metric, it can also show you a graph of that metric so you have some context:

[!13_hound_alert.png]
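
The core idea is simple enough to sketch: pull the recent values for a metric out of Graphite, compare them against the threshold, and send mail if they're out of bounds. (This is a simplified illustration, not Hound's actual code; the real thing handles scheduling, flapping, putting the graph into the email, and so on. All of the names are made up.)

```python
import smtplib
from email.message import EmailMessage

import requests

GRAPHITE = "http://graphite.example.com/render"


def check(metric, threshold, to_addr):
    # Pull the last ten minutes of datapoints for the metric.
    resp = requests.get(GRAPHITE, params={
        "target": metric, "from": "-10min", "format": "json"})
    series = resp.json()
    if not series:
        return
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if values and max(values) > threshold:
        msg = EmailMessage()
        msg["Subject"] = "ALERT: %s over threshold (%s > %s)" % (
            metric, max(values), threshold)
        msg["From"] = "hound@example.com"
        msg["To"] = to_addr
        msg.set_content("Metric %s crossed its threshold." % metric)
        with smtplib.SMTP("localhost") as server:
            server.send_message(msg)


check("servers.web1.loadavg", 4.0, "ops@example.com")
```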

Of course, alerting on one metric isn't very interesting. What we really want is to be able to watch lots of metrics. At the moment, in Hound we have 267 different alerts. You get a nice summary view:

[!14_hound_dashboard.png]

Each of those green squares is an alert. If there's a problem, it turns red and you can click on it to see the details. The graphs below them show 24-hour and 7-day histories. If you look closely, you can see a couple of little red bits on the weekly graph where a few metrics went bad, but recently (at least when I made these screenshots) things have been calm.

We watch the usual basic server metrics like load and disk usage. Then we have an alert for every (Django) application so we know if it starts having errors or taking an unusually long time to serve responses. We also set up alerts for all kinds of miscellaneous things. Basically, any time we have an outage, as part of our post-mortem we figure out whether there were any metrics we could have been watching that would have alerted us to the problem in time to prevent the outage. If so, we add them to Hound.
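
How do per-application error and response-time numbers get into Graphite in the first place? One common pattern, sketched here as a general technique rather than a description of exactly what our apps do, is a middleware that times every request and fires the result at statsd over UDP; statsd then aggregates and forwards to Graphite:

```python
import socket
import time

# A sketch of the general technique, not necessarily what our apps do:
# time each request and send the result to statsd over UDP using its
# "name:value|type" line protocol. Host and metric names are made up.
STATSD_ADDR = ("statsd.example.com", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


class MetricsMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.time()
        response = self.get_response(request)
        elapsed_ms = int((time.time() - start) * 1000)
        # "|ms" marks a timer; statsd computes means/percentiles for us.
        sock.sendto(b"myapp.response_time:%d|ms" % elapsed_ms, STATSD_ADDR)
        if response.status_code >= 500:
            # "|c" marks a counter; one increment per server error.
            sock.sendto(b"myapp.errors:1|c", STATSD_ADDR)
        return response
```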

A few months ago, as we've been moving off of LITO's servers, we started running our own Graphite server instead of relying on theirs. One upside of doing this was that we could also set up Grafana, a popular open-source dashboard-building tool for Graphite. It makes much nicer-looking interactive graphs:

[!15_grafana_graph.png]

And it lets you put together nice dashboards (all through the web):

[!16_grafana_dashboard.png]

Staff all have access to this. If you want to put together a dashboard like this, talk to a developer.

Looking to the future, we are also now experimenting with feeding (more or less) all of our logs into ElasticSearch. Grafana can query ElasticSearch as well (though it's much more complicated), and ElasticSearch comes with Kibana, which has similarly sophisticated graph and dashboard creation functionality:

[!18_kibana.png]
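
The mechanics of getting a log line in are at least straightforward: ElasticSearch is just an HTTP API, and each log entry becomes a JSON document POSTed into an index. In practice a shipper like Logstash or Filebeat does this for you; the sketch below (with made-up host, index, and field names) just shows the shape of it:

```python
import datetime

import requests

# Each log entry becomes a JSON document POSTed into an index.
# The host, index, and field names here are made up for illustration.
doc = {
    "@timestamp": datetime.datetime.utcnow().isoformat(),
    "host": "web1",
    "app": "myapp",
    "level": "ERROR",
    "message": "something went wrong",
}
requests.post("http://elasticsearch.example.com:9200/logs/entry", json=doc)
```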

But to be honest, we are still figuring out how to use it beyond the absolute basics.
