The CTL Surveillance State

(for servers and applications)

Metrics and Monitoring at CTL

Once upon a time, we had a couple of Dell servers living in the Mezzanine. All of our applications just ran on them.

We had a wiki page, the "Master Server Grid", to keep track of which applications were running on which server, and so on:

[!01_master_server_grid.png]

Then came virtual machines. One or more virtual machines run on top of a physical server, and one or more applications run on each virtual machine. So the Master Server Grid had to keep track of those as well.

[!02_master_server_grid.png]

Then LITO was running some of our VMs for us on their hardware, so we needed to keep track of those.

Then LITO didn't want to give us more VMs, so we started running some on Linode. We needed to keep track of those.

The wiki page required a lot of manual cross-referencing for common tasks, and became difficult to update, hence increasingly out of date and unreliable.

So I wrote Plexus as a database to keep track of our servers, applications, and aliases.

[!03_plexus_servers.png] [!04_plexus_aliases.png] [!05_plexus_applications.png]
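
Under the hood there isn't much to it. Assuming a Django implementation like our other apps (the field names below are guesses for illustration, not the actual Plexus schema), the data model is roughly a handful of models like this:

```python
# A rough sketch of the kind of Django models behind Plexus.
# Field names are illustrative guesses, not the actual schema.
from django.db import models


class Server(models.Model):
    name = models.CharField(max_length=256)
    ip_address = models.GenericIPAddressField()
    notes = models.TextField(blank=True)


class Application(models.Model):
    name = models.CharField(max_length=256)
    server = models.ForeignKey(Server, on_delete=models.CASCADE)


class Alias(models.Model):
    # e.g. a public hostname pointing at an application
    hostname = models.CharField(max_length=256)
    application = models.ForeignKey(Application, on_delete=models.CASCADE)
```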

If you looked at a single server, you could see its basic info and a list of the aliases/applications associated with it:

[!06_plexus_server.png]

While we were at it, now that we had a proper application there, we set it up so that adding a new alias would send the email to hostmaster for us with all of the correct details in place (no more copy-and-paste errors).

[!07_plexus_new_alias.png]

And we now had a place to add notes as we made changes to servers:

[!08_plexus_notes.png]

Super useful later on when debugging an issue or recreating a server elsewhere.

Around this time, LITO set up an instance of Graphite and gave us access to it. Graphite is a simple time-series database that stores metrics for you.
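
Getting a metric into Graphite is about as simple as it gets: its carbon listener accepts plaintext "name value timestamp" lines over a TCP socket (port 2003 by default). A minimal sketch, with a made-up hostname and metric path:

```python
import socket
import time

# Carbon, Graphite's ingestion daemon, accepts plaintext
# "metric_path value timestamp\n" lines, by default on TCP port 2003.
# The hostname and metric path here are made up for illustration.
CARBON_HOST = "graphite.example.com"
CARBON_PORT = 2003


def send_metric(path, value, timestamp=None):
    timestamp = timestamp or int(time.time())
    line = "%s %s %d\n" % (path, value, timestamp)
    sock = socket.create_connection((CARBON_HOST, CARBON_PORT))
    sock.sendall(line.encode("utf-8"))
    sock.close()


send_metric("servers.web1.loadavg", 0.42)
```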

We had been using Munin for tracking metrics on our servers. It made graphs that looked like this:

[!09_munin.png]

Useful for seeing whether systems were currently healthy, or slowly filling up over time, etc. But Munin basically only makes graphs. Graphite makes the same graphs (even better ones, actually), and it also exposes the underlying data in a way that other systems can consume.
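
That last part is the important bit. Graphite's render endpoint will hand back either a rendered PNG or the raw datapoints as JSON, so anything that can make an HTTP request can consume a metric. A quick sketch (the host and metric name are placeholders):

```python
import requests

# Graphite's render API returns raw datapoints as JSON when you ask
# for format=json; the same URL with format=png returns a graph image.
resp = requests.get(
    "http://graphite.example.com/render",
    params={
        "target": "servers.web1.loadavg",
        "from": "-24hours",
        "format": "json",
    },
)
for series in resp.json():
    # Each series looks like:
    # {"target": "...", "datapoints": [[value, timestamp], ...]}
    print(series["target"], series["datapoints"][-5:])
```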

So, first of all, that meant that we could have the graphs right there in Plexus, which really helps for investigating issues:

[!10_plexus_graphs.png]

And we can automatically generate some basic dashboards showing the status of all the servers/applications in one place:

[!11_plexus_dashboard_servers.png]

[!12_plexus_dashboard_traffic.png]

We could even put a small dashboard of basic stats directly into each of our Django apps:

[!17_pmt_stats.png]

(If you go to /stats/ on just about any of our Django apps, you can see something similar, though they are somewhat neglected at this point.)

More importantly, it meant that we could set up alerts based on the metrics. For this, I wrote a simple tool called Hound. Hound just takes a list of metrics and thresholds and sends you an email if a Graphite metric crosses its threshold. Since it alerts off a metric, it can also show you a graph of that metric so you have some context:

[!13_hound_alert.png]
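
The core idea is simple enough to sketch: pull the recent values for a metric out of Graphite, compare them against the threshold, and send mail if they're out of bounds. (This is a simplified illustration, not Hound's actual code; the real thing handles scheduling, flapping, putting the graph into the email, and so on. All of the names are made up.)

```python
import smtplib
from email.message import EmailMessage

import requests

GRAPHITE = "http://graphite.example.com/render"


def check(metric, threshold, to_addr):
    # Pull the last ten minutes of datapoints for the metric.
    resp = requests.get(GRAPHITE, params={
        "target": metric, "from": "-10min", "format": "json"})
    series = resp.json()
    if not series:
        return
    values = [v for v, _ts in series[0]["datapoints"] if v is not None]
    if values and max(values) > threshold:
        msg = EmailMessage()
        msg["Subject"] = "ALERT: %s over threshold (%s > %s)" % (
            metric, max(values), threshold)
        msg["From"] = "hound@example.com"
        msg["To"] = to_addr
        msg.set_content("Metric %s crossed its threshold." % metric)
        with smtplib.SMTP("localhost") as server:
            server.send_message(msg)


check("servers.web1.loadavg", 4.0, "ops@example.com")
```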

Of course, alerting on one metric isn't very interesting. What we really want is to be able to watch lots of metrics. At the moment, in Hound we have 267 different alerts. You get a nice summary view:

[!14_hound_dashboard.png]

Each of those green squares is an alert. If there's a problem, it turns red and you can click on it to see the details. The graphs below them show 24-hour and 7-day histories. If you look closely, you can see a couple of little red bits on the weekly graph where a few metrics went bad, but recently (at least when I made these screenshots) things have been calm.

We watch the usual basic server metrics like load and disk usage. Then we have an alert for every (Django) application so we know if it starts having errors or taking an unusually long time to serve responses. We also set up alerts for all kinds of miscellaneous things. Basically, any time we have an outage, as part of our post-mortem we figure out whether there were any metrics we could have been watching that would have alerted us to the problem in time to prevent the outage. If so, we add them to Hound.
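
How do per-application error and response-time numbers get into Graphite in the first place? One common pattern, sketched here as a general technique rather than a description of exactly what our apps do, is a middleware that times every request and fires the result at statsd over UDP; statsd then aggregates and forwards to Graphite:

```python
import socket
import time

# A sketch of the general technique, not necessarily what our apps do:
# time each request and send the result to statsd over UDP using its
# "name:value|type" line protocol. Host and metric names are made up.
STATSD_ADDR = ("statsd.example.com", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


class MetricsMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.time()
        response = self.get_response(request)
        elapsed_ms = int((time.time() - start) * 1000)
        # "|ms" marks a timer; statsd computes means/percentiles for us.
        sock.sendto(b"myapp.response_time:%d|ms" % elapsed_ms, STATSD_ADDR)
        if response.status_code >= 500:
            # "|c" marks a counter; one increment per server error.
            sock.sendto(b"myapp.errors:1|c", STATSD_ADDR)
        return response
```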

A few months ago, as we've been moving off of LITO's servers, we started running our own Graphite server instead of relying on theirs. One upside of doing this was that we could also set up Grafana, a popular open-source dashboard-building tool for Graphite. It makes much nicer-looking interactive graphs:

[!15_grafana_graph.png]

And it lets you put together nice dashboards (all through the web):

[!16_grafana_dashboard.png]

Staff all have access to this. If you want to put together a dashboard like this, talk to a developer.

Looking to the future, we are also now experimenting with feeding (more or less) all of our logs into ElasticSearch. Grafana can query ElasticSearch as well (though it's much more complicated), and ElasticSearch comes with Kibana, which has similarly sophisticated graph and dashboard creation functionality:

[!18_kibana.png]
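
The mechanics of getting a log line in are at least straightforward: ElasticSearch is just an HTTP API, and each log entry becomes a JSON document POSTed into an index. In practice a shipper like Logstash or Filebeat does this for you; the sketch below (with made-up host, index, and field names) just shows the shape of it:

```python
import datetime

import requests

# Each log entry becomes a JSON document POSTed into an index.
# The host, index, and field names here are made up for illustration.
doc = {
    "@timestamp": datetime.datetime.utcnow().isoformat(),
    "host": "web1",
    "app": "myapp",
    "level": "ERROR",
    "message": "something went wrong",
}
requests.post("http://elasticsearch.example.com:9200/logs/entry", json=doc)
```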

But to be honest, we are still figuring out how to use it beyond the absolute basics.
