The Lifecycle of an Outage
Scott Sanders -- firstname.lastname@example.org
So we're all here at Monitorama, and it's awesome to see so many incredible people in one place focusing on such an important topic. I'd like to talk a bit about outage lifecycles and how the monitoring and alerting tools we're familiar with can be woven into processes that enable confidence.
When an incident occurs, we typically have an increased risk of an outage. How we structure our initial response, our decision making process, and our communication directly affects the impact that this incident will have. We need to think critically about our ability to quickly resolve any problems and reduce the risk of future incidents.
It’s so freaking awesome to be an operations engineer right now, because we have better tools than we've ever had before, but in my mind, they're totally ineffective if we aren't using them to iterate towards better availability.
Availability is the key metric. I'm going to simplify a bit just in case anyone here doesn't agree with me. Take any other vital business metric and multiply it by your availability during an outage.
How many new users sign up when you're down?
How many new repositories are created when GitHub is down?
When you're down, it's none. And that sucks.
But everyone has outages. We're running complicated systems that are constantly in motion and constantly evolving.
What can we do?
We need to study the lifecycle of an outage, from the initial trigger to post-mortem. We need to take this information and learn from it so we can iterate on our tooling and iterate on our processes. We need to take steps to decrease the surface area of potential problems so we can increase our availability, even as our environment increases in complexity.
I really like this quote from Sidney Dekker, "Human error is not random. It is systematically connected to features of people's tools, tasks and operating environment.”
I find that a lot of outages are the result of human error. A bad deploy, an unsafe config management change, or even a failure to do effective capacity planning. These are all mistakes that humans make.
What can we learn from an outage that helps us prevent human error?
Let's start by looking at how humans get involved in an emerging incident.
The trigger is two things. It's the detection of the problem coupled with the notification. Let's assume, for now, that we've detected a problem somewhere, and our monitoring platform has decided it's time to generate an alert and get a human involved.
How can we package that information? What can go wrong? And how can we learn from past outages?
Generate meaningful alerts. Alert fatigue is a serious problem that causes on-call personnel to tune out notifications because there's too much noise and not enough signal. If a computer or an automated repair process can fix the problem, then don't generate an alert.
Human fatigue is another problem. It's unrealistic to expect anyone to have a quick reaction time when they get paged if they're exhausted or operating without sleep.
Use the shortest rotation schedule you can.
On the GitHub Ops team we've set up our first responder rotation to have 24 hour shifts. Hopefully as we continue to grow, we can make them even shorter.
Everyone poops. Deal with it. You can't react effectively to an alert if you're in the shower or getting a haircut, but with a little teamwork this isn't a problem.
We use a chatops command called 'pager me' that lets any engineer borrow the pager from the on-call engineer for a fixed amount of time so they can take a break or take a nap or take their kids to school.
Be persistent with your notifications. Don't page someone every 15 minutes, page them once, and then page them again, and again, and again, every minute until the alert is acknowledged.
Seconds count, and it's a lot harder to overlook a notification if it's being delivered over and over again.
Escalate quickly. Don't let a dead cell phone battery or a silenced ringer or a downed cell tower create an outage. If the on-call engineer doesn't respond, then wake someone else up.
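The persistent-paging and escalation policy described above can be sketched in a few lines. This is a minimal illustration, not GitHub's actual notification system; all names and parameters here are hypothetical.

```python
def page_until_acked(alert, oncall, backups, send, is_acked,
                     renotify_secs=60, max_tries=5):
    """Page the on-call engineer repeatedly; escalate to backups if no ack.

    send(responder, alert) delivers a page; is_acked(alert) checks for an ack.
    Both are supplied by the caller, so any paging backend can plug in.
    """
    import time
    for responder in [oncall] + backups:
        for _ in range(max_tries):
            send(responder, alert)       # page again, and again, and again
            time.sleep(renotify_secs)    # short interval: seconds count
            if is_acked(alert):
                return responder         # someone has taken command
        # no ack after max_tries pages: wake the next person up
    raise RuntimeError("alert %r unacknowledged after full escalation" % alert)
```

The point of the structure is that renotification and escalation are one loop: a dead phone battery or silenced ringer just means the inner loop exhausts and the next responder gets woken up.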
Be loud. Alert over different mediums. Tell your chatroom that something is wrong. Page phones. Send emails. Wake up your on-call team, but try to inform as many people as possible that something is wrong.
Create a handoff report for every on-call shift so that recurring notifications don't get overlooked.
We use a chatops command creatively named ‘/handoff’ to scan all of the alerts that happened during a shift, organize them, graph them, and drop them into an issue. When your shift ends, just add a few notes about the state of the system and mention whoever is on deck so they aren't caught off guard.
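A handoff report like the one described above is mostly a matter of grouping a shift's alerts and formatting them for an issue. Here's a rough sketch; the data shape and output format are invented for illustration and aren't the real '/handoff' implementation.

```python
from collections import Counter

def handoff_report(alerts, next_oncall):
    """Summarize a shift's alerts into an issue body and mention who's on deck.

    alerts is a list of dicts with a 'check' key naming the alert that fired.
    """
    counts = Counter(a["check"] for a in alerts)
    lines = ["## On-call handoff", ""]
    for check, n in counts.most_common():
        lines.append("- %s fired %d time(s)" % (check, n))
    lines += ["", "cc @%s -- you're on deck." % next_oncall]
    return "\n".join(lines)
```

Sorting by frequency surfaces recurring notifications first, which is exactly what the next shift shouldn't overlook.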
Now, back to the outage lifecycle. We're clued in that a problem exists, so let's look at our initial response.
This is usually an expensive context-switch for whoever is first on the scene. They need to establish command and identify how severe the problem is. And they need to do it quickly so corrective actions can be taken before things get worse.
Here’s an actual alert. The notification we see in chat gives us a graph of the problem, which we've found is a great way to make that context switch from sleep, or coding, or whatever, just a bit easier.
The '/pager' command is our gateway to the notification system. With it we can review in-progress alerts and acknowledge, escalate, or resolve them. When you run '/pager ack' in our Ops chatroom, you're not just acknowledging the alert, you're signaling to everyone present that you're in charge.
So, the next function you've got to perform is to figure out just how broken things are and who's being affected. Some alerts indicate a serious problem and we'll go straight to warning status, but many aren't so obvious. This is where we lean heavily on our monitoring and tooling to guide us.
But, in order to be effective we need to approach this in a way that can be iteratively improved.
At GitHub we have a strong culture of working together in chat to solve problems. We also make this really cool website that helps us collaborate on software development, so naturally we want to leverage these things when developing our interactions with our monitoring systems.
Let's take a quick look at how we gather data, how we present it, and how we interact with it. For starters, we drive an enormous amount of information into Graphite. We've got 5 machines dedicated to this task with a collective 80 CPU cores, 640 gigs of RAM, and 10 terabytes of solid state disks.
Server level metrics are typically gathered by collectd and sent back to graphite. We use many of the open source plugins to cover the basics like disk utilization, cpu, memory, network protocol stats and so on. And we’ve developed some of our own plugins and exec scripts to parse out and report on information specific to our environment.
Application level metrics are instrumented with statsd. Statsd is such a powerful way to sample realtime events and record statistics. There are client libraries out there for nearly every language and its incredibly developer friendly. When I checked this morning we were handling nearly 4 million statsd messages per second.
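Part of why statsd is so developer friendly is that the wire protocol is trivially simple: a metric name, a value, and a type, fired over UDP so instrumentation never blocks the application. A minimal client looks something like this (a sketch of the protocol, not any particular client library):

```python
import socket

class TinyStatsd:
    """Minimal statsd client: formats metrics and fires them over UDP."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    @staticmethod
    def format(bucket, value, mtype):
        # statsd wire format: <bucket>:<value>|<type>
        # where type is c (counter), ms (timer), or g (gauge)
        return "%s:%s|%s" % (bucket, value, mtype)

    def incr(self, bucket, n=1):
        self.sock.sendto(self.format(bucket, n, "c").encode(), self.addr)

    def timing(self, bucket, ms):
        self.sock.sendto(self.format(bucket, ms, "ms").encode(), self.addr)
```

Because it's fire-and-forget UDP, you can call `incr` in a hot code path without worrying about the monitoring system slowing down production.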
So we’ve got all these data sources and we're hammering this graphite cluster with more than 175 thousand graphite updates every second.
Graphite is an incredible workhorse.
It's a huge part of our metrics culture and we leverage it accordingly. I also love talking about it, so if you want to chat about how to scale graphite, come find me later.
Our logging pipeline feeds data into Splunk. Curt Micol's scrolls library provides an easy-to-work-with logging framework for our applications. Combine that with some tooling to make syslog-ng aware of our other log sources and we have a robust logging infrastructure that's indexing a terabyte of data every day.
Rounding out the monitoring platform is a raft of smaller, special purpose, data collection and visualization systems. These fill in the gaps and provide coverage for the things in our environment that are specific to our business.
At GitHub we don't consider any of these tools ready for production unless we can interact with them in chat. As an engineer queries different systems during an incident, that person is working in a shared communication space. Everyone present can see the data that drives decision making. This means we're writing code to interact with our data. And when we write code at GitHub we're creating opportunities for collaboration. Opportunities to iterate and opportunities to improve.
So create tools to interrogate your environment. We've got wrappers like '/graph me', which renders the requested graph in graphite, then saves the resulting image on a server and displays it in chat. We can tell graph me what time range we're interested in, and we can add extra parameters. The graph in chat also links back to the editor so we can perform ad-hoc exploration on an existing data query and save the changes for future use.
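Under the hood, a wrapper like '/graph me' is mostly building URLs against Graphite's render API, which takes `target`, `from`, and `format` query parameters. A sketch of that piece (the host is hypothetical, and this is not the actual '/graph me' code):

```python
from urllib.parse import urlencode

GRAPHITE = "https://graphite.example.com"  # hypothetical graphite host

def graph_url(targets, frm="-1h", width=800, height=400):
    """Build a Graphite render-API URL for one or more metric targets."""
    params = [("target", t) for t in targets]           # repeatable parameter
    params += [("from", frm), ("width", width),
               ("height", height), ("format", "png")]
    return GRAPHITE + "/render?" + urlencode(params)
```

A chatops wrapper would fetch this URL, stash the PNG somewhere durable, and drop the image into chat along with a link back to the graphite composer for ad-hoc exploration.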
Managing your metrics in graphite kind of sucks. It’s 2014, why are we still using ExtJS? To make it a bit less painful, add tooling. Metrics are kind of like cats, so I created a few composable binaries to help you herd metrics around regardless of how your cluster is designed. Using the building blocks in carbonate, I can create chatops to search your metrics using fancy things like a regular expression. Or do more complicated things like bulk rename metrics, or delete them.
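The regex-search chatop reduces to filtering a list of metric paths once a tool like carbonate has enumerated them. A toy version, assuming you already have the path list in hand (the function name is made up for this example):

```python
import re

def search_metrics(metric_paths, pattern):
    """Grep a list of dotted metric paths with a regular expression."""
    rx = re.compile(pattern)
    return [m for m in metric_paths if rx.search(m)]
```

The composable part matters: the same filtered list can feed a rename, a delete, or a '/graph me' call without any of those tools knowing how the cluster is laid out.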
We also have tooling to work with ‘/splunk’…
or ‘/nagios’. I could easily spend the rest of this talk telling you about our chatops tooling and still have barely scratched the surface. Go look at the hubot-scripts repo or watch talks by Jesse Newland or Josh Nichols if you want to know more. But the takeaway here is that you need to accept the processes that emerge through culture and experience, then adapt your tooling to augment these processes, not the other way around.
OK. Enough about that, let's get back to the outage lifecycle. At this point we've been informed something is wrong, we've established command, and we’ve oriented ourselves to the problem. It's time to take action and fix the problem.
Like every other step in the outage lifecycle, if we want to iterate towards less risk and better availability, we have to think about how we make changes to our system.
How can we build collective knowledge around changes? How do we create feedback loops and where can we put them? How can our tools build confidence?
The best way to explain some of the things we've learned is to look at a real example of an outage.
In the fall of last year, GitHub was hit with a couple of significant distributed denial of service attacks. I'm going to tell you about one of them. If we look at the initial alert coming from our notification system, you can see an issue being created in our 'nines' repository, but more on that later. One of our engineers, Aman, was in the Ops chatroom and thought that notification looked kind of important.
So he decides to take a look at the current traffic. That big blob of inbound traffic looks like it's growing and is definitely not normal. So he confirms the DDoS and generates another alert. This is what I meant by be loud earlier. When things are busted and action needs to be taken, it's super important to get the word out.
And it worked. It got my attention. Furthermore, as soon as I joined chat, I could see the information Aman had been looking at.
This particular incident happened while we were still getting the hang of mitigating DDoS attacks, but we'd already learned a lot from previous attacks and were starting to build better alerts and more reliable tooling to let us deal with them.
Back to that nines thing, all of our alerts create an issue in our 'nines' repo. This issue gets created by Hubot and is pre-populated with context around what triggered the notification. In this case our traffic management system detected a traffic anomaly and generated an alert. In the issue body we add a link to a playbook for that specific issue. Our playbooks live in git and are rendered on and off-site. They outline common triage patterns for each service, procedures for bringing the system back into balance, and who to contact if you need to consult an expert on that service.
Playbooks are such a great way to codify your emergency response and standard operating procedure. We're constantly using feedback from the outage lifecycle to grow and enhance our playbooks. They let us institutionalize knowledge. They also provide examples of how our operational tooling works.
The more we learn, the better our playbooks become.
Distributing knowledge and training people how to operate parts of your infrastructure is difficult. Create tools to do your job. If you find out something is broken, then you can use that knowledge to add monitoring coverage. And as you fix problems and uncover the nuances of a system, you can use that expert understanding to build tools to hide the fact that all software is terrible.
Modern infrastructure is complex, and usually built from dozens or hundreds of individual services. You quickly hit a scalability limit if everyone on the team is required to understand the entire system. So build tools that hide the horrible parts of the software we have to operate, make them safe for less experienced engineers, and document them in your playbooks.
In this case, I knew it was time to bring up some serious defenses. And serious chatops can be scary. We’ve learned to treat actions that cannot easily be aborted with respect, and always require confirmation before continuing. So, when I ask Hubot to bring our shield generator online, he asks me a question to make sure I mean what I say.
I repeat myself with the confirmation token, and let Hubot do all of the work. This is super powerful, because the process to enable our shield is complicated and takes some pretty careful timing. At three in the morning, I really don’t want to risk making a typo on a border router or forgetting some small, but incredibly crucial step, so I've taught our robot how to safely get the job done.
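The confirm-before-firing pattern is simple to build: the first command stashes a random token next to the pending action, and only the matching token triggers execution. A sketch of the idea, with names invented for illustration rather than taken from our actual Hubot scripts:

```python
import secrets

class ConfirmGate:
    """Require a confirmation token before running a dangerous chatops action."""

    def __init__(self):
        self.pending = {}  # token -> action name

    def request(self, action):
        token = secrets.token_hex(3)           # short, unguessable token
        self.pending[token] = action
        return ("This will run %r and can't easily be aborted. "
                "Confirm with: %s" % (action, token))

    def confirm(self, token, run):
        action = self.pending.pop(token, None)  # single use: pop, don't get
        if action is None:
            return "Unknown or expired token; nothing done."
        run(action)                             # hand off to the real tooling
        return "Running %r." % action
```

Because the token is random and single-use, a stray copy-paste or a repeated command can't accidentally fire the shield generator twice.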
Once Hubot is done and the attack traffic has been mitigated, we again fall back to our monitoring platform and verify the actions taken had the expected result. In this graph we can see that the DDoS traffic hitting our border has vanished. GitHub is safe. We can also see a couple other things. Specifically, the monitoring platform did a crappy job of recording data during the attack, and it definitely took longer than we’d like to fully mitigate the attack.
Which brings us to the last part of the outage lifecycle. We need to close the loop.
We're not done when the outage is resolved. We have to bring together all of the information about an incident and persist it so we can guide every iteration towards better availability and reduced risk.
Document everything that jeopardizes your availability. At GitHub we review the chat logs to create the timeline of the event, we'll link the graphs we saw, and build more supporting evidence. We have dedicated a repository just to tracking these issues. It's our availability repo.
Once the issue is created in the availability repo the collaborative magic can begin. Development teams will be mentioned and brought in for their expertise. Business teams and support teams are mentioned so they can help communicate what happened to a larger audience. In depth research happens and solutions are proposed.
In the attack I described we identified a number of things that could have been done better. We should have been able to automatically trigger mitigation based on the traffic characteristics. The misconfigured ACL that allowed attack traffic to disrupt monitoring definitely needed to be fixed. Hell, we didn't even generate an alert notifying someone that monitoring data was being dropped.
Sometimes an availability issue results in a public post-mortem, but it always results in improvements across the infrastructure that reduce future risk. This isn’t an overnight change. These issues can remain open for months as we pick apart the outage and address everything we find. We use issue references across repos to link individual pull requests back to the availability issue they address and only close the availability issue when we’re satisfied things are fixed.
These issues are one of the primary ways we close the loop on an outage and a big part of how we collaborate towards prevention. By studying these outages we've been able to make significant progress towards preventing denial of service attacks from impacting our availability.
Our ability to automatically profile and mitigate attack traffic and the alerting around it has come a long way.
We're also continuously scanning for changes to our attack surface and creating issues when changes happen so we have up-to-date ACLs and increased awareness.
Things are better, but the work is never done.
Availability is a huge part of why we build out all of this monitoring and why we create so many tools. Availability is the single most important metric to your operations team, and arguably your business. Study the lifecycle of your outages and learn from them. Let all of these amazing monitoring tools help you, but remember...
Your tools are complementary to your process, not the other way around.
Communication is the cornerstone for effective incident management.
Leverage the combination of process and tooling to enable confidence.
and… Never stop iterating on emergency response.