@pshima
Created July 6, 2016 02:39

Runbook for X

What is happening?

Such-and-such service is probably having X, which means Y.

What is the user experience when this happens?

No users will be able to do anything.

How does this alarm work?

It runs command XYZ

What are the likely causes of this?

Unicorns jumping into the cloud

What do I need to do?

  • Ack the incident in PagerDuty if you haven't already
  • Ack in the #ops channel that you are working on the incident
  • If we are completely down, engage your secondary immediately
  1. Log in to X
  2. Run command Y
  3. Check X
  4. If X > B, then do C

When is the event over?

Wait until everything has looked green for 15 minutes before calling the event resolved.
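The 15-minute rule can be checked mechanically if you have timestamped health samples. The sketch below is only an illustration: the function name `can_resolve`, the `(timestamp, is_green)` sample shape, and the idea of feeding it dashboard samples are all assumptions, not part of any real alerting tool. Only the 15-minute window comes from this runbook.

```python
from datetime import datetime, timedelta

# The 15-minute window is taken from the runbook; everything else is hypothetical.
QUIET_PERIOD = timedelta(minutes=15)


def can_resolve(samples, now, quiet_period=QUIET_PERIOD):
    """Return True if checks have been continuously green for quiet_period.

    samples: list of (timestamp, is_green) tuples, in any order.
    now: the current time as a datetime.
    """
    if not samples:
        # No data means we cannot claim the system is healthy.
        return False

    ordered = sorted(samples, key=lambda s: s[0])

    # The most recent sample must be green.
    if not ordered[-1][1]:
        return False

    # Walk backwards to find where the current unbroken green streak began.
    green_since = None
    for ts, is_green in reversed(ordered):
        if not is_green:
            break
        green_since = ts

    return now - green_since >= quiet_period


if __name__ == "__main__":
    t0 = datetime(2016, 7, 6, 2, 0)
    samples = [
        (t0, False),
        (t0 + timedelta(minutes=5), True),
        (t0 + timedelta(minutes=10), True),
    ]
    print(can_resolve(samples, now=t0 + timedelta(minutes=19)))
    print(can_resolve(samples, now=t0 + timedelta(minutes=20)))
```

Any red sample resets the clock, so a brief flap during the wait means the 15 minutes start over.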

Make a personal note to yourself on the incident and anything you learned.
