@atharrison
Created February 6, 2014 19:56
PagerDuty story

We've been using PagerDuty at Return Path for about a year now to keep track of on-call schedules and to alert only the right person during those pesky overnight outages. A few months ago I received an alert that one of our user shards was down (actually, I received many alerts, as the ripple effects made their way across the system). It was 3am, and I didn't have all the tools to solve this on my own, so I had to escalate to my manager.

We slogged through the issue together: isolating the shard, replacing the machine via the cloud, spinning everything back up, and watching as things recovered. Along the way, we revised our on-call triage page for this issue. But that's not the best part. See, a month later, when I was back on rotation (I knew I was on call because PagerDuty had sent me a text earlier in the week), another shard went down at 3am. But this time there was no panic, just a groggy but calm assurance that I knew what I was doing and everything would be fine by morning (and it was).

http://blog.pagerduty.com/2014/02/you-saved-the-day-now-get-recognized/
