@case-eee
Last active May 30, 2019 19:30
on call rfc

Let's improve our current on call process

Summary

This is a proposal to improve our current on call process and rotation by sharing the burden of this responsibility across more engineers on the team. The proposal includes creating two separate rotations, splitting the responsibilities between these two rotations, and adding everyone who is a Software Engineer Level 1 (frontend or backend) or higher to the new rotation. Continue reading for more details!

Motivation

Currently, only four engineers are on the on call rotation. Because we always have a primary and a secondary person on call, every individual has on call responsibilities every two weeks. This is not sustainable. We also want to improve developer happiness and engagement, and to avoid putting our quarterly objectives in danger because on call responsibilities take up a significant amount of time and effort every two weeks. An additional benefit is that more folks get the opportunity to become familiar with more parts of our application by actively triaging issues about once a quarter.

Solution

We'd like to propose two rotations instead of one:

  1. Tier 3 rotation
  2. Honeybadger rotation

Tier 3 rotation

The first rotation is the Tier 3 rotation (similar to what we currently have). This rotation lasts for a week at a time and will be responsible for the following:

  1. Triaging any PagerDuty alerts
  2. Taking on Tier 3 requests (when tech services needs assistance with urgent issues)

We'd have a primary individual and a secondary (as we do right now) and nothing would change with the folks who are currently on this rotation (Tom, Steve, Josh, and Ross).

Honeybadger Rotation

The second rotation is the Honeybadger rotation. We'd create a new channel that is dedicated to only Honeybadger issues. This rotation lasts for a week at a time and the responsibilities include the following:

  • Triage any error that appears in the new channel
    • If it's user-facing, submit a Zendesk ticket after you've looked into what might be causing the issue
    • If it isn't user-facing, write a CH story and assign it to the squad that is responsible for this area of the application, as listed here.

This rotation would consist of every full-time engineer (frontend or backend) who is a Software Engineer Level 1 or higher. Currently, that includes 9 individuals, which means that with week-long shifts each person will be on this rotation roughly every nine weeks, or about once a quarter. Casey will own coordinating this rotation with the Tier 3 rotation to ensure that there isn't any overlap: no individual will ever be on the Tier 3 rotation and the Honeybadger rotation at the same time.
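
To make the no-overlap rule concrete, here is a minimal sketch of how the Honeybadger schedule could be generated. It rests on several assumptions that are not part of this proposal: that the nine-person roster includes the four Tier 3 engineers, that the Tier 3 primary and secondary simply advance through the list each week, and that the remaining five engineers (who aren't named here) can be stood in for by placeholder names. The function names are hypothetical too.

```python
# Tier 3 names come from this RFC; the weekly pairing scheme below is an assumption.
TIER3 = ["Tom", "Steve", "Josh", "Ross"]

# Hypothetical roster: the RFC says nine engineers but does not list them,
# so the last five entries are placeholders.
HONEYBADGER = TIER3 + [f"Engineer {n}" for n in range(5, 10)]


def tier3_pair(week: int) -> tuple[str, str]:
    """Return the (primary, secondary) Tier 3 pair for a week (assumed scheme)."""
    primary = TIER3[week % len(TIER3)]
    secondary = TIER3[(week + 1) % len(TIER3)]
    return primary, secondary


def honeybadger_schedule(weeks: int) -> list[str]:
    """Pick one Honeybadger triager per week, skipping anyone on Tier 3 that week."""
    queue = list(HONEYBADGER)
    schedule = []
    for week in range(weeks):
        busy = set(tier3_pair(week))
        # Take the first person in the queue who is free, then move them to the back.
        for i, person in enumerate(queue):
            if person not in busy:
                schedule.append(person)
                queue.append(queue.pop(i))
                break
    return schedule


if __name__ == "__main__":
    for week, person in enumerate(honeybadger_schedule(9)):
        print(f"week {week}: Tier 3 = {tier3_pair(week)}, Honeybadger = {person}")
```

With these placeholder rosters, the first nine weeks assign each engineer exactly one Honeybadger shift, and nobody is ever scheduled for both rotations in the same week. A real schedule would also need to account for vacations and swaps, which this sketch ignores.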

Additionally, there are Datadog alerts that are pushed to the current #devops-application channel. These are often triaged by DevOps or by the individual on the current on call rotation. I'd like to propose that these become primarily DevOps' responsibility moving forward, because DevOps typically deals with these devops-specific issues.

Adoption Strategy

In order to adopt this, we'd need to introduce the following:

  • A new lesson that equips all engineers to triage Honeybadger errors
  • A new channel that only surfaces HB errors
  • Honeybadger access for everyone
  • An HB rotation schedule (and likely calendar invites)

Caveats

This shares the burden of the current on call responsibilities with more folks on the team, but this will only work if we are intentional about writing an effective lesson that will equip folks that have never triaged Honeybadger errors to feel comfortable and confident triaging them. Additionally, we'll need to continue supporting folks if they run into an issue and they don't have clear steps towards a solution.

Alternatives

We've considered having the on call person dedicate 100% of their time to on call responsibilities, to decrease the amount of context switching and ensure their squad doesn't expect any squad-related work from them during their time on call. We think this would have a much bigger impact on the squad's quarterly objectives than we'd like.

Unresolved questions

How do we measure the success of this change? One idea is to ask folks on each rotation how they feel about the amount of time they've spent on these responsibilities.
