case-eee/rfc.md

Let's improve our current on-call process

Summary

This is a proposal to improve our current on-call process and rotation by sharing the responsibility across more engineers on the team. It proposes creating two separate rotations, splitting the current responsibilities between them, and adding everyone who is a Software Engineer Level 1 (frontend or backend) or higher to the new rotation. Continue reading for more details!

Motivation

Currently, we only have four engineers on the on-call rotation. Because we always have a primary and a secondary person on call, every individual has on-call responsibilities every two weeks. This is not sustainable. We also want to improve developer happiness and engagement, and to avoid putting our quarterly objectives in danger because on-call responsibilities take up a significant amount of time and effort every two weeks. An additional benefit is that more folks get the opportunity to become familiar with more parts of our application by actively triaging issues during a rotation that comes up about once a quarter.

Solution

We'd like to propose two rotations instead of one:

Tier 3 rotation
Honeybadger rotation

Tier 3 rotation

The first rotation is the Tier 3 rotation (similar to what we currently have). This rotation lasts for a week at a time and will be responsible for the following:

Triaging any PagerDuty alerts
Taking on Tier 3 requests (when tech services needs assistance with urgent issues)

We'd have a primary individual and a secondary (as we do right now), and nothing would change for the folks who are currently on this rotation (Tom, Steve, Josh, and Ross).

Honeybadger rotation

The second rotation is the Honeybadger rotation. We'd create a new channel that is dedicated to only Honeybadger issues. This rotation lasts for a week at a time and the responsibilities include the following:

Triage any error that appears in the new channel

If it's user facing, submit a Zendesk ticket after you've looked into what might be causing the issue
If it isn't user facing, write a CH story and assign it to the squad that is responsible for that area of the application, as listed here.
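
The triage decision above can be sketched as a small function. This is only an illustration of the branching, not an integration with Honeybadger, Zendesk, or the story tracker: `TriageAction`, `triage`, and the `squad_by_area` mapping (a stand-in for the squad ownership list referenced above) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TriageAction:
    kind: str             # "zendesk_ticket" or "ch_story"
    squad: Optional[str]  # owning squad for CH stories, otherwise None


def triage(user_facing: bool, area: str, squad_by_area: Dict[str, str]) -> TriageAction:
    """Decide the follow-up for one error after an initial investigation."""
    if user_facing:
        # User-facing errors go to Zendesk once the likely cause is understood.
        return TriageAction(kind="zendesk_ticket", squad=None)
    # Everything else becomes a CH story for the squad that owns the area.
    # An unknown area yields squad=None, so the triager has to pick an owner.
    return TriageAction(kind="ch_story", squad=squad_by_area.get(area))
```

For example, `triage(False, "billing", {"billing": "payments-squad"})` would return a `ch_story` action assigned to the hypothetical `payments-squad`.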


This rotation would consist of every full-time engineer (frontend or backend) who is a Software Engineer Level 1 or higher. Currently, that includes 9 individuals, which means each person will be on this rotation about once a quarter. Casey will own coordinating this rotation with the Tier 3 rotation to ensure that there isn't any overlap: no individual will ever be on the Tier 3 rotation and the Honeybadger rotation at the same time.
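
As a rough sketch of how the no-overlap rule could be applied when drafting the schedule, the snippet below assigns one engineer per week and skips anyone who is on Tier 3 that week. It makes several assumptions that the proposal does not state: one Honeybadger person per week, the four Tier 3 engineers counting among the nine, placeholder names (`Eng5` to `Eng9`), and a Tier 3 pairing that simply alternates weekly. The function name `build_honeybadger_schedule` is hypothetical.

```python
def build_honeybadger_schedule(engineers, tier3_by_week):
    """Assign one engineer per week, skipping anyone on Tier 3 that week.

    engineers: ordered list of everyone on the Honeybadger rotation.
    tier3_by_week: one set per week holding that week's Tier 3 primary
        and secondary.
    """
    queue = list(engineers)  # front of the queue is next in line
    schedule = []
    for on_tier3 in tier3_by_week:
        for i, person in enumerate(queue):
            if person not in on_tier3:
                schedule.append(person)
                queue.append(queue.pop(i))  # send them to the back of the line
                break
        else:
            raise ValueError("everyone is on Tier 3 this week")
    return schedule


# Hypothetical roster: the four current Tier 3 engineers plus five placeholders.
engineers = ["Tom", "Steve", "Josh", "Ross", "Eng5", "Eng6", "Eng7", "Eng8", "Eng9"]
# Hypothetical Tier 3 pairing that alternates weekly, for nine weeks.
tier3_by_week = [{"Tom", "Steve"} if week % 2 == 0 else {"Josh", "Ross"}
                 for week in range(9)]
schedule = build_honeybadger_schedule(engineers, tier3_by_week)
```

Under these assumptions a full cycle takes nine weeks and each engineer appears exactly once, which is in line with the "about once a quarter" estimate.
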

Additionally, there are DataDog alerts that are pushed to the current #devops-application channel. These are often triaged by DevOps or by the individual on the current on-call rotation. I'd like to propose that these become primarily DevOps' responsibility moving forward, because that team typically deals with these DevOps-specific issues.

Adoption Strategy

In order to adopt this, we'd need the following:

A new lesson that equips all engineers to triage Honeybadger errors
A new channel that only surfaces Honeybadger errors
Honeybadger access for everyone
A Honeybadger rotation schedule (and likely calendar invites)

Caveats

This shares the burden of the current on-call responsibilities with more folks on the team, but it will only work if we are intentional about writing an effective lesson that equips folks who have never triaged Honeybadger errors to feel comfortable and confident triaging them. Additionally, we'll need to continue supporting folks when they run into an issue and don't have clear steps towards a solution.

Alternatives

We've considered having the on-call person dedicate 100% of their time to on-call responsibilities, to decrease context switching and ensure their squad doesn't expect any squad-related work from them during their time on call. We think this would have a much bigger impact on the squad's quarterly objectives than we'd like.

Unresolved questions

How do we measure the success of this change? One idea is to ask folks on each rotation how they feel about the amount of time they've spent on these responsibilities.