@matthiasr
Last active April 13, 2024 19:44
My company is going through a shift to more agile processes. How does work come into ops teams?

In my experience, there are three major sources of work for an opsy team:

  • Keeping the lights on – updates, maintenance, keeping entropy at bay. You can barely control how much of this there is, however you can control how much work it is to deal with through
  • Work to improve your own situation: automating stuff, architecting better systems, evolving processes, getting dev teams to do something in more ops friendly ways. Both of these compete for focus with
  • Support work, that is, helping product teams with all the things they need to build product. This is almost always interrupt-driven. You can try to get wind of these needs ahead of time, but if you insist on exact requests in time for your own sprint planning, you're instantly at least doubling the lead time on everything.

Things I have seen work to deal with this:

  • At an appropriate granularity of organization (by default, per team), designate a rotating first responder. This is the person who takes in short-term requests on whatever communication channel your company uses. It's helpful to designate a clear place for people who need this support to go, one that's not like shouting into a void. I don't believe in hiding behind a ticket queue, but that view also comes from cultures where tickets are not the main communication medium. Adapt whatever channel your devs would use to your needs. A natural choice is to have the current on-call handle this, if paging load allows (and if it doesn't, that's a problem to put planned resources into fixing). Exempt them from any planning for this time; in fact, discourage working on planned work at all. They can also handle low-urgency alerts; doing that somewhat consistently helps keep pager load at bay.
  • Incoming work that takes longer than some threshold (say an hour or two) gets funneled into a ticket by the first responder (directly or by asking the requester to create one). A simple integration between your communication system and ticket system can be helpful, like creating Jira tickets from Slack messages – that's usually available as a feature anyway. Always link the original request message so that the context carries over. The ticket can then get slotted into the queue of planned work (or denied) by whatever planning process you want to use.
  • Think of planning more like Kanban than Scrum: workload is unpredictable, so trying to hit arbitrary, self-imposed sprint goals only leads to burnout. Instead, have a clear ordering of the next ~2 weeks of concrete, broken-down work items, and 1-2 quarters of larger epics or topics to tackle. This ordering will change as time progresses; see regular planning as the check-in point to do that. At both levels, avoid gold plating by being explicit about roughly how much time you are willing to spend on a given task or epic. So instead of estimating how much work something will be, evaluate how much investment it is worth. Ops work can be very elastic in this, e.g. do you just do the needful right now, do you spend an extra half day documenting it, or an extra week automating it? This is a question you have to continually answer, and the answer will change based on the item at hand and what else is important right now.
  • The first responder can shield the rest of the team from some background noise of unplanned work, but this will not be complete. They will need expertise from others – in that case, I find it prudent to establish that the first responder (as load permits) participates in the resolution as a way to spread knowledge. This works well because the first responder rotates, while the incoming topics don't vary based on their individual expertise, so they're naturally exposed to a wide variety of topics.
  • Opsy people also tend to want to improve things along the way – leave space in the plan to do that. They're the ones dealing with the systems all day. If they get really annoyed at some friction, or really worried about a minor error, and spend half a day finally fixing it: on average, let them, they probably know better than their manager what's worth fixing. Put in some regular check-ins as a guard rail and encourage mentioning these things rather than keeping them under the radar. For example, encourage creating tickets (and labeling them as unplanned) for any side work that took more than an hour. That side work is happening anyway; by being open to it you get it out of the shadows and can either use it as a signal for planning, or gently put a lid on a rabbit hole.
  • Avoid the trap of hyper-specialization within a team. The managers' and ICs' first instinct will be to assign each task to whoever knows how to solve it the fastest. This is efficient in the short term (and can help through a crunch) but in the long run, locks you into the set ways of these people and makes you dependent on each individual's availability. By forcing work outside of this local optimum, with support from the experts, again you spread knowledge and improve the smoothness of your flow. Ideally, have the expert write the ticket with what needs to be done and then say "I know this best, so I won't do it but I am available for questions and pairing". On that note, pairing within reason is a powerful tool. Often I'd use pairing to get someone started on a task or to help them through a difficult spot, then let everyone go off and do their own thing for a while.
  • Larger architectural work tends to brew for a while before it becomes concrete. Look or ask for patterns in unstable systems, frequently asked questions, friction that comes up in complaints or retros. Have discussions about it when it's still on the 3-6 month horizon, involving architects or experienced engineers as appropriate, to build alignment on a reasonable problem statement, the range of solutions, and the size of investment you want to make. This may look the same as a waterfall plan, but it's much looser. The goal at that point in time is only to circumscribe the project, not define everything that happens within it. By having a clear direction of what you want to achieve, the people driving it when it comes around can make the details up as they go along (again with appropriate guard rails to communicate what they're doing).
  • As important as "how does work come in" is "how do you decide not to do a piece of work". There's always more to do and more that could be done. There will be ideas or requests that will simply never be high enough on the priority list to get done. Within a project there will be ideas or goals that are nice but not really worth it. Often you'll see these get pushed to the end because nobody thinks they're important enough to do right now, but of course they should be done some day by someone. Be aggressive in closing those when the time you think the project is worth is running out. Reassure yourself and others that the good ideas are not truly lost: the tickets are still there, and you can reopen them should they become more pressing again. But try not to accumulate an ever-growing backlog of work that you want to get to some day but realistically never will. It's frustrating for the people who asked for it and demoralizing for the team.
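
To make the first two points of the list above a bit more concrete, here is a minimal sketch of a weekly first-responder rotation and the "longer than a threshold becomes a ticket" rule. Everything in it is an assumption for illustration: the team names, the rotation start date, the two-hour threshold, and the `Request` shape are all hypothetical, and a real setup would hook into your actual chat and ticket systems.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical values, for illustration only.
TEAM = ["alice", "bob", "carol"]
ROTATION_START = date(2024, 1, 1)      # a Monday; the rotation changes weekly
TICKET_THRESHOLD = timedelta(hours=2)  # "say an hour or two"


def first_responder(day: date) -> str:
    """Return who takes short-term requests in the week containing `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return TEAM[weeks_elapsed % len(TEAM)]


@dataclass
class Request:
    summary: str
    link: str            # link to the original message, so context carries over
    estimate: timedelta  # the first responder's rough guess


def triage(req: Request) -> str:
    """Handle small requests directly; funnel larger ones into a ticket."""
    if req.estimate > TICKET_THRESHOLD:
        return f"ticket: {req.summary} (context: {req.link})"
    return f"handle now: {req.summary}"
```

The rough estimate here is only used to decide where the request goes; whether the resulting ticket gets planned or denied is still up to whatever planning process you use.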

In short, don't get caught up in the orthodoxy of Agile™, it is not built for operational teams. Instead, be a center of resilience and flexibility that can help move aside roadblocks on short notice, while protecting some capacity to improve the systems and tools of the opsy teams themselves.

I am a strong believer in putting dev teams on call for the systems they own, and at the same time I do not believe that this changes anything about what I wrote. There will always be additional expertise and support needed from the opsy teams, and more systems to build and support that allow this to happen in the first place.

I would recommend that you read Seeking SRE, a lesser-known offshoot of the SRE book series – it brings together a lot of different perspectives on how different lowercase-a agile companies have handled "SRE" (no matter what it is called locally). (Disclosure: I co-contributed a chapter to Seeking SRE and got a handful of copies for free, but I don't get anything from sales.)
