@cliftonc
Created September 13, 2018 11:29
war-room.md

Once a live outage (defined as a serious issue that is impacting a large number of our users) is reported:

  1. Post a call-out in Techy Chat so that as many people as possible are aware.

  2. If specific services are struggling, use (insert application / service ownership spreadsheet) to determine ownership and call in the right people.

  3. Always try to involve an operations team member via the Platform Operations room.

  4. A First Responder is nominated or steps up in Techy Chat.

  5. The First Responder is responsible for the following tasks:

    • Open a Zoom room for the outage
    • Ask everyone who joins the room to turn off video
    • Update the War Room topic
  6. Move conversation into the War Room

  7. Find the 2-3 people who could most effectively address the issue and get them into the War Room.

  8. Make sure a member of the operations team is in the room.

  9. If the live outage is going to last longer than 30 minutes, update the status page.

  10. Once the issue is resolved (this does not need to be done by the First Responder):

    • Create a post-mortem following the directions here
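
The steps above can also be tracked as a simple checklist. The following is a minimal sketch in Python, assuming a team wanted to record the state of an outage and see which actions are still outstanding. The `Outage` class, its fields, and its methods are all hypothetical names made up for illustration; they are not part of any existing tooling. Only the 30 minute threshold and the action wording come from the process itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Threshold from step 9: outages expected to last longer than this
# need a status page update.
STATUS_PAGE_THRESHOLD = timedelta(minutes=30)


@dataclass
class Outage:
    """Hypothetical record of a live outage (illustration only)."""

    reported_at: datetime
    first_responder: Optional[str] = None
    ops_member_present: bool = False
    status_page_updated: bool = False

    def needs_status_page_update(self, expected_resolution: datetime) -> bool:
        """True if the outage is expected to exceed 30 minutes and the
        status page has not been updated yet."""
        expected_duration = expected_resolution - self.reported_at
        return (
            expected_duration > STATUS_PAGE_THRESHOLD
            and not self.status_page_updated
        )

    def outstanding_actions(self, expected_resolution: datetime) -> List[str]:
        """List the checklist items that still need someone to act on them."""
        actions = []
        if self.first_responder is None:
            actions.append("Nominate a First Responder in Techy Chat")
        if not self.ops_member_present:
            actions.append("Get an operations team member into the War Room")
        if self.needs_status_page_update(expected_resolution):
            actions.append("Update the status page")
        return actions
```

Calling `outstanding_actions` with the current best estimate of the resolution time returns the remaining items in the order the process lists them, so it could be re-run whenever that estimate changes.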