nathenharvey/sample_post_mortem_template.md

## sample_post_mortem_template.md

      
INCIDENT DATE - INCIDENT TYPE

Meeting

Meeting Participants

Postmortems should be scheduled within 24 hours, or at most two business days,
of incident resolution.  Keep the following in mind when scheduling the meeting:

Who is required to attend the postmortem?


Incident commander and people actively involved in incident response.
A representative from each team that is likely to have action items assigned.
Postmortems should be given schedule priority by everyone required to
participate.  Meetings should be postponed or continued later when any required
participant is unable to attend or remain in the meeting.


Who should be invited to attend the postmortem?


At a minimum, the engineering and operations teams.
The entire company is welcome, but they won't know about the postmortem unless
you invite them, so consider doing so.
The public: in some instances it is appropriate to open a postmortem to the
public.  In that case, hold the postmortem via a Google Hangout or an open Zoom
meeting.

Waiving meetings

In some cases the Incident Commander (IC) might determine that a postmortem (PM)
meeting for the incident isn't needed.  If the IC decides to waive the meeting,
replace the Meeting section with a note indicating that the meeting has been
waived (example: Meeting waived: Jane Doe).

Record the meeting

In general, meetings should be recorded so that anyone not present can gain the
context of the meeting.  Add a link to the recording at the end of this
document.  If a meeting isn't recorded, add a statement explaining why.

Start every PM stating the following


This is a blameless postmortem.
We will not focus on past events as they pertain to "could've",
"should've", etc.
All follow-up action items will be assigned to a team or individual before the
end of the meeting.  If an item is not going to be a top priority after the
meeting, don't make it a follow-up item.

Incident Leader: Someone's name

Description

Short explanation of the issue (1 or 2 sentences).

Timeline

Please note the time to detect and the time to resolve, and add them to the
incidents list.

Timeline of events, including the exact duration of downtime.  The timeline
should be in chronological order, showing what happened when, but it should also
explain what the team knew at the time.

For example, someone deploys a bad build that triggers an alert, but no one
initially realizes this is what happened.  The timeline should first list that
the bad build was deployed, noting that the on-call person was not aware of this
at the time it occurred.  Later, the timeline might list an event where the
on-call person becomes aware of the bad build.

To facilitate future discussion, it's helpful to include the person who
performed an action or identified a contributing factor within the timeline.

All timestamps in the timeline should be in a single timezone, and the timezone
should be noted at the beginning of the timeline.  24-hour UTC is preferred
whenever possible.

Example timeline

(all times UTC)

13:35: John Doe (JD) delivers the pending MyFace change.
13:36: Mark Anyperson (MA) receives a PagerDuty alert indicating failure of
the MyFace app.
13:38: MA finds the MyFace change performed by JD and contacts him via Slack.
13:40: Incident declared by MA. Zoom started.
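
When responders take notes in their own local time, the entries need to be
normalized before they go into the timeline.  The following is a minimal Python
sketch of that step, not part of the template itself.  The helper `to_utc`, the
date, and the UTC offsets are hypothetical and chosen only so that the output
matches the first two entries of the example above.

```python
from datetime import datetime, timedelta, timezone


def to_utc(local_stamp: str, utc_offset_hours: float) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' local timestamp and convert it to UTC."""
    offset = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(local_stamp, "%Y-%m-%d %H:%M").replace(tzinfo=offset)
    return local.astimezone(timezone.utc)


# Hypothetical notes: (local timestamp, responder's UTC offset in hours, event).
notes = [
    ("2024-03-05 14:36", 1, "Mark Anyperson (MA) receives a PagerDuty alert."),
    ("2024-03-05 09:35", -4, "John Doe (JD) delivers the pending MyFace change."),
]

# Sort on the real UTC datetimes (not the formatted strings) so that ordering
# stays chronological even when an incident spans midnight.
events = sorted((to_utc(stamp, hours), text) for stamp, hours, text in notes)
lines = [f"{when:%H:%M}: {text}" for when, text in events]

print("(all times UTC)")
print("\n".join(lines))
```

Fixed offsets keep the sketch dependency-free.  Named timezones (for example via
Python's `zoneinfo` module) would handle daylight saving time changes, at the
cost of needing timezone data on the machine.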

Time to Detect and Resolve

The time to detect the issue and the time to resolve it should be clearly documented and captured below.

Time to detect -
Time to resolve -
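
Both values can be derived from the timeline.  The sketch below is one way to do
the arithmetic in Python, under two assumptions that your team may define
differently: time to detect runs from the triggering event to the first alert,
and time to resolve runs from the triggering event to resolution.  The date and
the resolution time are hypothetical, since the example timeline does not
include a resolution entry.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two UTC timestamps in FMT format."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


started = "2024-03-05 13:35"   # change delivered (example timeline; date is hypothetical)
detected = "2024-03-05 13:36"  # PagerDuty alert received (example timeline)
resolved = "2024-03-05 14:10"  # hypothetical resolution time

time_to_detect = minutes_between(started, detected)
time_to_resolve = minutes_between(started, resolved)

print(f"Time to detect - {time_to_detect} minutes")    # Time to detect - 1 minutes
print(f"Time to resolve - {time_to_resolve} minutes")  # Time to resolve - 35 minutes
```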

Contributing Factor(s)

Technical explanation of the issue.  This should define the contributing
factor(s) and why each is an issue.

Stabilization Steps

The specific steps and actions that were taken to stabilize the issue.  This
does not always entail a "fix"; further actions should be listed under
"Corrective Actions".

Impact

The impact of the incident.  This should include the total duration of the
outage, if applicable.

Corrective Actions

Action items going forward to fix the issue and reduce the chance of the
contributing factors recurring.

This MUST include the owners/teams assigned to see these actions through, and
each action must have an issue tracked in this repository (or a link to an
external team's kanban board or issue tracker).

Link to meeting recording

Once the recording is ready, place it in the Engineering Incidents folder on
Drive, and then link to it here.
Ensure the incident name and the date the postmortem was held are part of the
filename for the recording so it is easy to find.
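
A consistent naming scheme makes the recordings easier to search.  The Python
sketch below shows one possible convention.  The function name, the
date-first layout, and the `mp4` extension are illustrative assumptions rather
than requirements of this template.

```python
import re
from datetime import date


def recording_filename(incident_name: str, held_on: date, extension: str = "mp4") -> str:
    """Build a filename containing the postmortem date and the incident name."""
    # Lowercase the name and collapse anything that is not a letter or digit
    # into single hyphens, so the result is safe to use as a filename.
    slug = re.sub(r"[^a-z0-9]+", "-", incident_name.lower()).strip("-")
    return f"{held_on.isoformat()}_{slug}_postmortem.{extension}"


# Hypothetical incident name and meeting date.
print(recording_filename("MyFace app failure", date(2024, 3, 6)))
# 2024-03-06_myface-app-failure_postmortem.mp4
```

Putting the ISO date first means the files sort chronologically in a folder
listing.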