@andrewhowdencom
Last active October 19, 2023 19:38
Document Templates

Post mortem of the THING: DATE

What is this document?

A post mortem is a write-up of a system or process failure, written so that its causes can be identified and tasks created to prevent this class of issue from recurring. It is blameless, and as such does not dwell on what “could’ve” or “should’ve” been done.

Abstract

A_SHORT_SUMMARY

All related work can be tracked in the canonical bug tracker at:

  • BUG_TRACKER_URL_WITH_FILTER_FOR_TICKETS

Contributing Factors

FACTOR_TITLE

FACTOR_DESCRIPTION

Mitigating Factors

FACTOR_TITLE

FACTOR_DESCRIPTION

Ongoing Risks

RISK_TITLE

RISK_DESCRIPTION

Impact

Customer

The impacts to the users of the service include:

IMPACT_TITLE

A DESCRIPTION OF THE IMPACT

Business

The impacts to the business that owns the service include:

IMPACT_TITLE

A DESCRIPTION OF THE IMPACT

This can be seen in THE_NUMERIC_RESULT_OF_THE_IMPACT

Internal

The impacts to those who build and administer the service include:

IMPACT_TITLE

A DESCRIPTION OF THE IMPACT

This can be seen in THE_NUMERIC_RESULT_OF_THE_IMPACT

Timeline

Time | Event

Authors

  • INCIDENT_COMMAND

Resources

Resource | Location
Slack Channel | THE_SLACK_CHANNEL_USED_TO_DOCUMENT_FINDINGS

Thanks

  • OPERATIONAL_WORK

  • COMMUNICATION

  • PLANNING

  • INCIDENT_COMMAND

Appendices

People

Role | Person
Incident Coordinator | INCIDENT_COMMAND
Operational Work | OPERATIONAL_WORK
Communication | COMMUNICATION
Planning | PLANNING

Terms

Table 1. Terms

Term | Meaning
Customer | The users of the application or service
Project Owner (PO) | The owner of the project
Incident Coordinator (IC) | The person or people responsible for managing and documenting the incident
Operations (OPS) | The person or people responsible for investigating the issue and providing a temporary solution
Communication (COM) | The person designated as the point of contact between all team members
Planning (PLN) | The person accountable for ensuring that all follow-up changes are made
Contributing Factor | Something that happened that played a role in causing or prolonging the outage. Factors are broken down into "primary" and "secondary" factors.
Primary Contributing Factor | Something that caused the outage directly
Secondary Contributing Factor | Something that prolonged the outage, though it did not directly cause it
Mitigating Factor | Something that helped reduce the severity or length of the outage, but was not part of a structured procedure or normal circumstances
Ongoing Risk | Something that could have made the outage worse or more likely to occur but, purely by good fortune, did not

References

Referenced thing

THE_THING_BEING_REFERENCED

Request for Comments: X

This is a request for comments. It follows a defined format cite:[rfc2019] and is designed to communicate the justification for a given process or technology change as clearly as possible.

This RFC can be accessed at the following URL:

  • URL

Abstract

Stakeholders

Problem

Solution

Verifying Solution

Anticipated Difficulties

Risks

Previous Examples

Expert Opinion

Estimated Costs

Implementation

Proof of Concept

Scale

Enforcement

Completion & Evaluation

Rollback

Terminology

Word | Definition

References

bibliography::[]

@andrewhowdencom

View the raw file to see comments describing what each section of this document should contain.

@antoniuskoch

Will steal this when required
