@marcoslhc
Last active April 27, 2022 15:11

Root Cause Analysis

This template is intended for backward analysis of an incident:
move recursively through the causes, asking "what caused this
effect?" and then treating each cause as the next effect, until
you reach a satisfactory stopping point.
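
The backward walk described above can be sketched in a few lines of code. This is only an illustration, not part of the template: the function name `trace_root_cause`, the `caused_by` mapping, and the sample incident are all made up, and in a real analysis each link comes from investigation rather than a lookup table.

```python
def trace_root_cause(effect, caused_by, max_depth=10):
    """Follow "what caused this effect?" backward from the observed effect.

    `caused_by` is a hypothetical mapping from an effect to the cause
    found for it. The walk stops when no further cause is known, when
    a cause repeats (a cycle), or after `max_depth` causes.
    """
    chain = [effect]
    while chain[-1] in caused_by and len(chain) <= max_depth:
        cause = caused_by[chain[-1]]
        if cause in chain:  # the same cause again: stop instead of looping
            break
        chain.append(cause)  # the cause becomes the next effect to question
    return chain


# Invented incident, for illustration only.
caused_by = {
    "checkout page returns errors": "database connection pool exhausted",
    "database connection pool exhausted": "retry loop never closed connections",
    "retry loop never closed connections": "review checklist skipped resource cleanup",
}

chain = trace_root_cause("checkout page returns errors", caused_by)
for depth, step in enumerate(chain):
    print("  " * depth + step)
```

The last entry in the chain is the candidate root cause. Where to stop is a judgment call, which is why the sketch only stops mechanically when it runs out of known causes.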

Sequence of Events

A brief, time-based, sequential account of the events leading to
the discovery of the failure, the emergency call, the
investigation, and the resolution, if any.

Systems Affected

List all the systems that were impacted. Work "radially" outward
from the point of failure (or the primary system that went down).

Origin of the problem (Root Cause)

A more detailed technical explanation of the investigation results,
identifying, if possible, the mistake or inflection point where
the chain of decisions turned into the cause of the problem.

This can include:

  • Forensics
  • Benchmarks
  • Business Decisions

The points of failure after the mistake were:

Usually there is not just one mistake; it is followed by other
mistakes, failures of recognition, compounded events, etc. Mention
them here.

Resolution:

Write in short sentences how the problem was "hot" fixed and, if necessary,
add the additional steps needed to solve the issue permanently.

Learnings:

A more abstract set of conclusions and decisions to prevent this kind of issue
from happening again. This should be a second-order analysis, focusing less on
the details of the incident and more on the processes, rules, behaviors, or
culture that led to the incident.

Fallout - Clean up

If there were side effects from the issue or incident, mention them here. If the
root cause analysis looks backward, this part should look forward: "What will happen
after this incident? What state is the system in now? What events did this trigger?"
