@islomar
Last active March 19, 2020 12:13
Troubleshooting: incident resolution cheat sheet

Troubleshooting: incident resolution

Checklist

  • First of all: do "pair or mob problem solving" :-)
  • How were you notified about the problem? (monitoring system, email, customer call, etc.)
  • Does it happen to every customer? How does it affect them?
    • If it impacts your customers, follow company protocols to notify your customers.
  • Depending on the impact on the customers, set a clear deadline for solving it: it might be 0 seconds, 5 minutes, 1 hour, etc. After that time, apply a quick fix if possible (e.g. kill the container and start a new one?).
  • Try to verify the problem yourself.
  • Does it happen always? Only "sometimes"?
  • When does it happen? Accessing where and doing exactly what?
  • Look at the logs
  • Look at the monitoring systems (Grafana, Kibana, etc.)
  • Know the network topology: try pings with different packet sizes (I've seen MTU problems).
  • Be very clear about the architecture and technologies involved.
  • External dependencies: database, web services... Is everything up and running? Access it "by hand".
  • Are there several instances? What is shared among the instances? (e.g. same DB).
  • Does it happen only in one instance or all of them?
  • It would be nice to have some "system functional tests" prepared, in order to check each element independently: web server, app server, database, message brokers, etc.
  • Go for the tools :-)
  • After everything is finished, write a public postmortem!!
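
The "is everything up and running?" and "system functional tests" items above could start from something as small as a TCP reachability check per component. The following is only a minimal sketch: the function names `check_tcp` and `check_all`, the component names, and the addresses and ports are all hypothetical placeholders, not part of any existing tool. An open port only shows that something is listening; it does not prove the service is healthy, so you still want to access it "by hand" afterwards.

```python
import socket
from typing import Dict, Tuple


def check_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and opens the socket in one call.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


def check_all(targets: Dict[str, Tuple[str, int]], timeout: float = 2.0) -> Dict[str, bool]:
    """Check every named (host, port) pair independently and report each result."""
    return {
        name: check_tcp(host, port, timeout)
        for name, (host, port) in targets.items()
    }


if __name__ == "__main__":
    # Hypothetical components: replace the addresses and ports with your own.
    targets = {
        "web server": ("127.0.0.1", 80),
        "database": ("127.0.0.1", 5432),
        "message broker": ("127.0.0.1", 5672),
    }
    for name, ok in check_all(targets).items():
        print(f"{name}: {'UP' if ok else 'DOWN'}")
```

Running it from each instance (and from your own machine) helps answer whether the problem affects one instance or all of them, since every component is checked separately.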

Tools

JVM troubleshooting
