
@jmewes / Secret
Created Jun 12, 2018

Google SRE: Chapter 12 - Effective Troubleshooting

Troubleshooting process

Reproducible test cases

  • makes experiments easier

Understand current state

  • Telemetry, monitoring
  • Search for hints and correlations that show where the problem began
  • Analyze the data flow and the connections between components
  • What? Where? Why?

Keep a log book

  • Experiments and results
  • Helps to tame complexity

Divide and conquer

  • Examine each component individually
  • Use bisection to narrow down the faulty component
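
As a minimal sketch of bisection, assume the components form an ordered chain in which everything before some point behaves correctly and everything from that point on does not. The stage names and the `is_bad` check below are hypothetical stand-ins for a real inspection of each component's output.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first item for which is_bad is true.

    Assumes the items are ordered so that all good items come before
    all bad ones, and that the last item is bad.
    """
    lo, hi = 0, len(items) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid        # the first bad item is at mid or earlier
        else:
            lo = mid + 1    # the first bad item is after mid
    return lo

# Hypothetical pipeline; pretend the output is wrong from "enrich" onward.
stages = ["ingest", "parse", "enrich", "aggregate", "export"]
broken_from = stages.index("enrich")
culprit = stages[first_bad(stages, lambda s: stages.index(s) >= broken_from)]
print(culprit)
```

Each check halves the remaining search space, so a chain of n components needs roughly log2(n) checks instead of n.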

Understand the system

  • A deep understanding of how the system is supposed to work is the foundation for coming up with possible failure causes
  • Failure modes?

Tips and tricks

  • Before troubleshooting, make the system work for the users
  • Prevent recurrence
  • Write postmortems
  • What were the recent changes in the system or load?
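
The question about recent changes can be turned into a simple lookup if changes are recorded with timestamps. The sketch below assumes a change log of `(timestamp, description)` pairs; the entries, the six-hour window, and the `recent_changes` helper are all made up for illustration.

```python
from datetime import datetime, timedelta

def recent_changes(changes, incident_start, window=timedelta(hours=6)):
    """Return the changes made in the window before incident_start, newest first."""
    hits = [c for c in changes
            if incident_start - window <= c[0] <= incident_start]
    return sorted(hits, key=lambda c: c[0], reverse=True)

# Hypothetical change log entries.
changes = [
    (datetime(2018, 6, 11, 9, 0), "config push: frontend timeouts"),
    (datetime(2018, 6, 12, 13, 30), "binary rollout: storage"),
    (datetime(2018, 6, 12, 14, 45), "traffic shift: 20% to new cell"),
]
suspects = recent_changes(changes, datetime(2018, 6, 12, 15, 0))
for when, what in suspects:
    print(when.isoformat(), what)
```

The newest change is listed first because the change closest to the start of the problem is usually the first candidate to examine.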

Design for debugging

  • Make it possible to analyze the behavior of each component and see whether it is behaving correctly
  • Expose state
  • Make it so easy to understand that you don’t need to refer to architecture diagrams
  • Establish well-defined interfaces for the transformation from input to output
  • “Building observability—with both white-box metrics and structured logs—into each component from the ground up”


  • Make it possible to understand exactly what each process is doing
  • Use different log levels
  • Use a selection language to filter logs
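
The last two points can be illustrated with Python's standard `logging` module. This is only a sketch: the `component` field, the `ComponentFilter` class, and the logger name are assumptions made for the example, and a small filter class stands in for a real log selection language.

```python
import io
import logging

class ComponentFilter(logging.Filter):
    """Pass only records from one component at or above a minimum level."""

    def __init__(self, component, min_level):
        super().__init__()
        self.component = component
        self.min_level = min_level

    def filter(self, record):
        return (getattr(record, "component", None) == self.component
                and record.levelno >= self.min_level)

stream = io.StringIO()  # captured in memory so the result can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(component)s %(message)s"))
handler.addFilter(ComponentFilter("frontend", logging.WARNING))

log = logging.getLogger("sre_notes_demo")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(handler)

# The extra dict attaches a structured "component" field to each record.
log.debug("cache lookup", extra={"component": "frontend"})
log.warning("slow response", extra={"component": "frontend"})
log.error("disk full", extra={"component": "storage"})

output = stream.getvalue()
print(output, end="")
```

Only the frontend warning passes the filter; the debug record is below the level threshold and the storage error belongs to a different component.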

Build tools

  • Tools help with diagnosing services
  • Google SREs spend a lot of time building them