@jmewes jmewes/notes.org Secret
Created Jun 12, 2018

Google SRE: Chapter 12 - Effective Troubleshooting

https://landing.google.com/sre/book/chapters/effective-troubleshooting.html

Troubleshooting process

Reproducible test cases

  • makes experiments easier

Understand current state

  • Telemetry, monitoring
  • Search for hints and correlations where the problem began
  • analyze data flow and connections between components
  • What? Where? Why?

Keep a log book

  • Experiments and results
  • Helps to tame complexity

Divide and conquer

  • Examine each component individually
  • bisection
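
Bisection can be sketched in a few lines. The example below is a minimal illustration, not something taken from the chapter: it assumes an ordered list of candidates (pipeline stages, releases, commits) in which everything before some point is good and everything from that point on is bad, with the last item known to be bad. The names `first_bad`, `is_bad`, and the release labels are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad item.

    Assumes the items are ordered so that all good items come first,
    and that the last item is known to be bad.
    """
    lo, hi = 0, len(items) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid        # the first bad item is at mid or earlier
        else:
            lo = mid + 1    # the first bad item is after mid
    return lo


# Hypothetical data: eight releases, the regression arrived with "r6".
releases = ["r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"]
probes = []


def is_bad(release: str) -> bool:
    probes.append(release)              # record each experiment
    return int(release[1:]) >= 6


culprit = releases[first_bad(releases, is_bad)]
```

Each probe halves the remaining search space, so eight candidates need three experiments instead of up to eight.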

Understand the system

  • A deep understanding of how the system is supposed to work is the foundation for coming up with possible failure causes
  • Failure modes?

Tips and tricks

  • Before hunting for the root cause, make the system work for the users again
  • Prevent recurrence
  • Write postmortems
  • What were the recent changes in the system or load?
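
The question about recent changes can be turned into a small query over a change log. This is only a sketch under assumptions: it presumes changes are recorded as (timestamp, description) pairs, and the function name `changes_before`, the two-hour window, and all entries are made up for illustration.

```python
from datetime import datetime, timedelta


def changes_before(changes, onset, window):
    """Return the changes made within `window` before `onset`, newest first."""
    start = onset - window
    hits = [(ts, what) for ts, what in changes if start <= ts <= onset]
    return sorted(hits, reverse=True)


# Hypothetical change log; none of these entries come from a real system.
changes = [
    (datetime(2018, 6, 11, 9, 0), "config push: cache TTL"),
    (datetime(2018, 6, 12, 13, 30), "release 42 rolled out"),
    (datetime(2018, 6, 12, 14, 10), "load test started"),
    (datetime(2018, 6, 12, 16, 0), "certificate rotated"),
]

onset = datetime(2018, 6, 12, 14, 15)   # when the problem began
suspects = changes_before(changes, onset, timedelta(hours=2))
```

The newest change before the onset comes first, which is usually the first one worth checking. Changes made after the onset are excluded because they cannot have caused it.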

Design for debugging

  • Make it possible to analyze the behavior of each component and see whether it is behaving correctly
  • Expose state
  • Make it so easy to understand that you don’t need to refer to architecture diagrams
  • Establish well-defined interfaces for the transformation from input to output
  • “Building observability—with both white-box metrics and structured logs—into each component from the ground up”
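
A minimal sketch of what exposing state could look like for a single component. Everything here is hypothetical: the `Resizer` class, its counters, and the `debug_state` method are illustrative names, not an API from the book. The component has one well-defined transformation (input size to thumbnail size) and can report its own counters and last error on request.

```python
import json
from collections import Counter


class Resizer:
    """Hypothetical component: maps an image size to a thumbnail size."""

    def __init__(self, max_side):
        self.max_side = max_side
        self.counters = Counter()   # white-box metrics
        self.last_error = None

    def resize(self, width, height):
        """Scale (width, height) down so the longer side fits max_side."""
        self.counters["requests"] += 1
        if width <= 0 or height <= 0:
            self.counters["errors"] += 1
            self.last_error = f"invalid size {width}x{height}"
            raise ValueError(self.last_error)
        scale = min(1.0, self.max_side / max(width, height))
        return round(width * scale), round(height * scale)

    def debug_state(self):
        """Return the internal state as JSON, e.g. for a status endpoint."""
        return json.dumps(
            {
                "max_side": self.max_side,
                "counters": dict(self.counters),
                "last_error": self.last_error,
            },
            sort_keys=True,
        )
```

With something like `debug_state()` available, checking whether this component behaves correctly means comparing its inputs, outputs, and counters, without reading its source or attaching a debugger.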

Logging

  • Logs should make it possible to understand exactly what each process is doing
  • Use different log levels
  • Use a selection language to filter logs
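
Log levels and filtering can be illustrated with Python's standard `logging` module. This is a rough sketch, not Google's tooling: the `select` function stands in for a real selection language, and the logger names (`shop.frontend`, `shop.backend`) and messages are invented for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log records in memory so they can be queried later."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def select(records, min_level=logging.DEBUG, logger_prefix=""):
    """Tiny stand-in for a selection language: filter by level and logger name."""
    return [
        r for r in records
        if r.levelno >= min_level and r.name.startswith(logger_prefix)
    ]


handler = ListHandler()
root = logging.getLogger("shop")    # hypothetical service name
root.setLevel(logging.DEBUG)        # record everything, filter when reading
root.addHandler(handler)
root.propagate = False

logging.getLogger("shop.frontend").debug("rendering cart")
logging.getLogger("shop.backend").warning("slow query: %d ms", 850)
logging.getLogger("shop.backend").error("database unreachable")

backend_problems = select(handler.records, logging.WARNING, "shop.backend")
messages = [r.getMessage() for r in backend_problems]
```

Recording at a verbose level and narrowing the view at query time keeps the detail available for the one component under suspicion without drowning in output from the rest.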

Build tools

  • Tools help with diagnosing services
  • Google SREs spend a lot of time building them