jmewes/notes.org Secret

## notes.org

      
    https://landing.google.com/sre/book/chapters/effective-troubleshooting.html
Troubleshooting process

Reproducible test cases


  Make experiments easier

Understand current state


  Telemetry, monitoring
  Search for hints and correlations that show where the problem began
  Analyze data flow and connections between components
  What? Where? Why?
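
A minimal sketch of the "where did the problem begin" step: find the first telemetry sample above a threshold, then list the changes that happened shortly before it. The sample data, the threshold, the window size, and the helper names `onset` and `changes_before` are all assumptions made for illustration, not part of the SRE book.

```python
from typing import List, Optional, Tuple

# Hypothetical telemetry: (minute, error rate) samples.
error_rate: List[Tuple[int, float]] = [
    (0, 0.01), (1, 0.01), (2, 0.02), (3, 0.15), (4, 0.22), (5, 0.20),
]
# Hypothetical change log: (minute, description).
changes: List[Tuple[int, str]] = [
    (1, "config push: cache TTL"),
    (3, "deploy backend v2"),
    (5, "add capacity"),
]


def onset(samples: List[Tuple[int, float]], threshold: float) -> Optional[int]:
    """Return the minute of the first sample above the threshold, if any."""
    for minute, rate in samples:
        if rate > threshold:
            return minute
    return None


def changes_before(log: List[Tuple[int, str]], minute: int, window: int) -> List[str]:
    """Return changes made in the `window` minutes up to and including `minute`."""
    return [desc for t, desc in log if minute - window <= t <= minute]


start = onset(error_rate, threshold=0.05)
suspects = changes_before(changes, start, window=2) if start is not None else []
print(start, suspects)
```

The result is only a list of suspects. A change that coincides with the onset is a hint to test, not proof of the cause.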

Keep a log book


  Experiments and results
  Helps to tame complexity

Divide and conquer


  Examine each component individually
  Bisection
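
Bisection can be sketched as a binary search over an ordered list of candidates, for example revisions or pipeline stages. The sketch assumes the first item is known good, the last is known bad, and every item after the first bad one is also bad. The function name `first_bad` and the revision names are hypothetical.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def first_bad(items: Sequence[T], is_bad: Callable[[T], bool]) -> int:
    """Return the index of the first bad item.

    Assumes items[0] is good, items[-1] is bad, and that badness persists
    once it appears.
    """
    lo, hi = 0, len(items) - 1  # invariant: items[lo] is good, items[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(items[mid]):
            hi = mid  # the fault was introduced at mid or earlier
        else:
            lo = mid  # the fault was introduced after mid
    return hi


# Hypothetical example: revisions r0..r9, with the fault introduced in r6.
revisions: List[str] = [f"r{i}" for i in range(10)]
checked: List[str] = []


def is_bad(rev: str) -> bool:
    checked.append(rev)  # record every experiment, as a log book would
    return int(rev[1:]) >= 6


culprit = revisions[first_bad(revisions, is_bad)]
print(culprit, checked)
```

Each check halves the remaining range, so ten candidates need at most four experiments instead of up to eight with a linear scan.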

Understand the system


  A deep understanding of how the system is supposed to work is the foundation for coming up with possible failure causes
  Failure modes?

Tips and tricks


  Before hunting for the root cause, make the system work for the users again
  Prevent recurrence
  Write postmortems
  What were the recent changes in the system or its load?

Design for debugging


  Make it possible to analyze the behavior of each component and see whether it is behaving correctly
  Expose state
  Make it so easy to understand that you don’t need to refer to architecture diagrams
  Establish well-defined interfaces for the transformation from input to output
  “Building observability—with both white-box metrics and structured logs—into each component from the ground up”
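
One way to read the points above, as a rough sketch: a component with a single well-defined input-to-output method that also exposes its counters and emits one structured log entry per event. The `Tokenizer` class, its field names, and the JSON layout are hypothetical and only illustrate the idea.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Tokenizer:
    """Hypothetical pipeline stage: splits a line into lowercase words."""

    lines_seen: int = 0
    tokens_emitted: int = 0
    events: List[str] = field(default_factory=list)

    def process(self, line: str) -> List[str]:
        """Well-defined interface: one input line in, a list of tokens out."""
        tokens = line.lower().split()
        self.lines_seen += 1
        self.tokens_emitted += len(tokens)
        # Structured log entry: one JSON object per event, easy to filter later.
        self.events.append(json.dumps({
            "component": "tokenizer",
            "event": "processed",
            "input_chars": len(line),
            "tokens": len(tokens),
        }))
        return tokens

    def state(self) -> Dict[str, int]:
        """White-box view of the counters, e.g. for a debug page."""
        return {"lines_seen": self.lines_seen,
                "tokens_emitted": self.tokens_emitted}


stage = Tokenizer()
result = stage.process("Hello SRE World")
print(result, stage.state(), stage.events[0])
```

Because the stage can be fed an input and inspected on its own, it can be checked in isolation without consulting an architecture diagram.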

Logging


  Make it possible to understand exactly what each process is doing
  Use different log levels
  Use a selection language to filter logs
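
A small sketch of log levels and filtering with Python's standard `logging` module. The logger names `svc.frontend` and `svc.backend` and the `ComponentFilter` class are made up for this example. A real selection language would be more expressive than this single prefix-and-level rule.

```python
import io
import logging

stream = io.StringIO()  # stands in for a log file or stderr
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))


class ComponentFilter(logging.Filter):
    """Keep only records from one component at or above a minimum level."""

    def __init__(self, component: str, min_level: int) -> None:
        super().__init__()
        self.component = component
        self.min_level = min_level

    def filter(self, record: logging.LogRecord) -> bool:
        return (record.name.startswith(self.component)
                and record.levelno >= self.min_level)


handler.addFilter(ComponentFilter("svc.frontend", logging.WARNING))

app = logging.getLogger("svc")
app.setLevel(logging.DEBUG)  # record everything; select at the handler
app.propagate = False        # keep the example's output out of the root logger
app.addHandler(handler)

logging.getLogger("svc.frontend").debug("cache miss for key=42")
logging.getLogger("svc.frontend").warning("upstream latency 900 ms")
logging.getLogger("svc.backend").error("disk almost full")

output = stream.getvalue()
print(output, end="")
```

Only the frontend warning passes the filter. The debug message is dropped by level and the backend error is dropped by component.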

Build tools


  Help with diagnosing services
  Google SREs spend a lot of time on this