https://landing.google.com/sre/book/chapters/effective-troubleshooting.html
- makes experiments easier
- Telemetry, monitoring
- Search for hints and correlations where the problem began
- analyze data flow and connections between components
- What? Where? Why?
- Experiments and results
- Helps to tame complexity
- Examine each component individually
- bisection
- A deep understanding of how the system is supposed to work is the foundation for coming up with possible failure causes
- Failure modes?
- Before troubleshooting, make the system work for the users
- Prevent recurrence
- Write postmortems
- What were the recent changes in the system or load?
- Make it possible to analyze the behavior of each component and see whether it is behaving correctly
- Expose state
- Make it so easy to understand that you don’t need to refer to architecture diagrams
- Establish well-defined interfaces for the transformation from input to output
- “Building observability—with both white-box metrics and structured logs—into each component from the ground up”
- Make it possible to understand exactly what each process is doing
- Use different log levels
- Use a selection language to filter logs
- help with diagnosing services
- Google SREs spend a lot of time on this
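
The notes on structured logs, log levels, and filtering with a selection language can be illustrated with a small sketch. This is not taken from the chapter: it assumes Python's standard `logging` module, one JSON object per log line, and a hypothetical `select` helper that stands in for a real selection language. The component names `frontend` and `backend` and the log messages are made up for illustration.

```python
import io
import json
import logging

# Numeric severities, matching the standard logging module's values.
LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40, "CRITICAL": 50}


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (structured log)."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "component": record.name,
            "message": record.getMessage(),
        })


def select(lines, min_level="INFO", component=None):
    """Hypothetical stand-in for a selection language: keep entries at or
    above min_level, optionally restricted to a single component."""
    threshold = LEVELS[min_level]
    for line in lines:
        entry = json.loads(line)
        if LEVELS[entry["level"]] < threshold:
            continue
        if component is not None and entry["component"] != component:
            continue
        yield entry


def demo():
    """Emit a few structured log lines from two made-up components."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(JsonFormatter())
    for name in ("frontend", "backend"):
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG)
        logger.propagate = False      # keep output in our buffer only
        logger.handlers = [handler]
    logging.getLogger("frontend").debug("cache lookup for key=%s", "user:42")
    logging.getLogger("backend").warning("queue depth %d above threshold", 120)
    logging.getLogger("backend").error("upstream timeout after %d ms", 500)
    return buffer.getvalue().splitlines()


if __name__ == "__main__":
    for entry in select(demo(), min_level="WARNING", component="backend"):
        print(entry["level"], entry["message"])
```

Because every line is a structured record rather than free text, the filter can select by level and component without fragile string matching, which is the property the notes above are after.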