Art of Debugging
Obligatory disclaimer: this is all opinion and cannot possibly generalize to all problems, workflows, or environments. It is meant primarily as a launching point for an open-ended discussion of best practices in debugging code, and as a sketch of what may become a brief book that covers each point below with anecdotes, examples, and techniques.
The ultimate goal of a debugging session is to rule out possibilities until only one remains. All applications of best practices, techniques, and tools should be pointed at this purpose (contextualize all points below against this statement). Debugging is first and foremost a critical thinking problem, and presence of mind is your most critical asset. Establish a mental model but do not be afraid to invalidate it as the investigation unfolds.
Engineers don't use debuggers (system tracers, profilers, packet dumps, heap analyzers, memory sanitizers) when they should
Engineers rely on debuggers (system tracers, profilers, packet dumps, heap analyzers, memory sanitizers) when they shouldn't
Corollary to 1 and 2: Knowledge about the tools, what they are capable of, and how to use them is critical to avoid falling into either trap.
The overwhelming majority of bugs can be discovered by inspection and a little bit of thought
Corollary: the top few lines of a backtrace are all you should need to find the bug in the majority of cases
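As a small illustration of the corollary, the sketch below pulls only the frames closest to the failure out of a Python exception. Note that Python prints the most recent call last, so the "top" of a gdb-style backtrace corresponds to the end of a Python traceback. The helpers parse_port and load_config are hypothetical stand-ins for real application code, and innermost_frames is an illustrative name, not a standard function.

```python
import traceback


def innermost_frames(exc, count=3):
    """Return the `count` frames closest to where `exc` was raised, innermost first."""
    frames = traceback.extract_tb(exc.__traceback__)
    return [(f.filename, f.lineno, f.name, f.line) for f in reversed(frames[-count:])]


def parse_port(text):
    # Hypothetical buggy helper: raises ValueError on input such as "80a".
    return int(text)


def load_config(raw):
    # Hypothetical caller, one frame further from the failure.
    return {"port": parse_port(raw["port"])}


if __name__ == "__main__":
    try:
        load_config({"port": "80a"})
    except ValueError as exc:
        # Inspect only the frames nearest the failure before reaching for heavier tools.
        for filename, lineno, name, line in innermost_frames(exc):
            print(f"{name} ({filename}:{lineno}): {line}")
```

In this toy case the innermost frame already names the function that received the bad input, which is usually where inspection should start.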
If hypothesis 3 is often violated, that is, if bugs routinely cannot be found by inspection and a little thought, architectural rebalancing is likely in order
Despite the debugger not being critical in the majority of use cases, knowing it well is important
When debugging multithreaded applications, reduce complexity first. Reduce the problem to as few threads as possible. Log actions and pay attention to the thread id (the log should be serialized on a single thread).
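One way to get a log that is serialized on a single thread while still recording the thread id is sketched below, using the standard library's QueueHandler and QueueListener: worker threads only enqueue records, and one listener thread writes them out. The logger name "debug_session", the format string, and the worker function are assumptions made for illustration.

```python
import io
import logging
import logging.handlers
import queue
import threading


def make_serialized_logger(sink):
    """Return (logger, listener); every record is written to `sink` by one listener thread."""
    log_queue = queue.Queue()
    logger = logging.getLogger("debug_session")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()
    logger.propagate = False
    # Worker threads only enqueue records; they never touch the sink directly.
    logger.addHandler(logging.handlers.QueueHandler(log_queue))

    handler = logging.StreamHandler(sink)
    # Tag each line with the name and id of the thread that produced it.
    handler.setFormatter(logging.Formatter("%(threadName)s[%(thread)d] %(message)s"))
    listener = logging.handlers.QueueListener(log_queue, handler)
    listener.start()
    return logger, listener


def worker(logger, steps):
    # Hypothetical workload: log each action as it happens.
    for step in range(steps):
        logger.debug("step %d", step)


def run_demo(num_threads=2, steps=3):
    sink = io.StringIO()
    logger, listener = make_serialized_logger(sink)
    threads = [
        threading.Thread(target=worker, args=(logger, steps), name=f"worker-{i}")
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    listener.stop()  # drains the queue before returning
    return sink.getvalue().splitlines()


if __name__ == "__main__":
    print("\n".join(run_demo()))
```

Because a single thread does all the writing, lines are never interleaved mid-message, and the order in the log is the order in which records were enqueued.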
Hypothesis 6 generalizes to other "hard" problems. Simplify first. Comment out code. Remove things. Test fixes with minimal changes to validate or invalidate a possible avenue of investigation.
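"Remove things" can sometimes be automated. The sketch below is a simplified greedy reduction, loosely in the spirit of delta debugging rather than a faithful implementation of it: it keeps dropping chunks of a failing input for as long as the failure still reproduces. The function name shrink_input and the still_fails callback are hypothetical, and the failure condition in the demo is a toy stand-in for a real bug.

```python
def shrink_input(items, still_fails):
    """Greedily drop chunks of `items` while `still_fails(items)` stays true."""
    if not still_fails(items):
        raise ValueError("start from an input that reproduces the failure")
    chunk = max(len(items) // 2, 1)
    while chunk >= 1:
        i = 0
        while i < len(items):
            candidate = items[:i] + items[i + chunk:]
            if still_fails(candidate):
                items = candidate  # the removed chunk was irrelevant to the failure
            else:
                i += chunk         # the chunk is needed to reproduce; keep it
        chunk //= 2
    return items


if __name__ == "__main__":
    # Toy failure: the bug only triggers when both 3 and 7 are present.
    def triggers_bug(xs):
        return 3 in xs and 7 in xs

    print(shrink_input(list(range(10)), triggers_bug))  # prints [3, 7]
```

The same loop applies to lines of a config file, steps of a test script, or a list of enabled features, as long as the failure check is cheap enough to run repeatedly.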
It's not always the tricky code that breaks. Only debug a race condition if you first verify that it is, in fact, a race condition (see hypothesis 6).
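A cheap check before chasing a suspected race is to rerun the same workload on one thread. The sketch below uses a hypothetical count_events function with a deliberate off-by-one bug: the total is wrong under many threads, which looks like lost updates, but it is just as wrong with a single thread, so the race theory can be ruled out. All names here are illustrative assumptions.

```python
import threading


def count_events(batches, num_threads):
    """Count events across `batches` with `num_threads` workers (contains a deliberate bug)."""
    total = 0
    lock = threading.Lock()

    def worker(my_batches):
        nonlocal total
        for batch in my_batches:
            with lock:
                total += len(batch) - 1  # deliberate off-by-one; not a race

    threads = [
        threading.Thread(target=worker, args=(batches[i::num_threads],))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def fails_single_threaded(batches):
    """True if the wrong answer shows up even with one worker thread."""
    expected = sum(len(batch) for batch in batches)
    return count_events(batches, num_threads=1) != expected


if __name__ == "__main__":
    batches = [[1, 2, 3], [4, 5]]
    print(count_events(batches, num_threads=4))  # wrong total under concurrency
    print(fails_single_threaded(batches))        # still wrong on one thread
```

The converse is weaker: a failure that disappears on one thread is consistent with a race but does not prove one, since the change in timing alone can hide other problems.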
Things that changed recently are the most likely things to break. Assign probabilistic blame to the following things IN THIS ORDER (descending from the most likely source of the problem to the least likely):
- Code that was recently authored (aka your code)
- Code that has broken before
- The way your code interacts with someone else's code
- The way your code interacts with third party code
- Third party code
- Debugging tool/reporting malfunction
- Operating system
There are no accidents. Never discount or throw away information during an investigation. Don't ignore crashes, logs, anomalies, warnings, or errors, no matter how infrequent or rare they may seem. If it has happened once, it will happen again, and each incident may contain valuable clues about the system's health as a whole.
Programmers spend too much time retreading old ground. Setting up a repeatable breakage is often the first step in the debugging process. However, don't waste cycles re-running it if you don't have a new idea to test or if another run won't yield new information.
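For failures that depend on randomized input, one way to get a repeatable breakage is to pin the random seed. The sketch below searches for a seed whose generated input breaks a property, after which that seed replays the exact same failure on demand. buggy_max, make_values, and max_is_correct are hypothetical stand-ins for real code under test.

```python
import random


def find_failing_seed(property_holds, make_input, max_seeds=1000):
    """Return the first seed whose generated input violates the property, or None."""
    for seed in range(max_seeds):
        data = make_input(random.Random(seed))
        if not property_holds(data):
            return seed
    return None


def buggy_max(values):
    # Hypothetical bug: starting from 0 is wrong when every value is negative.
    best = 0
    for value in values:
        if value > best:
            best = value
    return best


def make_values(rng):
    # Hypothetical input generator: one to four small integers.
    return [rng.randint(-5, 4) for _ in range(rng.randint(1, 4))]


def max_is_correct(values):
    return buggy_max(values) == max(values)


if __name__ == "__main__":
    seed = find_failing_seed(max_is_correct, make_values)
    if seed is not None:
        # Re-creating the generator from the same seed replays the same input.
        print(seed, make_values(random.Random(seed)))
```

Once the seed is recorded, each new idea can be tested against the identical input, which avoids re-running the search when there is nothing new to learn from it.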
In well-designed systems, the number of possible causes of a problem that need to be ruled out per hypothesis 0 is low, regardless of the complexity of the problem (!).