amodm/wdp-20220115.md

    How debuggers work

I'm writing this post in response to the WeekendDevPuzzle of 2022-01-15. This thread has a bunch of submissions, and it's insightful to go over how we often think about debuggers. I'll be linking this post to this twitter thread, which may also contain some follow-ups later.
On Linux

Let's start with the program we'll be debugging. It's a fairly simple piece of code. We will set a breakpoint on the crashtest function, crash the debugger, and see what happens.
Let's begin by running this program (compiled using gcc -o testdebugcrash testdebugcrash.c). We'll be running it on Linux for now (why that matters will become clear later).
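To make the "first 4 bytes of machine code" idea concrete, here is a minimal sketch of how a program can dump the start of one of its own functions. This is not the original testdebugcrash.c: the helper name first_code_bytes is hypothetical, and it assumes a platform (like Linux) where code pages are readable.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the function we will later set a breakpoint on.
 * noinline keeps it as a real function with its own address. */
__attribute__((noinline)) void crashtest(void) {
    puts("inside crashtest");
}

/* Hypothetical helper: formats the first 4 bytes of fn's machine code
 * as "xx xx xx xx" into out, which must hold at least 12 chars.
 * Assumes code pages are readable, as they are on Linux. */
void first_code_bytes(void (*fn)(void), char out[12]) {
    unsigned char bytes[4];
    memcpy(bytes, (const void *)(uintptr_t)fn, sizeof bytes);
    snprintf(out, 12, "%02x %02x %02x %02x",
             bytes[0], bytes[1], bytes[2], bytes[3]);
}
```

A main built around this would presumably print its PID and these bytes, wait for a debugger to attach, and then call crashtest(). Printing the bytes again later is what reveals any patching.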

So, the program is asking us to attach a debugger, and is also showing us the first 4 bytes of the machine code corresponding to the function crashtest. So let's attach one (in another terminal). Once gdb attaches to the testdebugcrash process, I do 3 separate things:

disassemble crashtest (show machine instructions)
break crashtest (set breakpoint at "crashtest")
continue (resume the program)


Now, let's "crash" the debugger. We'll do that by running kill -9 <debugger-pid>. Unlike most other signals, signal 9 (SIGKILL) cannot be caught or handled, so the debugger process gets no chance to clean up. That's as close to simulating a crash as we can get.
Let's see what happens to our program, once the debugger crashes.
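Why -9 in particular? A process can install a handler for most signals and clean up inside it, but the kernel refuses to let anyone handle SIGKILL. A small sketch of that difference (the helper name try_install_handler is just for illustration):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>

/* A do-nothing handler; a real program would clean up here. */
static void on_signal(int signo) { (void)signo; }

/* Hypothetical helper: tries to install a handler for signo.
 * Returns 0 on success, or the errno value reported by sigaction. */
int try_install_handler(int signo) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = on_signal;
    if (sigaction(signo, &sa, NULL) == -1)
        return errno;
    return 0;
}
```

Calling try_install_handler(SIGTERM) succeeds, while try_install_handler(SIGKILL) fails with EINVAL. So a debugger killed with -9 has no hook in which it could undo its breakpoints.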

Our program crashed with an error of Trace/breakpoint trap. We also notice that the machine code of our function crashtest has changed. The first byte of that function is now 0xcc, which on x86 is the int3 instruction, and it raises a software interrupt.
We can also verify this change of machine code by attaching a new debugger session immediately after crashing the first one.

So when we use a debugger to set a breakpoint, it "patches" the machine code of the program being debugged, replacing the instruction at the breakpoint address with 0xcc/int3 (on x86) or BRK (on ARM).
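The patching itself is just a one-byte swap plus some bookkeeping. Here is a minimal simulation on a plain byte buffer rather than on a live process. The breakpoint struct and the bp_insert/bp_remove names are hypothetical, not taken from gdb.

```c
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC  /* x86 breakpoint instruction */

/* Hypothetical bookkeeping record a debugger keeps per breakpoint. */
typedef struct {
    size_t  offset;  /* where in the code the patch was applied */
    uint8_t saved;   /* original byte, needed to undo the patch */
    int     active;
} breakpoint;

/* Overwrite one byte with int3 and remember what used to be there. */
breakpoint bp_insert(uint8_t *code, size_t offset) {
    breakpoint bp = { offset, code[offset], 1 };
    code[offset] = INT3_OPCODE;
    return bp;
}

/* Put the original byte back. If this never runs, the 0xcc stays. */
void bp_remove(uint8_t *code, breakpoint *bp) {
    if (bp->active) {
        code[bp->offset] = bp->saved;
        bp->active = 0;
    }
}
```

With a buffer holding 55 48 89 e5 (a typical x86-64 function prologue), bp_insert(code, 0) turns the first byte into cc and stores 55 in the record. Note that the saved byte lives only in the debugger's memory, which is exactly what gets lost when the debugger dies.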
Explanation

The way a debugger often operates is by first attaching to a process (via ptrace on Linux), which gives it read/write privileges to that process's memory. When a breakpoint is set, the debugger uses this access to overwrite the start of the instruction at the breakpoint address with an int3. When the program resumes execution and ultimately reaches the breakpoint address, the int3 instruction is processed by the CPU to raise a software interrupt, which the OS converts into a SIGTRAP signal.
Because the debugger is ptrace-ing the program, the OS first sends the signal to the debugger, while also pausing the execution of the program (removing it from the schedulable tasks). This allows the debugger (or the person behind it) to view any memory/variables or anything else, before resuming the execution of the program. When the debugger wants to "step through" or "resume" the program, the machine instruction is "repatched" to the original machine code. This is possible because the debugger saved the original bytes when it first patched them.
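The whole cycle described above (trace, patch, trap, repatch, resume) can be sketched with raw ptrace calls. This is a simplified illustration for Linux on x86-64 only, not how gdb is actually structured. Instead of attaching to a separate program, the parent plays the debugger for its own forked child, and the names crashtest and breakpoint_demo are stand-ins.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#if defined(__linux__) && defined(__x86_64__)
#include <sys/ptrace.h>
#include <sys/user.h>
#endif

/* Stand-in breakpoint target; noinline so it keeps a real address. */
__attribute__((noinline)) void crashtest(void) {
    static volatile int calls;
    calls++;
}

/* Returns 1 if the breakpoint was hit and the child then finished cleanly,
 * 0 if tracing is unavailable here, -1 if something unexpected happened. */
int breakpoint_demo(void) {
#if defined(__linux__) && defined(__x86_64__)
    int status;
    pid_t child = fork();
    if (child < 0) return 0;
    if (child == 0) {                     /* the "program being debugged" */
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1) _exit(2);
        raise(SIGSTOP);                   /* pause until the parent has patched us */
        crashtest();
        _exit(0);
    }
    if (waitpid(child, &status, 0) < 0 || !WIFSTOPPED(status))
        return 0;                         /* ptrace was refused in the child */

    /* After fork, crashtest lives at the same address in the child. */
    void *addr = (void *)(uintptr_t)crashtest;
    errno = 0;
    long orig = ptrace(PTRACE_PEEKTEXT, child, addr, NULL);
    long patched = (orig & ~0xffL) | 0xcc;   /* lowest byte is the first code byte */
    if (errno != 0 ||
        ptrace(PTRACE_POKETEXT, child, addr, (void *)patched) == -1) {
        kill(child, SIGKILL);
        waitpid(child, &status, 0);
        return 0;
    }

    ptrace(PTRACE_CONT, child, NULL, NULL);  /* resume; child runs into the int3 */
    waitpid(child, &status, 0);
    if (!(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP)) {
        if (WIFSTOPPED(status)) {
            kill(child, SIGKILL);
            waitpid(child, &status, 0);
        }
        return -1;
    }

    /* Repatch: restore the saved word and rewind past the 1-byte int3. */
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);
    regs.rip -= 1;
    ptrace(PTRACE_POKETEXT, child, addr, (void *)orig);
    ptrace(PTRACE_SETREGS, child, NULL, &regs);
    ptrace(PTRACE_CONT, child, NULL, NULL);
    waitpid(child, &status, 0);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : -1;
#else
    return 0;                             /* sketch only covers Linux/x86-64 */
#endif
}
```

If the parent were killed between the PTRACE_POKETEXT that writes 0xcc and the one that restores the original word, the child would be left running with the patched byte in place, which is the situation the puzzle is about.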
But if the debugger crashes before the breakpoint is reached, that int3 just stays in memory, with nobody to clean it up. The resulting SIGTRAP signal will need to be processed by the program itself. If the program has a signal handler installed for SIGTRAP (as was the case here), that gets invoked, and the program doesn't crash immediately, but even then, the original machine code which was replaced with int3 is no longer there, so it's highly likely that the program will do something unpredictable. Most often, programs don't have a handler for SIGTRAP, and they just crash.
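The difference between having and not having a SIGTRAP handler is easy to observe without any debugger at all. In the sketch below, raise(SIGTRAP) is used as a stand-in for executing a leftover int3 (that substitution is my simplification), and run_child_with_trap is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A handler that simply returns, letting the program carry on. */
static void on_trap(int signo) { (void)signo; }

/* Forks a child that receives SIGTRAP with no debugger attached.
 * Returns the signal that killed the child, 0 if it exited normally,
 * or -1 if fork/waitpid failed. */
int run_child_with_trap(int install_handler) {
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sigemptyset(&sa.sa_mask);
        sa.sa_handler = install_handler ? on_trap : SIG_DFL;
        sigaction(SIGTRAP, &sa, NULL);
        raise(SIGTRAP);   /* stand-in for hitting a leftover int3 */
        _exit(0);         /* only reached if the signal was handled */
    }
    int status;
    if (waitpid(child, &status, 0) < 0) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

run_child_with_trap(0) reports SIGTRAP as the cause of death, which the shell prints as Trace/breakpoint trap, while run_child_with_trap(1) reports a normal exit. With a real leftover int3 the handled case is less benign than it looks here, because execution continues in code whose first byte is still wrong.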
Other realities

macOS

Try the same thing as above on macOS (you may need to switch to lldb instead of gdb), and you'd find no such crash. I was surprised by this. Upon some digging, it seems that on macOS, lldb uses something called BreakpointSite::eExternal, instead of the usual BreakpointSite::eSoftware, which is very interesting, because it's able to set a breakpoint without modifying the instructions.
I have not dug deeper into how this mechanism works, but if you know the details, I'd love to hear more about it. You can share your thoughts here, or on this twitter thread.
Managed runtimes like JVM etc.

For something like a JVM, which exposes remote debugging options, the JVM process is most likely doing the patching/repatching of the bytecode itself (as breakpoints are set against the Java bytecode), so no such crash should happen, unless we've attached the debugger directly to the JVM process, instead of going through the debugging protocol it exposes.
Conclusion

So if I were to summarise, the answer to the original puzzle would be "it depends", on:

If the debugger that crashed is using a software patch (int3/brk) mechanism, it's highly likely, but not guaranteed, that the debugged program will eventually crash (remember that there's always a possibility that the breakpoint is never hit!)
On macOS, at least with lldb, an eExternal mechanism seems to be used, which doesn't patch the code, and thus guarantees no crash.

In any case, it's always interesting to know how debuggers work, and if you look closely enough, it almost feels like a hack, even if a well-managed one. If it does, you'd be surprised at the number of places in software engineering where we use very similar dynamic patching techniques.