Skip to content

Instantly share code, notes, and snippets.

@joelagnel
Last active March 12, 2023 15:42
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save joelagnel/ae15c404facee0eb3ebb8aff0e996a68 to your computer and use it in GitHub Desktop.
Save joelagnel/ae15c404facee0eb3ebb8aff0e996a68 to your computer and use it in GitHub Desktop.

Kernel Debug Tricks of the trade

NOTE: Updated on 3/11/2023.

Introduction

Debugging the kernel is a difficult task. Thankfully there are quite a few built-in kernel debugging tools and techniques that help one quickly identify the source and/or the cause of an issue.

Enable debug information

The CONFIG_DEBUG_INFO option is often default-disabled in the kernel. Enabling it is required if one has to translate kernel code section addresses to line numbers, among other things. It is best to enable it if you are actively looking at stack traces or other debug information related to an issue.

This option can only be indirectly enabled by enabling CONFIG_DEBUG_INFO_DWARF5

Once this is enabled, tools like gdb and addr2line can translate addresses into line numbers in the code, as well decipher data structures in kernel memory.

Example:

addr2line -e vmlinux <Kernel Text Address>

Disable ASLR

A feature KASLR randomizes where the kernel is actually loaded into memory. This presents a major challenge for debugging tools which require the offset that the kernel was loaded at, to make sense of the debug information.

This can be disable by setting CONFIG_RANDOMIZE_BASE to n.

This may only be useful during live gdb debugging. However, tools like crash can figure the base address, out from the core dump itself. Still probably just disable it anyway to be safe.

Dump traces to console on panic

Boot options trace_event= and ftrace_dump_on_oops can be used to enable tracing and dump the traces to the console.log whenever the kernel panics or crashes.

Example:

trace_event=sched:sched_kthread_stop,sched:sched_switch ftrace_dump_on_oops trace_buf_size=300K

Note that quotes are not required around the trace events.

Panic the kernel when an issue happens

Dumping of traces to the console can be triggered when the kernel panics. So it follows that, so it follows that one way to trigger a console trace dump is to panic the kernel when an issue is detected, and look at the traces just before the panic happened.

This is typically triggered by adding panic() to the troublesome spot. However, if there is already a warning of RCU stall during the issue, then the following boot options can be passed to trigger a console trace dump:

Example:

panic_on_warn=1
sysctl.kernel.panic_on_rcu_stall=1
sysctl.kernel.max_rcu_stall_to_panic=1

Also to avoid tracing the system during panic, you can add the following to that, to stop tracing when panic happens:

sysctl kernel.traceoff_on_warning=1

RCU stall timeouts

There is already in-depth documentation on how to detect RCU stalls, which will avoid duplicating here.

However, it is worthwhile to note that the CONFIG_RCU_CPU_STALL_TIMEOUT config option can be set to a number smaller than the default timeout of 20 seconds, to trigger RCU stalls faster. On top of that, the rcutorture kernel module can itself be used to trigger synthetic RCU stalls, which are useful to study RCU stall behavior and logs.

A full example

Since the author is also an RCU reviewer, he often runs rcutorture tests in Qemu.

The following is a test command with debugging-related kernel command line and config options to debug a TREE07 RCU stall issue using the rcutorture scripts:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "100*TREE07" 
  --bootargs "trace_event=sched:sched_kthread_stop,sched:sched_kthread_stop_ret,sched:sched_switch,
              sched:sched_waking,sched:sched_wakeup,rcu:rcu_callback,rcu:rcu_fqs,rcu:rcu_grace_period
              ftrace_dump_on_oops panic_on_warn=1 sysctl.kernel.panic_on_rcu_stall=1
              sysctl.kernel.max_rcu_stall_to_panic=1 traceoff_on_warning=1 trace_buf_size=300K"
  --kconfig "CONFIG_RCU_TRACE=y CONFIG_DEBUG_INFO_DWARF5=y CONFIG_RANDOMIZE_BASE=n
             CONFIG_RCU_CPU_STALL_TIMEOUT=5"

Note: Ensure the boot and kconfig arguments are passed in a single line. In the above, newlines have been added to keep it readable.

Shot-gun debugging

This kind of debugging was inspired by Steven Rostedt. The idea is sprinkle printk() or trace_printk() all over the code being debugged, by copying and pasting.

Example:

printk("%s: (%s) (%d)", __func__, __FILE__, __LINE__);
// or
trace_printk("We are here: (%s) (%d)", __FILE__, __LINE__);

Note that trace_printk already emits the function name in the beginning of the line, so there is no need to pass __func__ to it, like we have to for printk().

Dumping the stack to understand the call path better

The function trace_dump_stack(0) may be called any where in the kernel, to dump the call stack to the trace buffer.

For example, the following diff was useful when I had to figure out where uninterruptible sleeps (D-states) were coming in the kernel:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2bb0cd4225d5..d79bc3b91818a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6390,6 +6390,8 @@ pick_next_task(struct rq *rq, struct task_struct* prev, struct rq_flags *rf)
*
* WARNING: must be called with preemption disabled!
*/
+static int rcu_tort_ds;
+
static void __sched notrace __schedule(unsigned int sched_mode)
{
struct task_struct *prev, *next;
@@ -6452,6 +6454,11 @@ static void __sched notrace __schedule(unsigned int sched_mode)
  if (prev->sched_contributes_to_load)
       rq->nr_uninterruptible++;
+ // Dump stack for uninterruptible every 10 stacks.
+   if (prev_state & TASK_UNINTERRUPTIBLE && ((rcu_tort_ds++ % 10) == 0)) {
+       trace_dump_stack(0);
+ }

Figuring out where kernel crashes happen in C code

Often a stack presented requires some reverse engineering to figure where in C code a crash happens.

Example, while trying to debug a crash in hrtimer where seemingly, the CPU tripped on an instruction that is a null pointer. Huh, did I correct the text section? RIP format is <CS>:<Address>. And the address clearly showing 0:

[   15.450607] BUG: kernel NULL pointer dereference, address: 0000000000000000            
[   15.454936] #PF: supervisor instruction fetch in kernel mode
[   15.458465] #PF: error_code(0x0010) - not-present page                                                                 [   15.461655] PGD 0 P4D 0                                                                                               [   15.463319] Oops: 0010 [#1] PREEMPT SMP PTI                                                                           [   15.465949] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.1.0-00008-g621948fff66f #54
[   15.476520] RIP: 0010:0x0
[   15.478220] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[   15.482218] RSP: 0018:ffffc900000f8f00 EFLAGS: 00010082
[   15.485445] RAX: 0000000080010001 RBX: 0000000000000000 RCX: 0000000000000000
[   15.489835] RDX: 0000000000010002 RSI: 0000000000000046 RDI: ffff88807d71e440
[   15.494210] RBP: ffff88807d71df00 R08: ffff88807d71df40 R09: 0000000000000000
[   15.498594] R10: 0000000000000000 R11: ffffc900000f8ff8 R12: 0000000000000046
[   15.502979] R13: ffff88807d71df60 R14: ffff88807d71e440 R15: ffff88807d71df40
[   15.507416] FS:  0000000000000000(0000) GS:ffff88807d700000(0000)
[   15.512365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   15.515945] CR2: ffffffffffffffd6 CR3: 0000000006928000 CR4: 00000000000006e0
[   15.520346] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   15.524729] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   15.529117] Call Trace:
[   15.530725]  <IRQ>
[   15.532094]  __hrtimer_run_queues+0x10d/0x2a0
[   15.534900]  hrtimer_run_queues+0xcf/0xf0   
[   15.537423]  update_process_times+0x33/0xb0 
[   15.540036]  tick_sched_handle+0x2f/0x40    
[   15.542482]  tick_nohz_handler+0x62/0xa0    
[   15.544942]  __sysvec_apic_timer_interrupt+0x7a/0x160
[   15.548141]  sysvec_apic_timer_interrupt+0x85/0xb0   
[   15.551131]  </IRQ>                                  
[   15.552531]  <TASK>                                  

Ok, so first enable DWARF debug info in kernel binary which maps C code's line numbers to addresses:

CONFIG_RANDOMIZE_BASE=y  # not strictly needed as we're working with offsets.
CONFIG_DEBUG_INFO_DWARF5=y

And reproduce the crash. As we see above, the crash happens somewhere in __hrtimer_run_queue function, at offset 0x10d from its beginning. Dump objdump -S hrtimer.o:

Now, add 0x10d to the first address of __hrtimer_run_queues, this amounts to (in my case) 0x5ad, that takes us to some instructions interleaved with C code as follows:

...

     583:       41 8b 4f 08             mov    0x8(%r15),%ecx
     587:       b8 fe ff ff ff          mov    $0xfffffffe,%eax
     58c:       d3 c0                   rol    %cl,%eax
     58e:       21 43 08                and    %eax,0x8(%rbx)
        raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
     591:       4c 89 e6                mov    %r12,%rsi
     594:       48 89 ef                mov    %rbp,%rdi
        fn = timer->function;
     597:       49 8b 5e 28             mov    0x28(%r14),%rbx
        raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
     59b:       e8 00 00 00 00          call   5a0 <__hrtimer_run_queues+0x100>
     5a0:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
        restart = fn(timer);
     5a5:       4c 89 f7                mov    %r14,%rdi
     5a8:       e8 00 00 00 00          call   5ad <__hrtimer_run_queues+0x10d>
     5ad:       89 c3                   mov    %eax,%ebx

This tells me it must have happened due to fn(timer). Somehow I corrupted the function pointer corresponding to the timer callback. Bingo!

I then re-run the code with an assert BUG_ON(!fn) set, which confirms that that is indeed the case.

Figuring out where kernel crashes happen in C code using offline GDB

It is possible to use offline GDB to figure out which line of C code crashed, or which line of C code does a particular symbol+offset/size entry in a call stack corresponds to. The address of functions in a vmlinux binary are different from the addresses of where the kernel might be loaded in physical memory. This is because of KASLR which can be turned off using CONFIG_RANDOMIZE_BASE, however it is better to not have to depend on turning off (which we don't need to depend on luckily, for offline debugging). Keeping the debug methodology independent of CONFIG_RANDOMIZE_BASE also means that random kernel crashes in the field on kernels with this config option enabled can be analyzed.

For a demo of the methodology, lets crash the kernel via sysrq-trigger:

root@(none):/# echo c > /proc/sysrq-trigger 
[  123.209323] sysrq: Trigger a crash                                               
[  123.211665] Kernel panic - not syncing: sysrq triggered crash                    
[  123.215186] CPU: 2 PID: 94 Comm: bash Not tainted 6.3.0-rc1-00006-g3b8ea6d0481b #106
[  123.219828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[  123.225578] Call Trace:
[  123.227367]  <TASK>
[  123.228762]  dump_stack_lvl+0x36/0x50                                            
[  123.231214]  panic+0x33a/0x340                                                   
[  123.233203]  ? _printk+0x5f/0x80  
[  123.235265]  sysrq_handle_crash+0x15/0x20                    
[  123.237792]  __handle_sysrq+0xa5/0x180                                                                                                           
[  123.240119]  write_sysrq_trigger+0x23/0x40                                                                                                     
[  123.242625]  proc_reg_write+0x54/0xa0
[  123.244940]  vfs_write+0xc6/0x3e0
[  123.247046]  ksys_write+0x6a/0xf0    
[  123.249113]  do_syscall_64+0x3e/0x90
[  123.251428]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  123.254541] RIP: 0033:0x7f92273a5274                                             
[  123.256757] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 [..]
[  123.267749] RSP: 002b:00007ffff3889ef8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  123.272318] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f92273a5274
[  123.276621] RDX: 0000000000000002 RSI: 000055ca91d7c7b0 RDI: 0000000000000001
[  123.280876] RBP: 000055ca91d7c7b0 R08: 000000000000000a R09: 00007f9227434e80
[  123.285154] R10: 000000000000000a R11: 0000000000000246 R12: 00007f9227476760
[  123.289429] R13: 0000000000000002 R14: 00007f9227471760 R15: 0000000000000002
[  123.293692]  </TASK>                
[  123.296646] Kernel Offset: 0xaa00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)                                               
[  123.303003] Rebooting in 1 seconds..                                                                                                                                 
SeaBIOS (version 1.16.0-debian-1.16.0-5)                                     

Run GDB offline in the kernel source root on vmlinux:

$ gdb ./vmlinux 
GNU gdb (Debian 12.1-4) 12.1
[...]
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./vmlinux...
(gdb) 

Say we want to find out which line does the panic+0x33a/0x340 in the stack corresponds to.

First let us get the absolute address of panic+0x33a. 0x33a is 826 in decimal.

First run disass panic, and keep hitting enter till you see +826:

(gdb) disass panic
[...]
   0xffffffff810849b4 <+804>:   call   0xffffffff81060b00 <arch_trigger_cpumask_backtrace>
   0xffffffff810849b9 <+809>:   jmp    0xffffffff81084974 <panic+740>
   0xffffffff810849bb <+811>:   movb   $0x0,-0x1(%rdx,%rax,1)
   0xffffffff810849c0 <+816>:   jmp    0xffffffff81084785 <panic+245>
   0xffffffff810849c5 <+821>:   call   0xffffffff81e258f0 <dump_stack>
   0xffffffff810849ca <+826>:   jmp    0xffffffff810847b0 <panic+288>

The address starting with 0x beside the +826 is 0xffffffff810849ca which is the adadress of panic+0x33a.

Now let us see what line in C code does that panic+033a correspond to. For that, run the li command passing it *ADDRESS. The star before the address is important.

(gdb) li *0xffffffff810849ca
0xffffffff810849ca is in panic (kernel/panic.c:340).
335     #ifdef CONFIG_DEBUG_BUGVERBOSE
336             /*
337              * Avoid nested stack-dumping if a panic occurs during oops processing
338              */
339             if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
340                     dump_stack();
341     #endif

gdb tells you on the first line that 0xffffffff810849ca is in panic (kernel/panic.c:340). Voila!

NOTE: For seeing the opcodes corresponding to the disassembly, you can pass /r switch to disass.

NOTE: The /m switch to disass will show you lines of C code interspersed with the assembly, however I found that with optimized code, this interspering can be messy and hard to read. objdump -S is another option to do this, however it may be equally messy.

Create core-dumps from Qemu and analyzing them with crash

WIP

virsh could create coredumps similar to what kdump/kexec does https://wiki.ubuntu.com/DebuggingKernelWithQEMU

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment