NOTE: Updated on 3/11/2023.
Debugging the kernel is a difficult task. Thankfully there are quite a few built-in kernel debugging tools and techniques that help one quickly identify the source and/or the cause of an issue.
The CONFIG_DEBUG_INFO option is often disabled by default in the kernel. Enabling it is required to translate kernel code section addresses to line numbers, among other things. It is best to enable it if you are actively looking at stack traces or other debug information related to an issue.
This option can only be enabled indirectly, by enabling CONFIG_DEBUG_INFO_DWARF5. Once this is enabled, tools like gdb and addr2line can translate addresses into line numbers in the code, as well as decipher data structures in kernel memory.
Example:
addr2line -e vmlinux <Kernel Text Address>
KASLR randomizes where the kernel is actually loaded into memory. This presents a major challenge for debugging tools, which need to know the offset the kernel was loaded at in order to make sense of the debug information.
It can be disabled by setting CONFIG_RANDOMIZE_BASE
to n
.
This may only be needed during live gdb debugging; tools like crash can figure out the base address from the core dump itself. Still, it is probably safest to just disable it anyway.
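If rebuilding with CONFIG_RANDOMIZE_BASE=n is inconvenient, the same effect can be had for a single boot with the standard nokaslr kernel command line parameter:

```
# Kernel command line fragment: disable KASLR for this boot only.
nokaslr
```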
Boot options trace_event=
and ftrace_dump_on_oops
can be used to enable
tracing and dump the traces to the console log whenever the kernel panics or
crashes.
Example:
trace_event=sched:sched_kthread_stop,sched:sched_switch ftrace_dump_on_oops trace_buf_size=300K
Note that quotes are not required around the trace events.
Dumping of traces to the console can be triggered when the kernel panics. So it follows that one way to trigger a console trace dump is to panic the kernel when an issue is detected, and look at the traces just before the panic happened.
This is typically done by adding panic()
to the troublesome spot.
However, if there is already a warning or an RCU stall during the issue, then the
following boot options can be passed to trigger a console trace dump:
Example:
panic_on_warn=1
sysctl.kernel.panic_on_rcu_stall=1
sysctl.kernel.max_rcu_stall_to_panic=1
Also, to avoid the interesting traces being overwritten by panic-time activity, you can stop tracing when the warning happens:
sysctl kernel.traceoff_on_warning=1
There is already in-depth documentation on how to detect RCU stalls, which I will avoid duplicating here.
However, it is worthwhile to note that the CONFIG_RCU_CPU_STALL_TIMEOUT
config option can be set to a number smaller than the default timeout of 20
seconds, to trigger RCU stalls faster. On top of that, the rcutorture
kernel
module can itself be used to trigger synthetic RCU stalls, which are useful to
study RCU stall behavior and logs.
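For instance, a synthetic stall can be requested via rcutorture's stall_cpu parameter, which stalls a CPU for the given number of seconds (verify the parameter name against your tree):

```
# Kernel command line fragment (rcutorture built-in): stall a CPU for 30s.
rcutorture.stall_cpu=30

# Or, when built as a module (hypothetical invocation):
# modprobe rcutorture stall_cpu=30
```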
Since the author is also an RCU reviewer, he often runs rcutorture
tests in QEMU.
The following is a test command with debugging-related kernel command line and
config options to debug a TREE07
RCU stall issue using the rcutorture
scripts:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 5 --configs "100*TREE07"
--bootargs "trace_event=sched:sched_kthread_stop,sched:sched_kthread_stop_ret,sched:sched_switch,
sched:sched_waking,sched:sched_wakeup,rcu:rcu_callback,rcu:rcu_fqs,rcu:rcu_grace_period
ftrace_dump_on_oops panic_on_warn=1 sysctl.kernel.panic_on_rcu_stall=1
sysctl.kernel.max_rcu_stall_to_panic=1 traceoff_on_warning=1 trace_buf_size=300K"
--kconfig "CONFIG_RCU_TRACE=y CONFIG_DEBUG_INFO_DWARF5=y CONFIG_RANDOMIZE_BASE=n
CONFIG_RCU_CPU_STALL_TIMEOUT=5"
Note: Ensure the boot and kconfig arguments are passed in a single line. In the above, newlines have been added to keep it readable.
This kind of debugging was inspired by Steven Rostedt. The idea is to sprinkle
printk()
or trace_printk()
calls all over the code being debugged, by copying and
pasting.
Example:
printk("%s: (%s) (%d)\n", __func__, __FILE__, __LINE__);
// or
trace_printk("We are here: (%s) (%d)\n", __FILE__, __LINE__);
Note that
trace_printk
already emits the function name at the beginning of the line, so there is no need to pass __func__
to it, as we have to for printk()
.
The function trace_dump_stack(0)
may be called anywhere in the kernel, to
dump the call stack to the trace buffer.
For example, the following diff was useful when I had to figure out where uninterruptible sleeps (D-states) were coming from in the kernel:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c2bb0cd4225d5..d79bc3b91818a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6390,6 +6390,8 @@ pick_next_task(struct rq *rq, struct task_struct* prev, struct rq_flags *rf)
*
* WARNING: must be called with preemption disabled!
*/
+static int rcu_tort_ds;
+
static void __sched notrace __schedule(unsigned int sched_mode)
{
struct task_struct *prev, *next;
@@ -6452,6 +6454,11 @@ static void __sched notrace __schedule(unsigned int sched_mode)
if (prev->sched_contributes_to_load)
rq->nr_uninterruptible++;
+ // Dump stack for uninterruptible every 10 stacks.
+ if (prev_state & TASK_UNINTERRUPTIBLE && ((rcu_tort_ds++ % 10) == 0)) {
+ trace_dump_stack(0);
+ }
Often a stack trace requires some reverse engineering to figure out where in the C code a crash happened.
For example, while trying to debug a crash in hrtimer, it seemed the CPU tripped on an instruction at a NULL pointer. Huh, did I corrupt the text section? The RIP
format is <CS>:<Address>
, and the address clearly shows 0:
[ 15.450607] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 15.454936] #PF: supervisor instruction fetch in kernel mode
[ 15.458465] #PF: error_code(0x0010) - not-present page
[ 15.461655] PGD 0 P4D 0
[ 15.463319] Oops: 0010 [#1] PREEMPT SMP PTI
[ 15.465949] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 6.1.0-00008-g621948fff66f #54
[ 15.476520] RIP: 0010:0x0
[ 15.478220] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 15.482218] RSP: 0018:ffffc900000f8f00 EFLAGS: 00010082
[ 15.485445] RAX: 0000000080010001 RBX: 0000000000000000 RCX: 0000000000000000
[ 15.489835] RDX: 0000000000010002 RSI: 0000000000000046 RDI: ffff88807d71e440
[ 15.494210] RBP: ffff88807d71df00 R08: ffff88807d71df40 R09: 0000000000000000
[ 15.498594] R10: 0000000000000000 R11: ffffc900000f8ff8 R12: 0000000000000046
[ 15.502979] R13: ffff88807d71df60 R14: ffff88807d71e440 R15: ffff88807d71df40
[ 15.507416] FS: 0000000000000000(0000) GS:ffff88807d700000(0000)
[ 15.512365] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 15.515945] CR2: ffffffffffffffd6 CR3: 0000000006928000 CR4: 00000000000006e0
[ 15.520346] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 15.524729] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 15.529117] Call Trace:
[ 15.530725] <IRQ>
[ 15.532094] __hrtimer_run_queues+0x10d/0x2a0
[ 15.534900] hrtimer_run_queues+0xcf/0xf0
[ 15.537423] update_process_times+0x33/0xb0
[ 15.540036] tick_sched_handle+0x2f/0x40
[ 15.542482] tick_nohz_handler+0x62/0xa0
[ 15.544942] __sysvec_apic_timer_interrupt+0x7a/0x160
[ 15.548141] sysvec_apic_timer_interrupt+0x85/0xb0
[ 15.551131] </IRQ>
[ 15.552531] <TASK>
Ok, so first enable DWARF debug info in the kernel binary, which maps addresses to C code line numbers:
CONFIG_RANDOMIZE_BASE=y # not strictly needed as we're working with offsets.
CONFIG_DEBUG_INFO_DWARF5=y
And reproduce the crash. As we see above, the crash happens somewhere in the __hrtimer_run_queues
function, at offset 0x10d
from its beginning.
Dump the disassembly with objdump -S hrtimer.o
:
Now, add 0x10d
to the first address of __hrtimer_run_queues
; this amounts to (in my case) 0x5ad
, which takes us to some instructions interleaved with C code as follows:
...
583: 41 8b 4f 08 mov 0x8(%r15),%ecx
587: b8 fe ff ff ff mov $0xfffffffe,%eax
58c: d3 c0 rol %cl,%eax
58e: 21 43 08 and %eax,0x8(%rbx)
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
591: 4c 89 e6 mov %r12,%rsi
594: 48 89 ef mov %rbp,%rdi
fn = timer->function;
597: 49 8b 5e 28 mov 0x28(%r14),%rbx
raw_spin_unlock_irqrestore(&cpu_base->lock, flags);
59b: e8 00 00 00 00 call 5a0 <__hrtimer_run_queues+0x100>
5a0: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
restart = fn(timer);
5a5: 4c 89 f7 mov %r14,%rdi
5a8: e8 00 00 00 00 call 5ad <__hrtimer_run_queues+0x10d>
5ad: 89 c3 mov %eax,%ebx
This tells me it must have happened due to fn(timer)
: somehow I corrupted the function pointer corresponding to the timer callback. Bingo!
I then re-ran the code with an assert, BUG_ON(!fn)
, which confirmed that this was indeed the case.
It is possible to use offline GDB to figure out which line of C code crashed, or which line of C code a particular symbol+offset/size
entry in a call stack corresponds to. The addresses of functions in a vmlinux
binary are different from the addresses where the kernel might be loaded in physical memory. This is because of KASLR
, which can be turned off using CONFIG_RANDOMIZE_BASE
; however, it is better not to have to depend on turning it off (and luckily, for offline debugging, we don't). Keeping the debug methodology independent of CONFIG_RANDOMIZE_BASE
also means that random kernel crashes in the field, on kernels with this config option enabled, can be analyzed.
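As an aside, when a panic log prints a "Kernel Offset:" line, any runtime text address can be mapped back to its vmlinux address by subtracting that offset. A quick sketch with made-up addresses (only the low 32 bits are shown, since the upper bits are unaffected):

```shell
# vmlinux address = runtime address - KASLR offset (illustrative values).
printf '0x%x\n' $(( 0x8ba849ca - 0x0aa00000 ))
# prints 0x810849ca
```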
For a demo of the methodology, let's crash the kernel via sysrq-trigger:
root@(none):/# echo c > /proc/sysrq-trigger
[ 123.209323] sysrq: Trigger a crash
[ 123.211665] Kernel panic - not syncing: sysrq triggered crash
[ 123.215186] CPU: 2 PID: 94 Comm: bash Not tainted 6.3.0-rc1-00006-g3b8ea6d0481b #106
[ 123.219828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-5 04/01/2014
[ 123.225578] Call Trace:
[ 123.227367] <TASK>
[ 123.228762] dump_stack_lvl+0x36/0x50
[ 123.231214] panic+0x33a/0x340
[ 123.233203] ? _printk+0x5f/0x80
[ 123.235265] sysrq_handle_crash+0x15/0x20
[ 123.237792] __handle_sysrq+0xa5/0x180
[ 123.240119] write_sysrq_trigger+0x23/0x40
[ 123.242625] proc_reg_write+0x54/0xa0
[ 123.244940] vfs_write+0xc6/0x3e0
[ 123.247046] ksys_write+0x6a/0xf0
[ 123.249113] do_syscall_64+0x3e/0x90
[ 123.251428] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 123.254541] RIP: 0033:0x7f92273a5274
[ 123.256757] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 [..]
[ 123.267749] RSP: 002b:00007ffff3889ef8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 123.272318] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f92273a5274
[ 123.276621] RDX: 0000000000000002 RSI: 000055ca91d7c7b0 RDI: 0000000000000001
[ 123.280876] RBP: 000055ca91d7c7b0 R08: 000000000000000a R09: 00007f9227434e80
[ 123.285154] R10: 000000000000000a R11: 0000000000000246 R12: 00007f9227476760
[ 123.289429] R13: 0000000000000002 R14: 00007f9227471760 R15: 0000000000000002
[ 123.293692] </TASK>
[ 123.296646] Kernel Offset: 0xaa00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 123.303003] Rebooting in 1 seconds..
SeaBIOS (version 1.16.0-debian-1.16.0-5)
Run GDB offline in the kernel source root on vmlinux:
$ gdb ./vmlinux
GNU gdb (Debian 12.1-4) 12.1
[...]
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./vmlinux...
(gdb)
Say we want to find out which line the panic+0x33a/0x340
entry in the stack corresponds to.
First, let us get the absolute address of panic+0x33a
. 0x33a
is 826
in decimal.
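The hex-to-decimal conversion can be done from the shell (gdb's own p/d command works too):

```shell
# Convert the 0x33a offset to decimal.
printf '%d\n' 0x33a
# prints 826
```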
Now run disass panic
, and keep hitting enter till you see +826
:
(gdb) disass panic
[...]
0xffffffff810849b4 <+804>: call 0xffffffff81060b00 <arch_trigger_cpumask_backtrace>
0xffffffff810849b9 <+809>: jmp 0xffffffff81084974 <panic+740>
0xffffffff810849bb <+811>: movb $0x0,-0x1(%rdx,%rax,1)
0xffffffff810849c0 <+816>: jmp 0xffffffff81084785 <panic+245>
0xffffffff810849c5 <+821>: call 0xffffffff81e258f0 <dump_stack>
0xffffffff810849ca <+826>: jmp 0xffffffff810847b0 <panic+288>
The address starting with 0x
beside the +826
is 0xffffffff810849ca
, which is the address of panic+0x33a
.
Now let us see which line in the C code that panic+0x33a
corresponds to. For that, run the li
(list) command, passing it *ADDRESS
. The star before the address is important.
(gdb) li *0xffffffff810849ca
0xffffffff810849ca is in panic (kernel/panic.c:340).
335 #ifdef CONFIG_DEBUG_BUGVERBOSE
336 /*
337 * Avoid nested stack-dumping if a panic occurs during oops processing
338 */
339 if (!test_taint(TAINT_DIE) && oops_in_progress <= 1)
340 dump_stack();
341 #endif
gdb tells you on the first line that 0xffffffff810849ca is in panic (kernel/panic.c:340)
. Voila!
NOTE: To see the opcodes corresponding to the disassembly, you can pass the
/r
switch to disass
.
NOTE: The
/m
switch to disass will show you lines of C code interspersed with the assembly; however, I found that with optimized code, this interspersing can be messy and hard to read. objdump -S
is another option for this, but it may be equally messy.
WIP
virsh can create core dumps similar to what kdump/kexec does: https://wiki.ubuntu.com/DebuggingKernelWithQEMU