This post shows an example of a BEAM process' core dump analysis to backtrack what was the root cause of its crash.
Start gdb by pointing out the BEAM executable that generated the core dump and the core dump file. Optionally, you can provide a directory with the VM's source code:
gdb erts-6.3/bin/beam.smp -core core.14747 -d /home/azureuser/otp_src_17.4/erts/emulator
gdb should tell you what caused the crash and where in code this happened:
Program terminated with signal 11, Segmentation fault.
#0 do_minor (nobj=2, objv=0x7f209849e5c0, new_sz=233, p=0x7f20a36010d0) at beam/erl_gc.c:1095
1095 val = *ptr;
In this case it looks like an attempt to dereference a pointer in the garbage collection algorithm was a problem, so probably some corrupted term was present on the process' heap.
You can use backtrace
command to see the full call stack:
(gdb) backtrace
#0 do_minor (nobj=2, objv=0x7f209849e5c0, new_sz=233, p=0x7f20a36010d0) at beam/erl_gc.c:1095
#1 minor_collection (recl=<synthetic pointer>, nobj=2, objv=0x7f209849e5c0, need=7, p=0x7f20a36010d0) at beam/erl_gc.c:876
#2 erts_garbage_collect (p=0x7f20a36010d0, need=7, objv=0x7f209849e5c0, nobj=2) at beam/erl_gc.c:450
#3 0x00000000005669bb in process_main () at x86_64-unknown-linux-gnu/opt/smp/beam_hot.h:47
#4 0x000000000049c4b2 in sched_thread_func (vesdp=0x7f209b560a80) at beam/erl_process.c:7743
#5 0x00000000005d5fc5 in thr_wrapper (vtwd=0x7fffa342fff0) at pthread/ethread.c:106
#6 0x00007f20a6ee1df3 in start_thread (arg=0x7f20863b0700) at pthread_create.c:308
#7 0x00007f20a6a071ad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
This is not very helpful, though. We can use the gdb scripts bundled into OTP sources to learn much more from the core dump from the Erlang VM's point of view.
Let's load the macros:
(gdb) source /home/azureuser/otp_src_17.4/erts/etc/unix/etp-commands.in
gdb should then print some basic information about the VM and architecture.
And from now on you can use all the macros from the above file to analyze the core dump.
Let's first try to find out what address was dereferenced:
(gdb) print ptr
$3 = <optimized out>
After looking into erl_gc.c
lines 1093-1095 link:
case TAG_PRIMARY_BOXED: {
ptr = boxed_val(gval);
val = *ptr;
How about trying this:
(gdb) print gval
$4 = 139760485281282
So the actual term on the heap was 0x7F1C86142E02
which is a 0x7F1C86142E00
pointer tagged as a "boxed" term and might point to a tuple, bigint etc.
Now, let's fetch some basic information about the Erlang process (p
is an argument to do_minor
):
(gdb) etp-process-info p
Pid: <0.2845.0>
State: trapping-exit | garbage-collecting | running | active | prq-prio-normal | usr-prio-normal | act-prio-normal
CP: #Cp<io_lib_format:collect/2+0x90>
I: #Cp<erlang:port_info/1+0x78>
Heap size: 233
Old-heap size: 376
Mbuf size: 0
Msgq len: 1 (inner=1, outer=0)
Parent: <0.2759.0>
Pointer: (Process *) 0x7f20a36010d0
(gdb) etp-stacktrace p
% Stacktrace (27): #Cp<io_lib_format:collect/2+0x90>.
#Cp<io_lib_format:collect/2+0x150>.
#Cp<io_lib_format:fwrite/2+0x118>.
#Cp<io_lib:format/2+0x80>.
#Cp<wombat_plugin_code_tracer:handle_info/2+0x318>.
#Cp<wombat_plugin:handle_info/2+0xc78>.
#Cp<gen_server:try_dispatch/4+0xa0>.
#Cp<gen_server:handle_msg/5+0x6b8>.
#Cp<proc_lib:init_p_do_apply/3+0x58>.
#Cp<terminate process normally>.
It looks like it's a wombat_plugin_code_tracer
process whose heap had been damaged.
Let's take a closer look at the process' stack:
(gdb) etp-stackdump p
% Stackdump (27): #Cp<io_lib_format:collect/2+0x90>.
#Cp<io_lib_format:collect/2+0x150>.
[].
#Cp<io_lib_format:fwrite/2+0x118>.
32.
#Cp<io_lib:format/2+0x80>.
[].
#Cp<wombat_plugin_code_tracer:handle_info/2+0x318>.
#1:[[{timeout,115},{port_op,proc_sig}],[Cannot access memory at address 0x7f1c86142e00
Oops, it looks like a corrupted term is present on the stack and moreover it matches the address we are looking for! Let's use a macro that doesn't try to pretty-print the terms to see the full process' stack:
(gdb) etpf-stackdump p
% Stackdump (27): #Cp<io_lib_format:collect/2+0x90>.
#Cp<io_lib_format:collect/2+0x150>.
[].
#Cp<io_lib_format:fwrite/2+0x118>.
32.
#Cp<io_lib:format/2+0x80>.
[].
#Cp<wombat_plugin_code_tracer:handle_info/2+0x318>.
<etpf-cons 0x7cfa8121>. <-- *this is corrupted*
<etpf-cons 0x6cd4ef89>.
#Catch<364>.
#Cp<wombat_plugin:handle_info/2+0xc78>.
<etpf-cons 0x7cfa8131>.
[].
<etpf-boxed 0x7d1770ca>.
#Cp<gen_server:try_dispatch/4+0xa0>.
<etpf-boxed 0x7d1770fa>.
#Catch<1386>.
#Cp<gen_server:handle_msg/5+0x6b8>.
#Catch<174>.
#Cp<proc_lib:init_p_do_apply/3+0x58>.
wombat_plugin.
<etpf-boxed 0x7d1770fa>.
<0.2845.0>.
<0.2759.0>.
<etpf-boxed 0x7d17714a>.
#Cp<terminate process normally>.
#Catch<126>.
When an Erlang function realizes that it will call another function (in a non tail-recursive way) it allocates some space on stack (by calling the allocate
opcode) in order to migrate terms that need to survive this call it also places the current value of the Continuation Pointer (a next instruction after returning from a function) on top of the stack, because it will need to be replaced, so that the called function knows where is its return point.
So the above stack dump (just the relevant part) in a more convenient form looks as follows:
+--------------------------------------------------------------------------------------------+
|Stack |Ref|Owner |
+--------------------------------------------------------------------------------------------+
|#Cp<io_lib_format:collect/2+0x150>. |CP |#Cp<io_lib_format:collect/2+0x90>. |
|[]. |y0 | |
+--------------------------------------------------------------------------------------------+
|#Cp<io_lib_format:fwrite/2+0x118>. |CP |#Cp<io_lib_format:collect/2+0x150>.|
|32. |y0 | |
+--------------------------------------------------------------------------------------------+
|#Cp<io_lib:format/2+0x80>. |CP |#Cp<io_lib_format:fwrite/2+0x118>. |
|[]. |y0 | |
+--------------------------------------------------------------------------------------------+
|#Cp<wombat_plugin_code_tracer:handle_info/2+0x318>. |CP |#Cp<io_lib:format/2+0x80>. |
|<etpf-cons 0x7cfa8121>. |y0 | |
|<etpf-cons 0x6cd4ef89>. |y1 | |
|#Catch<364>. |y2 | |
+--------------------------------------------------------------------------------------------+
In order to find out what's the corrupted term we need to dive into the assembly code of the io_lib:format/2
function and see what resides in the {y,0}
stack element.
And here's how the assembly of the io_lib
module looks like:
{function, format, 2, 8}.
{label,7}.
{line,[{location,"io_lib.erl",151}]}.
{func_info,{atom,io_lib},{atom,format},2}.
{label,8}.
{allocate,3,2}.
{'catch',{y,2},{f,9}}.
{move,{x,1},{y,0}}.
{move,{x,0},{y,1}}.
{line,[{location,"io_lib.erl",152}]}.
{call_ext,2,{extfunc,io_lib_format,fwrite,2}}.
{label,9}.
{catch_end,{y,2}}.
{test,is_tuple,{f,10},[{x,0}]}.
{test,test_arity,{f,10},[{x,0},2]}.
{get_tuple_element,{x,0},0,{x,1}}.
{test,is_eq_exact,{f,10},[{x,1},{atom,'EXIT'}]}.
{test_heap,4,0}.
{put_list,{y,0},nil,{x,0}}.
{put_list,{y,1},{x,0},{x,1}}.
{move,{atom,badarg},{x,0}}.
{line,[{location,"io_lib.erl",154}]}.
{call_ext,2,{extfunc,erlang,error,2}}.
{label,10}.
{deallocate,3}.
return.
Since the io_lib:fwrite/2
function was still executing (from stacktrace) we can tell that the {y,0}
stack slot is occupied by the format/2
second argument ({x,1}
).
The function was called by wombat_plugin_code_tracer:handle_info/2
, like this:
io_lib:format(" ~p subject info ~p", [Info, subject_info(Subject)])];
We just don't know yet where exactly it was called. But recall how the corrupted term looked like:
[[{timeout,115},{port_op,proc_sig}],[Cannot access memory at address 0x7f1c86142e00
Such term can be generated only by a port long schedule notification, thus the corrupted term is a result of a call to subject_info/1
.
In this case it's a result of a call to erlang:port_info/1
.
(See code of wombat_plugin_code_tracer
for more details. I was using Wombat 1.12.1).
Luckily, a race condition in this function has recently been fixed. Right now I can't confirm that it solves my problem, though ;)
Special thanks go to Lukas Larsson and Simon Zelazny.
Thanks for this! One thing that I noticed after following this gist verbatim, I think:
is supposed to be