- Recently I created some code which appeared to work perfectly when compiled with glibc but often failed with a seg fault when compiled with musl.
- I then tried to create minimal test code to reproduce the issue, but without luck.
- After much fiddling I noticed that the test code would seg fault, but only if it was run via gdb or strace!
- The same test code compiled with glibc never seg faults when run via gdb or strace.
- So I'm hoping this is evidence enough to suggest an issue with musl and not glibc.
- At least it's a starting point... :-)
- Note: the musl mailing list thread discussing this can be found here [1]
[1] https://www.openwall.com/lists/musl/2020/01/27/1
- How the test code works:
- A thread is created: `pthread_create()` populates the global variable `my_pthread` with the new thread ID.
- The new thread then uses `my_pthread` in a call to `pthread_getattr_np()`.
- Assumption: you would expect `my_pthread` to be populated before the new thread starts executing.
- When compiled with glibc, `my_pthread` always appears to be populated before the new thread starts executing.
- When compiled with musl, `my_pthread` is sometimes not yet populated when the new thread starts executing, which causes `pthread_getattr_np()` to seg fault.
- Note: `my_pthread` is pre-initialized to ensure that the test code attempts to run deterministically.
$ cat hello.c
#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>
#include <stdlib.h>

#ifdef __GLIBC__
static pthread_t my_pthread = 0;    /* glibc: pthread_t is an unsigned long */
#else
static pthread_t my_pthread = NULL; /* musl: pthread_t is a pointer */
#endif

void show_my_pthread(pthread_t * pthread_t_ptr, const char * function_name, const char * hint) {
#ifdef __GLIBC__
    printf("debug: my_pthread=%ld // glibc and %s()%s\n", *pthread_t_ptr, function_name, hint);
#else
    printf("debug: my_pthread=%p // musl and %s()%s\n", (void *)*pthread_t_ptr, function_name, hint);
#endif
    fflush(stdout);
}

void * my_thread_main(void * ptr) {
    show_my_pthread(&my_pthread, __FUNCTION__, " <- before pthread_getattr_np()");
    pthread_attr_t attr;
    pthread_getattr_np(my_pthread, &attr);
    show_my_pthread(&my_pthread, __FUNCTION__, " <- after pthread_getattr_np()");
    return NULL;
}

int main(int argc, char **argv) {
    show_my_pthread(&my_pthread, __FUNCTION__, " <-- before pthread_create()");
    pthread_create(&my_pthread, NULL, &my_thread_main, NULL);
    show_my_pthread(&my_pthread, __FUNCTION__, " <-- after pthread_create()");
    pthread_join(my_pthread, NULL);
    exit(0);
}
$ ./hello-glibc
debug: my_pthread=0 // glibc and main() <-- before pthread_create()
debug: my_pthread=139629671282432 // glibc and main() <-- after pthread_create()
debug: my_pthread=139629671282432 // glibc and my_thread_main() <- before pthread_getattr_np()
debug: my_pthread=139629671282432 // glibc and my_thread_main() <- after pthread_getattr_np()
$ ./hello-musl
debug: my_pthread=0 // musl and main() <-- before pthread_create()
debug: my_pthread=0x7f5276936f20 // musl and main() <-- after pthread_create()
debug: my_pthread=0x7f5276936f20 // musl and my_thread_main() <- before pthread_getattr_np()
debug: my_pthread=0x7f5276936f20 // musl and my_thread_main() <- after pthread_getattr_np()
- The `printf()` output clearly shows that `my_pthread` is unexpectedly zero when the new thread starts executing.
- This causes `pthread_getattr_np()` to seg fault, which is also what gdb reports.
$ /bin/gdb --silent -ex="set confirm off" --ex=run -ex=quit --args ./hello-musl
Reading symbols from ./hello-musl...
Starting program: /home/simon/20200124-musl-thread-race-condition/hello-musl
debug: my_pthread=0 // musl and main() <-- before pthread_create()
[New LWP 113431]
debug: my_pthread=0 // musl and my_thread_main() <- before pthread_getattr_np()
debug: my_pthread=0x7ffff7ffaf20 // musl and main() <-- after pthread_create()
Thread 2 "hello-musl" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 113431]
0x0000000000404c6c in pthread_getattr_np ()
- Running the test code via the musl static, glibc static, and glibc dynamic builds, without gdb or strace, 10,000 times each:
- Observations:
- All 3 builds running without gdb each run 10,000 times without a seg fault.
- Interestingly, musl static takes 6.3 seconds, glibc static takes 7.8 seconds, and glibc dynamic takes 10.1 seconds.
- Interestingly, musl static is 48,416 bytes, glibc static is 1,593,216 bytes, and glibc dynamic is 20,760 bytes.
$ rm -f hello-* ; musl-gcc -g -static -o hello-musl hello.c && ldd hello-musl ; wc --bytes hello-musl && time perl -e 'foreach(1..10000){ $out= `./hello-musl`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
not a dynamic executable
48416 hello-musl
- ran 10000 times without seg fault!
real 0m6.301s
user 0m4.762s
sys 0m3.213s
$ rm -f hello-* ; gcc -g -static -static-libgcc -static-libstdc++ -o hello-glibc hello.c -lpthread && ldd hello-glibc ; wc --bytes hello-glibc && time perl -e 'foreach(1..10000){ $out= `./hello-glibc`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
not a dynamic executable
1593216 hello-glibc
- ran 10000 times without seg fault!
real 0m7.776s
user 0m5.917s
sys 0m4.217s
$ rm -f hello-* ; gcc -g -o hello-glibc hello.c -lpthread && ldd hello-glibc ; wc --bytes hello-glibc && time perl -e 'foreach(1..10000){ $out= `./hello-glibc`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
linux-vdso.so.1 (0x00007ffe0c13a000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007faf24862000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007faf24671000)
/lib64/ld-linux-x86-64.so.2 (0x00007faf248a1000)
20760 hello-glibc
- ran 10000 times without seg fault!
real 0m10.066s
user 0m7.287s
sys 0m5.390s
- Note: when running via gdb below, we only run 100 times because gdb appears to slow things down by an order of magnitude.
- Observations:
- The test code compiled with musl seg faults on the very first run if run via gdb.
- Re-running the compile-with-musl-and-run-100-times command line mostly fails on the first run, but occasionally on a later run such as the second or fourth... at least on my laptop.
- Although this test code reliably seg faults for me when compiled with musl, and only when run via gdb, the original issue was discovered without gdb; presumably the outcome depends heavily on the running environment: timing delays, what else is running on the box, etc. Typical race condition stuff?
$ rm -f hello-* ; musl-gcc -g -static -o hello-musl hello.c && ldd hello-musl ; wc --bytes hello-musl && time perl -e 'foreach(1..100 ){ $out=`/bin/gdb --silent -ex="set confirm off" --ex=run -ex=quit --args ./hello-musl`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
not a dynamic executable
48416 hello-musl
- detected seg fault on run #1:
Reading symbols from ./hello-musl...
Starting program: /home/simon/20200124-musl-thread-race-condition/hello-musl
debug: my_pthread=0 // musl and main() <-- before pthread_create()
[New LWP 44731]
debug: my_pthread=0 // musl and my_thread_main() <- before pthread_getattr_np()
debug: my_pthread=0x7ffff7ffaf20 // musl and main() <-- after pthread_create()
Thread 2 "hello-musl" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 44731]
0x0000000000404c6c in pthread_getattr_np ()
real 0m0.069s
user 0m0.052s
sys 0m0.019s
$ rm -f hello-* ; gcc -g -static -static-libgcc -static-libstdc++ -o hello-glibc hello.c -lpthread && ldd hello-glibc ; wc --bytes hello-glibc && time perl -e 'foreach(1..100 ){ $out=`/bin/gdb --silent -ex="set confirm off" --ex=run -ex=quit --args ./hello-glibc`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
not a dynamic executable
1593216 hello-glibc
- ran 100 times without seg fault!
real 0m5.334s
user 0m4.227s
sys 0m1.234s
$ rm -f hello-* ; gcc -g -o hello-glibc hello.c -lpthread && ldd hello-glibc ; wc --bytes hello-glibc && time perl -e 'foreach(1..100 ){ $out=`/bin/gdb --silent -ex="set confirm off" --ex=run -ex=quit --args ./hello-glibc`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
linux-vdso.so.1 (0x00007ffdce998000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f8a4f94f000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8a4f75e000)
/lib64/ld-linux-x86-64.so.2 (0x00007f8a4f98e000)
20760 hello-glibc
- ran 100 times without seg fault!
real 0m18.583s
user 0m15.939s
sys 0m2.790s
- Observations:
- The musl build also seg faults when run via strace, and the following output line shows that `my_pthread` is incorrectly and unexpectedly zero, which causes the seg fault:
debug: my_pthread=0 // musl and my_thread_main() <- before pthread_getattr_np()
$ rm -f hello-* ; musl-gcc -g -static -o hello-musl hello.c && ldd hello-musl ; wc --bytes hello-musl && time perl -e 'foreach(1..100 ){ $out=`strace ./hello-musl 2>&1`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
not a dynamic executable
48416 hello-musl
- detected seg fault on run #1:
execve("./hello-musl", ["./hello-musl"], 0x7fff09ffd700 /* 49 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x409618) = 0
set_tid_address(0x409830) = 90267
ioctl(1, TIOCGWINSZ, 0x7ffda93721b0) = -1 ENOTTY (Inappropriate ioctl for device)
writev(1, [{iov_base="debug: my_pthread=0 // musl and "..., iov_len=66}, {iov_base="\n", iov_len=1}], 2debug: my_pthread=0 // musl and main() <-- before pthread_create()
) = 67
rt_sigprocmask(SIG_UNBLOCK, [RT_1 RT_2], NULL, 8) = 0
mmap(NULL, 143360, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f392ce2a000
mprotect(0x7f392ce2c000, 135168, PROT_READ|PROT_WRITE) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1 RT_2], [], 8) = 0
clone(child_stack=0x7f392ce4ced8, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|0x400000, parent_tidptr=0x7f392ce4cf58, tls=0x7f392ce4cf20, child_tidptr=0x409830) = 90268
debug: my_pthread=0 // musl and my_thread_main() <- before pthread_getattr_np()
rt_sigprocmask(SIG_SETMASK, [], <unfinished ...>) = ?
+++ killed by SIGSEGV (core dumped) +++
real 0m0.192s
user 0m0.006s
sys 0m0.005s
$ rm -f hello-* ; gcc -g -static -static-libgcc -static-libstdc++ -o hello-glibc hello.c -lpthread && ldd hello-glibc ; wc --bytes hello-glibc && time perl -e 'foreach(1..100 ){ $out=`strace ./hello-glibc 2>&1`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
not a dynamic executable
1593216 hello-glibc
- ran 100 times without seg fault!
real 0m0.769s
user 0m0.535s
sys 0m0.515s
$ rm -f hello-* ; gcc -g -o hello-glibc hello.c -lpthread && ldd hello-glibc ; wc --bytes hello-glibc && time perl -e 'foreach(1..100 ){ $out=`strace ./hello-glibc 2>&1`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
linux-vdso.so.1 (0x00007ffcd6db4000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f39a21a4000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f39a1fb3000)
/lib64/ld-linux-x86-64.so.2 (0x00007f39a21e3000)
20760 hello-glibc
- ran 100 times without seg fault!
real 0m1.401s
user 0m0.700s
sys 0m1.395s
- Using docker to run Alpine Linux, again the test code seg faults only if run via gdb:
$ sudo docker run -it --rm -v `pwd`:/extern alpine:3.11.0 sh
/ # apk update ; apk add gdb gcc libc-dev coreutils perl strace
/ # apk list musl
musl-1.1.24-r0 x86_64 {musl} (MIT) [installed]
/ # cd extern/
/extern # rm -f hello-* ; gcc -g -static -o hello-musl hello.c && ldd hello-musl ; wc --bytes hello-musl && time perl -e 'foreach(1..10000){ $out= `./hello-musl`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
/lib/ld-musl-x86_64.so.1 (0x7f11e76cb000)
232848 hello-musl
- ran 10000 times without seg fault!
real 0m 5.89s
user 0m 4.46s
sys 0m 2.73s
/extern # rm -f hello-* ; gcc -g -static -o hello-musl hello.c && ldd hello-musl ; wc --bytes hello-musl && time perl -e 'foreach(1..100 ){ $out=`/usr/bin/gdb --silent -ex="set confirm off" --ex=run -ex=quit --args ./hello-musl`; $c++; if($out =~ m~SIGSEGV~i){ printf qq[- detected seg fault on run #%d:\n%s], $c, $out; exit; } } printf qq[- ran %d times without seg fault!\n], $c;'
/lib/ld-musl-x86_64.so.1 (0x7f2d91d58000)
232848 hello-musl
warning: Error disabling address space randomization: Operation not permitted
9 src/thread/pthread_getattr_np.c: No such file or directory.
- detected seg fault on run #1:
Reading symbols from ./hello-musl...
Starting program: /extern/hello-musl
debug: my_pthread=0 // musl and main() <-- before pthread_create()
[New LWP 20183]
debug: my_pthread=0 // musl and my_thread_main() <- before pthread_getattr_np()
debug: my_pthread=0x7fc79b3bdf20 // musl and main() <-- after pthread_create()
Thread 2 "hello-musl" received signal SIGSEGV, Segmentation fault.
[Switching to LWP 20183]
pthread_getattr_np (t=0x0, a=0x7fc79b3bde70) at src/thread/pthread_getattr_np.c:9
real 0m 0.12s
user 0m 0.09s
sys 0m 0.03s
- Running ldd on `hello-musl` on Alpine reports a .so file being used.
- Running ldd on `hello-musl` on Ubuntu reports no .so file being used.
- Why?
/extern # sha256sum hello-musl ; ldd hello-musl
2c51a79cec19fbcdc22e8c1be586af81847b9e2bea37449e6827bae291c1c7b7 hello-musl
/lib/ld-musl-x86_64.so.1 (0x7fcb2bf64000)
$ sha256sum hello-musl ; ldd hello-musl
2c51a79cec19fbcdc22e8c1be586af81847b9e2bea37449e6827bae291c1c7b7 hello-musl
statically linked