This is a story about an operating system problem that occupied me at work in 2015 for couple of days. The goal is to give you an overview of debugging techniques and associated tools. I'd like to use this opportunity (writing this article for my students of the Unix programming class at the Mathematical-physical faculty of the Charles university in Prague in 2021) to provide some historical points, some background and some details about operating system work (I feel old writing this). As a bit of a spoiler, the cause of the problem turned out to be completely different than what it seemed initially and involved traversing multiple layers in the operating system. I love this kind of problems because it is fun and great learning experience. I vaguely remember a job advertisement in Netfix bunch of years back that demanded that the applicants posses intimate knowledge of the operating system internals down to (and including) the hardware and be able to fix problems spanning these layers within a single day. Given it took me couple of days to root cause this one (with extensive help of a colleague) I guess it would be hard for me to pass :-) I presented this problem at least twice, once at MFF UK (Charles university), once at Technical University Liberec. Actually, the very first time I talked about this problem was internally during what we called "bug of the week" which was an informal meeting where anyone could present any interesting problem he was working on (solved or not) in front of a whiteboard. We've just entered 2021 and I can no longer readily reproduce the problem (more on that later) so I will have to rely on my notes so the information flow might not be fully contiguous; please bear with me.
Before we dive into the meat of the problem, let me give you somewhat broad background. The operating system in question is Solaris. Solaris 2.0 (codename Callisto, based on System V rel. 4, different to BSD OS based SunOS) was released in june 1992. I was in basic school at that point, probably playing with 16-bit computers (Atari ST) and maybe even PCs. Solaris 10 that is still supported today (till 2024 according to the current information) was released in January 2005, around the time I was finishing my master studies at the university. In 2006 I joined Sun Microsystems as a sustaining engineer, to fix bugs in security technologies in Solaris, a dream come true. I remember reading Jeff Bonwick's whitepaper from 1994 about the slab allocator (https://dl.acm.org/doi/10.5555/1267257.1267263) in high school (during lectures) and now I found myself in the same company, not very far organization wise from the engineers working on Solaris internals (that actually still had Bonwick and other "stars" on the team). This was the period of OpenSolaris inception, and in general a period of opening the Solaris operating system development to the wide world.
This article would be quite nice for that period of time (however is untimely both forward and backward wise as you will learn below).
In 2007 I needed Netcat in Solaris for general use and testing so after some hurdles (mostly related to internal processes and dependencies such as warn/err functions missing in libc etc.) I took the version from OpenBSD (being long time fan of the project) and integrated it into the ON (OS/Net) consolidation of Solaris (the release is comprised from bits from various consolidations). I even wrote a test suite for Netcat from scratch and delivered it into the STC consolidation so it could be run periodically (automatically for the bi-weekly build of Solaris) and/or on demand. It was based on the CTI/TET framework (not very pleasant experience) but it did the job.
In 2009, a periodic run of the test suite uncovered a bug in the in-kernel bcopy()/kcopy()
implementation. The fix for this problem was delivered by an engineer from Intel. I called that "collateral salutariness" (as opposed to collateral damage) because it prevented a regression from appearing in the bits delivered to the customers that could be hard to track down and could impact other components of the system in ways that would be hard to debug.
In 2010 Sun was bought by Oracle and the ON source code was closed. The development still continues internally and I believe the Netcat implementation I integrated to ON is still part of the OpenSolaris derived systems such as Illumos. Any source code references here are taken from the OpenSolaris code right at the point before it become closed.
In March 2015 a bug was created against Netcat in Solaris by the ON-PIT team. This is a team that runs the infrastructure to test Solaris development builds. The members of the team also provide baselining and basic root causing. This was the period when Solaris 12 was being developed (later to be reworked and named Solaris 11.4). The problem appereared to be new in build 69 and seemed to impact only x86 based system (Solaris runs on both SPARC and x86 platforms). The test suite has tens of test purposes (in CTI/TET lingo a test cases covers a distinct functionality, while a test purpose covers part of that functionality), there were also tens of failures that seemed to share the same cause so the problem seemed to be external to Netcat. Also, there were no changes to Netcat for couple of builds.
What did I do as a first thing ? Correct, tried to reproduce it !
root@va64-v20zl-prg06:~# MALLOC_DEBUG=RW LD_PRELOAD=watchmalloc.so.1 nc -l 4444
Trace/Breakpoint Trap (core dumped)
root@va64-v20zl-prg06:~# pstack core
core 'core' of 15310: nc -l 4444
00007ff1acf91064 __getaddrinfo () + 524
00007ff1ad000d9f __xnet_getaddrinfo () + f
0000000000407889 local_listen () + 39
0000000000406c15 main () + 17c5
0000000000405204 ???????? ()
What does this do ? Taking the environment variables aside, it runs Netcat in server mode, making it listen on TCP socket on port 4444.
In the above output we can see actually 2 debugging tools in action. The first tool is the watchmalloc
library that overrides the default heap allocator in libc (by using the LD_PRELOAD
environment variable as a hint to the dynamic linker and the MALLOC_DEBUG
environment variable to control the behavior of the library) and that provides error checking capabilities. When the program performs an operation that violates some of the rules, it is forcibly terminated, in this case with the SIGTRAP
signal.
To look at the resulting core file, second tool is used, pstack
can display the stack(s) of either running process or from a core dump. The __getaddrinfo()
call comes from the standard (yes, I mean Single UNIX Specification, in this case POSIX) getaddrinfo()
call.
Now, the test suite runs with the watchmalloc
in order to enforce more stricness of how the Netcat should deal with memory on the heap.
At this point, it might be tempting to throw the bug over the fence to the Naming team (yes, silo-ing ia a thing in OS development), however it deserves a bit more investigation, especially given that there were no changes in the getaddrinfo()
in the last couple of builds.
When I run the program without the watchmalloc
library, everything is fine. What does actually happen inside nc
and libc
when the getaddrinfo()
is called ? Let's use another tool, truss
, that can trace both system and library calls (like ltrace
on Linux):
root@va64-v20zl-prg06:~# truss -u libc::*getaddrinfo -t so_socket,accept,bind -v bind nc -l 4444
/1: -> libc:__xnet_getaddrinfo(0x0, 0xffff80ffbffffe6e, 0xffff80ffbffff0f0, 0xffff80ffbffff0a8, 0x30, 0x101010101010101)
/1: -> libc:__getaddrinfo(0x0, 0xffff80ffbffffe6e, 0xffff80ffbffff0f0, 0xffff80ffbffff0a8, 0x1, 0x101010101010101)
/1: <- libc:__getaddrinfo() = 0
/1: <- libc:__xnet_getaddrinfo() = 0
/1: so_socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP, NULL, SOV_XPG4_2) = 3
/1: bind(3, 0x0041FB80, 32, SOV_XPG4_2) = 0
/1: AF_INET6 name = :: port = 4444
/1: scope id = 0 source id = 0x0
/1: flow class = 0x00 flow label = 0x00000
/1: accept(3, 0xFFFF80FFBFFFF720, 0xFFFF80FFBFFFF17C, SOV_DEFAULT) (sleeping...)
Looks like the result of getaddrinfo()
is used to populate the arguments of the socket()
and bind()
system calls. In this case, a IPv6 TCP wildcard socket is created (that can actually accept IPv4 connections as well).
From the core file we know that the crash happened in libc
's __getaddrinfo+0x524
. How to identify the source code line matching this offset ?
Let's try to match the C code to the disassembly. This can be done either with the dis
command:
$ dis -F __getaddrinfo /lib/64/libc.so.1
or using the mdb
debugger:
$ mdb core << EOF
$c
libc.so.1`__getaddrinfo+0x524::dis
EOF
Note that nc
has been converted to 64-bit program some time ago, hence the /64
path component in the above command. Therefore we will be dealing solely with 64-bit environment.
The advantage of using the core file (or live program stopped at the getaddrinfo()
function before it crashes) over the dis
command is that the relocations have been performed and PLT (Procedure Linkage Table) entries are populated so that we can see the actual function names.
You might ask: given this problem can be reproduced at will, why not just to single step the program and identify the crashing instruction in debugger or use debug version of the libc that has source code information embedded ? These 2 conditions are quite rare in the real world, especially when dealing with kernel crashes (for which you usually obtain core file as well) so engineers are usually quite well versed in mapping disassembly to source code so why not to hone this skill further ? This is actually a critical skill for anyone who ever worked in Solaris sustaining. I used to have my table (sometimes the floor as well) at work covered with prints of disassembly annotated with C source code.
Let's go:
> libc.so.1`__getaddrinfo+0x524::dis
libc.so.1`__getaddrinfo+0x4f4: movq $0x30,%rdi
libc.so.1`__getaddrinfo+0x4fb: call -0x6e7e0 <PLT=watchmalloc.so.1`malloc>
- this is
malloc(0x30)
(sizeof (struct addrinfo) = 0x30 = 48 bytes
)
libc.so.1`__getaddrinfo+0x500: movq %rax,%rbx
libc.so.1`__getaddrinfo+0x503: testq %rbx,%rbx
libc.so.1`__getaddrinfo+0x506: je +0x1f8 <libc.so.1`__getaddrinfo+0x704>
- check the malloc() return value,
rbx
now contains the address of a structure to be later identified asnai
.
libc.so.1`__getaddrinfo+0x50c: movdqu -0x80(%rbp),%xmm0
- move 16 bytes at offset 0 of a structure to be later identified as
aip
toxmm0
.
libc.so.1`__getaddrinfo+0x511: movdqu -0x70(%rbp),%xmm1
- move 16 bytes at offset 16 of
aip
toxmm1
libc.so.1`__getaddrinfo+0x516: movdqu %xmm0,(%rbx)
- move the contents of
xmm0
tonai
at offset 0
libc.so.1`__getaddrinfo+0x51a: movdqu %xmm1,0x10(%rbx)
- move the contents of
xmm1
tonai
offset 16
libc.so.1`__getaddrinfo+0x51f: movdqu -0x60(%rbp),%xmm0
- move 16 bytes at offset 32 of
aip
toxmm0
- this is the last piece
libc.so.1`__getaddrinfo+0x524: movdqu %xmm0,0x20(%rbx) <<< SIGTRAP
- move
xmm0
to offset 32 ofnai
libc.so.1`__getaddrinfo+0x529: movq $0x20,%rdi
libc.so.1`__getaddrinfo+0x530: call -0x6e815 <PLT=watchmalloc.so.1`malloc>
- call
malloc(0x20)
(0x20
issizeof (struct sockaddr_in6)
)
libc.so.1`__getaddrinfo+0x535: movq %rax,0x20(%rbx)
libc.so.1`__getaddrinfo+0x539: testq %rax,%rax
libc.so.1`__getaddrinfo+0x53c: je +0x1ba <libc.so.1`__getaddrinfo+0x6fc>
libc.so.1`__getaddrinfo+0x542: movq %rax,%rdi
libc.so.1`__getaddrinfo+0x545: movq $0x20,%rsi
bzero()
works well for the matching since__getaddrinfo()
contains just 2 calls tobzero()
libc.so.1`__getaddrinfo+0x54c: call -0x6b2e1 <libc.so.1`bzero>
libc.so.1`__getaddrinfo+0x551: movl $0x20,0x10(%rbx)
libc.so.1`__getaddrinfo+0x558: movl $0x1a,%eax
This matches this part of _getaddrinfo()
implementation in libc
(Netcat is called in listener more without address specification so the hostname
pointer is NULL
so this makes sense):
429 /*
430 * hostname is NULL
431 * case 1: AI_PASSIVE bit set : anyaddr 0.0.0.0 or ::
432 * case 2: AI_PASSIVE bit not set : localhost 127.0.0.1 or ::1
433 */
434
435 if (hostname == NULL) {
436 struct addrinfo *nai;
437 socklen_t addrlen;
438 char *canonname;
439
440 if (aip->ai_family == PF_INET)
441 goto v4only;
442 /* create IPv6 addrinfo */
443 nai = malloc(sizeof (struct addrinfo)); << __getaddrinfo+0x4fb
444 if (nai == NULL)
445 goto nomem;
446 *nai = *aip; << here the process got SIGTRAP !!!
447 addrlen = sizeof (struct sockaddr_in6);
448 nai->ai_addr = malloc(addrlen); << __getaddrinfo+0x530
449 if (nai->ai_addr == NULL) {
450 freeaddrinfo(nai);
451 goto nomem;
452 }
453 bzero(nai->ai_addr, addrlen); << __getaddrinfo+0x54c
454 nai->ai_addrlen = addrlen;
455 nai->ai_family = PF_INET6;
456 nai->ai_canonname = NULL;
The operation on line 446 is structure to structure copy, i.e. the aip
structure is copied to the nai
structure, so in the end they should match byte by byte. In our case this operation never finished because of the SIGTRAP
.
Looking at the source of the copy operation, i.e. the aip
structure:
<rbp-0x80=K
ffff80ffbfffefc0
> ffff80ffbfffefc0::print struct addrinfo
{
ai_flags = 0x8
ai_family = 0
ai_socktype = 0x2
ai_protocol = 0x6
ai_addrlen = 0
ai_canonname = 0
ai_addr = 0
ai_next = 0
}
> ::sizeof struct addrinfo
sizeof (struct addrinfo) = 0x30
The 4 bytes values at the start of the buffer are:
- 8 is
ai_flags
, this isAI_PASSIVE
- 0 is
ai_family
, not being set yet - 2 is
ai_socktype
, this isSOCK_STREAM
- 6 is
ai_protocol
, this istcp
So this indeed looks like a valid addrinfo
structure:
struct addrinfo {
int ai_flags; /* AI_PASSIVE, AI_CANONNAME, ... */
int ai_family; /* PF_xxx */
int ai_socktype; /* SOCK_xxx */
int ai_protocol; /* 0 or IPPROTO_xxx for IPv4 and IPv6 */
#ifdef __sparcv9
int _ai_pad; /* for backwards compat with old size_t */
#endif /* __sparcv9 */
socklen_t ai_addrlen;
char *ai_canonname; /* canonical name for hostname */
struct sockaddr *ai_addr; /* binary address */
struct addrinfo *ai_next; /* next structure in linked list */
};
How does the debugger actually know the offsets of the members of the structures ? In this case it uses so called CTF data that is attached to the libc binary during the build (C-Type Format compact representation of the data types used for building the binary, used also for the kernel) however there are other debugging metadata types such as DWARF (in fact in some cases the CTF is built from DWARF on Solaris).
Let's take a look at the rbx
register because this is the malloc()
return value, i.e. the nai
structure:
> <rbx=K
41cba0
and display the contents of this buffer (20X
means display 20 bytes in 4 byte quantities) at the point of the crash:
> 41cba0/20X
0x41cba0: 8 0 2 6 0 0 0 0 baddcafe
baddcafe baddcafe baddcafe 30 0 baddcafe baddcafe 41cc10 0
deadbeef deadbeef
SIGTRAP
was delivered when it was writing 16 bytes on 0x41cba0 + 0x20 = 0x41cbc0
, however the structure does not end there since the allocator returns a buffer of 48 bytes; these 4 baddcafe
patterns should be writeable. The watchmalloc(3MALLOC) man page explains this pattern:
The malloc() function returns with the allocated memory filled with the pattern 0xbaddcafe as a precaution against applications incorrectly expecting to receive back unmodified memory from the last free().
I.e. the first 32 bytes were already copied. The last 16 bytes were not because the buffer still has these patterns. Here is how the buffer got populated, time going downwards:
41cba0 41cbb0 41cbc0 41cbd0
+--------+--------+--------+--------+--------+--------+
| | | | | | |
+--------+--------+--------+--------+--------+--------+
^^^^^^^^^^^^^^^^^^
8 0 2 6
^^^^^^^^^^^^^^^^^
zeroes
^^^^^^^^^^^^^^^^^^
unfinished (SIGTRAP)
The 30 0
bytes following these baddcafe
words are heap allocator (in this case watchmalloc
) matadata. More on that below.
Debugging with ever changing pointer is tedious. The heap pointer changes because of ASLR (Address Space Layout Randomization). In fact it changes the location of heapu, stack, anon segment, libraries (text/data stay put).
By default ASLR is turned on only for binaries that contain certain pieces of information recorded by the link-editor that they are to be run with ASLR. The information is stored in the .dynamic
section of the ELF file. This can be seen with the elfdump
command (another debugging tool, provides similar functionality like readelf/objdump on Linuxu):
$ elfdump -d /usr/bin/nc | grep -i aslr
[37] SUNW_ASLR 0x2 ENABLE
How to disable it ? This functionality is part of the Solaris Security extensions framework that can be controlled using the sxadm
command, e.g. this disables this for single run of a program:
sxadm exec -s aslr=disable /usr/bin/pmap self
It is certainly possible to disable it system wide:
# sxadm disable aslr
# sxadm status
EXTENSION STATUS CONFIGURATION
aslr disabled disabled
nxstack enabled (all) default (default)
nxheap enabled (tagged-files) default (default)
This is a bit too hard core, is is sufficient to run a shell with ASLR disabled and the setting will be inherited to the child processes:
sxadm exec -s aslr=disable /bin/bash
It should be also possible to edit the ELF binary itself (perhaps using the elfedit
tool, I haven't checked).
watchmalloc
is a simple heap allocator. It delivers the same overall functionality as the default heap allocator in libc however provides means to enforce correctness. For example, it crashes the application whenever it tries to write to free'd memory. It also employs red zones that write protect allocator metadata. The way a heap allocator works is that it requests memory from the kernel and then divides and conquers this memory. Each buffer managed by the heap allocator and handed out to the caller (e.g. via malloc()/calloc() etc.) has a header carrying the metadata. The metadata is used for the buffer management (e.g. connecting the buffer to the data structure used by the allocator, for coalescing of the buffers, ...) and misused by heap overflow exploits.
When watchmalloc
is run with WATCH
or RW
flag, it will request kernel to protect the buffer headers carrying the metadata. Let's take a look at how this is done.
The main data structures used:
/*
* All of our allocations will be aligned on the least multiple of 4,
* at least, so the two low order bits are guaranteed to be available.
*/
#ifdef _LP64
#define ALIGN 16
#else
#define ALIGN 8
#endif
/* the proto-word; size must be ALIGN bytes */
typedef union _w_ {
size_t w_i; /* an unsigned int */
struct _t_ *w_p[2]; /* two pointers */
} WORD;
/* structure of a node in the free tree */
typedef struct _t_ {
WORD t_s; /* size of this element */
WORD t_p; /* parent node */
WORD t_l; /* left child */
WORD t_r; /* right child */
WORD t_n; /* next in link list */
WORD t_d; /* dummy to reserve space for self-pointer */
} TREE;
The size of WORD
in 64-bit program is 16 bytes. The free list is normally represented by self balancing binary tree.
For small buffers (less than MINSIZE
which is defined as sizeof (TREE) - WORDSIZE
and equals to 96 - 16 = 80 bytes in 64-bit program) malloc()
calls smalloc()
as a fast path.
I added couple of debug prints to smalloc()
(could have traced the watchmalloc
library calls with truss
however the LD_PRELOAD
of watchmalloc
would affect the heap allocations performed by the truss
itself) to print the addresses in the main path of the function:
/*
* Keep lists of small blocks, LIFO order.
*/
static TREE *List[MINSIZE/WORDSIZE-1];
static TREE *Last[MINSIZE/WORDSIZE-1];
...
static void *
smalloc(size_t size)
{
TREE *tp;
size_t i;
...
/* list to use */
i = size / WORDSIZE - 1;
...
/* allocate from the head of the queue */
tp = List[i];
unprotect(tp);
if ((List[i] = AFTER(tp)) == NULL)
Last[i] = NULL;
copy_pattern(LIVEPAT, tp); // LIVEPAT = 0xbaddcafebaddcafeULL
SETBIT0(SIZE(tp)); // marks the buffer as 'busy', #define SIZE(b) (((b)->t_s).w_i)
protect(tp);
return (DATA(tp)); // #define DATA(b) (((char *)(b)) + WORDSIZE)
}
In our case the debug prints yield:
UNPROTECT 41cb90 (16)
SMALLOC 41cb90 41cba0 49
PROTECT 41cb90 16
The allocation requested 48 bytes. The tp
is 41cb90
, smalloc()
returned 41cba0
, which is 16 bytes further.
Only the metadata header is therefore protected. This is also mentioned in the wathmalloc(3MALLOC)
man page:
A header is maintained before each block of allocated memory. Each header is covered with a watched area, thereby providing a red zone before and after each block of allocated memory (the header for the subsequent memory block serves as the trailing red zone for its preced- ing memory block). Writing just before or just after a memory block returned by malloc() will trigger a watchpoint trap.
However, how come the smalloc()
actually returns the memory of the TREE
structure ? This is simply a trick used by the allocator. For these small buffers only the first member of the TREE
structure matters. The rest is used for the data of the buffer. That's why the MINSIZE
is computed as sizeof (TREE) - sizeof (WORD)
. Antoher interesting trick is the use of lower bits in the address to store additional metadata like the used buffer indication. This is possible due to the buffers returned from malloc()
being aligned and therefore having low bits always zero.
The write protection is something the program cannot do by itself; it needs kernel to do it. How exactly does watchmalloc
request the header protection from the kernel ? It uses the procfs
pseudo file system in protect(tp)
.
The protect()
call does this:
protect(TREE *tp)
...
ctl.cmd = PCWATCH;
ctl.prwatch.pr_vaddr = (uintptr_t)tp;
ctl.prwatch.pr_size = size + WORDSIZE;
ctl.prwatch.pr_wflags = wflags;
(void) write(ctlfd, &ctl, sizeof (ctl));
where ctlfd
is file descriptor from open of /proc/<pid>/ctl
which can be seen in mdb
via the ::pfiles
d-command or via the pfiles(1) program (here we can see the former in action):
> ffffc10007005160::pfiles
FD TYPE VNODE INFO
0 CHR ffffc1000aad2740 /dev/pts/1
1 CHR ffffc1000aad2740 /dev/pts/1
2 CHR ffffc1000aad2740 /dev/pts/1
3 PROC ffffc106e5bc5340 /proc/15451/ctl (proc=ffffc10007005160)
What does procfs
do with the request in the kernel ? pcontrol.c#pr_watch()
:
pwa = kmem_alloc(sizeof (struct watched_area), KM_SLEEP);
pwa->wa_vaddr = (caddr_t)vaddr;
pwa->wa_eaddr = (caddr_t)vaddr + size;
pwa->wa_flags = (ulong_t)wflags;
error = ((pwa->wa_flags & ~WA_TRAPAFTER) == 0)?
clear_watched_area(p, pwa) : set_watched_area(p, pwa);
set_watched_area()
does this:
target = pr_find_watched_area(p, pwa, &where);
...
avl_insert(&p->p_warea, pwa, where);
error = set_watched_page(p, vaddr, eaddr, flags, 0);
p
isproc_t
pointer (this is a giant kernel structure representing a process)pwa
is a structure saved into an AVL tree inproc_t
- the tree is created in
getproc()
like so:
- the tree is created in
/*
* Initialize watchpoint structures
*/
avl_create(&cp->p_warea, wa_compare, sizeof (struct watched_area),
offsetof(struct watched_area, wa_link));
- and
wa_compare()
is the comparison function based on the start of the protected memory chunk:
if (pa->wa_vaddr < pb->wa_vaddr)
return (-1);
else if (pa->wa_vaddr > pb->wa_vaddr)
return (1);
else
return (0);
The granularity of the protection is a page, i.e. every page that has at least one piece of protected metadata is set as write-protected and whenever the process performs write it will cause a trap. This is one of the reasons why watchmalloc
incurs significant performance penalty. Given that there will be most likely other (legitimate) writes to different locations of a page that has at least one watchmalloc
metadata buffer and therefore is write protected, the lookup whether the write falls into one of the protected areas within the page needs to be quick. Hence the AVL tree. The watchmalloc
man page talks about making the program 10 to 1000 times slower, depending on its allocation pattern.
The trap is handled in trap()
as a page-fault (#PF). Here is x86 specific trap handling code:
746 ASSERT(!(curthread->t_flag & T_WATCHPT));
747 watchpage = (pr_watch_active(p) && pr_is_watchpage(addr, rw));
748 vaddr = addr;
749 if (!watchpage || (sz = instr_size(rp, &vaddr, rw)) <= 0)
750 fault_type = (errcode & PF_ERR_PROT)? F_PROT: F_INVAL;
751 else if ((watchcode = pr_is_watchpoint(&vaddr, &ta,
752 sz, NULL, rw)) != 0) {
753 if (ta) {
754 do_watch_step(vaddr, sz, rw,
755 watchcode, rp->r_pc);
756 fault_type = F_INVAL;
757 } else {
758 bzero(&siginfo, sizeof (siginfo));
759 siginfo.si_signo = SIGTRAP;
760 siginfo.si_code = watchcode;
761 siginfo.si_addr = vaddr;
762 siginfo.si_trapafter = 0;
763 siginfo.si_pc = (caddr_t)rp->r_pc;
764 fault = FLTWATCH;
765 break;
766 }
767 } else {
768 /* XXX pr_watch_emul() never succeeds (for now) */
769 if (rw != S_EXEC && pr_watch_emul(rp, vaddr, rw))
770 goto out;
771 do_watch_step(vaddr, sz, rw, 0, 0);
772 fault_type = F_INVAL;
773 }
774
775 res = pagefault(addr, fault_type, rw, 0);
It is therefore necessary to find out whether given memory chunk accessed by the movdqu
instruction matches one of the nodes in the AVL tree.
Some interesting tidbits:
instr_size()
calls thedtrace
disassembler to determine how big is the memory chunk is being read/written by the instructionpr_is_watchpoint()
then finds out whether the address on which the trap happened (vaddr
) matches (sz
) one of the tree nodes
At this point my hypothesis was that this is a bug in watchmalloc
or procfs
.
To confirm it, we will need to look at the AVL tree nodes to see the list of protected memory chunks.
Again, thanks to ASLR being off, the addresses will remain the same across multiple program executions.
Instead of getting SIGTRAP
I will break the program execution so that the in-kernel structures can be inspected:
> ::pgrep nc
S PID PPID PGID SID UID FLAGS ADDR NAME
R 15451 15418 15451 15305 0 0x4a004000 ffffc10007005160 nc
> ffffc10007005160::print -a proc_t p_warea
ffffc10007005348 p_warea = {
ffffc10007005348 p_warea.avl_root = 0xffffc106e5bbbb60
ffffc10007005350 p_warea.avl_compar = wa_compare
ffffc10007005358 p_warea.avl_offset = 0
ffffc10007005360 p_warea.avl_numnodes = 0x103
ffffc10007005368 p_warea.avl_size = 0x30
}
> ffffc10007005160::print -a proc_t p_warea | ::walk avl | ::print struct watched_area wa_vaddr wa_eaddr ! head
wa_vaddr = 0x41cb40
wa_eaddr = 0x41cb50
wa_vaddr = 0x41cb50
wa_eaddr = 0x41cb60
wa_vaddr = 0x41cb90
wa_eaddr = 0x41cba0
wa_vaddr = 0x41cbd0
wa_eaddr = 0x41cc10
wa_vaddr = 0x41cc10
wa_eaddr = 0x41cc50
This is the relevant part of the output:
wa_eaddr = 0x41cb60
wa_vaddr = 0x41cb90
wa_eaddr = 0x41cba0
wa_vaddr = 0x41cbd0
wa_eaddr = 0x41cc10
The protected memory region is [0x41cb90, 0x41cba0)
(which matches the debug output of smalloc()
above) and the next one is [0x41cbd0, 0x41cc10)
. This means that data buffer [0x41cba0, 0x41cbd0)
which is 48 bytes is not protected.
Here is how it looks like in memory:
41cb90 41cba0 41cbb0 41cbc0 41cbd0 41cc10
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| red zone | b u f f e r | red zone |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
This means this is not a fault of either procfs
or watchmalloc
.
This was the moment where I engaged my colleague Tomas Jedlicka (who used to be actually one of my students in the Unix programming class at MFF UK; now works at Apple I believe) sitting in same area (these were the times before COVID) who was working as Solaris kernel developer at that point (and before that he was a member of the Solaris naming team responsible for implementations of technologies such as DNS or LDAP and also APIs like getaddrinfo()
, so that was perfect fit :-)).
How to verify that at given moment rbx
contained indeed 0x41cba0
? (i.e. to make sure the address did not get garbled on the way) Let's have a look at the kernel thread stack for the LWP (Light Weight Process) in given process:
> ::pgrep nc
S PID PPID PGID SID UID FLAGS ADDR NAME
R 15451 15418 15451 15305 0 0x4a004000 ffffc10007005160 nc
> ffffc10007005160::walk thread | ::findstack
stack pointer for thread ffffc10005f89240: fffffffc810e7d20
[ fffffffc810e7d20 _resume_from_idle+0x14f() ]
fffffffc810e7d50 swtch+0x165()
fffffffc810e7dd0 stop+0x7d4()
fffffffc810e7e10 stop_on_fault+0xa6()
fffffffc810e7ef0 trap+0x1369()
fffffffc810e7f00 0xfffffffffb8001a7()
Need to look into the trap frame. After the page fault the HW saved the registers to stack:
> fffffffc810e7ef0/10K
0xfffffffc810e7ef0: fffffffc810e7f00 fffffffffb8001a7 0 7ff1acf91064 7ff1a55f4110 0 0 0
0 ffffc10005f89240
> fffffffc810e7f00::print struct regs
{
r_savfp = 0
r_savpc = 0x7ff1acf91064
r_rdi = 0x7ff1a55f4110
r_rsi = 0
r_rdx = 0
r_rcx = 0
r_r8 = 0
r_r9 = 0xffffc10005f89240
r_rax = 0x41cba0
r_rbx = 0x41cba0 << yep, this is it !
r_rbp = 0xffff80ffbffff040
r_r10 = 0x10
r_r11 = 0x246
r_r12 = 0xffff80ffbffff068
r_r13 = 0xffff80ffbfff5c11
r_r14 = 0
r_r15 = 0
__r_fsbase = 0
__r_gsbase = 0
r_ds = 0
r_es = 0
r_fs = 0
r_gs = 0
r_trapno = 0xe << this is $14, it matches the page fault trap (uts/intel/ia32/ml/exceptions.s)
r_err = 0x7
r_rip = 0x7ff1acf91064
r_cs = 0x53
r_rfl = 0x10206
r_rsp = 0xffff80ffbfffefa0
r_ss = 0x4b
}
If procfs
or the watch point infrastructure itself is not at fault, we need to look further down, at the addresses passed to pr_is_watchpoint()
. The ideal tool for this is dtrace
, a dynamic instrumentation tool (I sort of hesitate to call it a tool because it opens so many possibilities to observe the system and therefore allows to ask questions that would not be possible to answer with other tools), specifically the FBT (Function Boundary Tracing provider) using this one liner:
$ dtrace -n 'fbt::trap:entry/execname == "nc"/ { printf("%p", arg1); self->t = 1; } fbt::pr_is_watchpoint:entry/execname == "nc" && self->t/ { printf("%p %ld", *((caddr_t *)arg0), arg2); } fbt::pr_is_watchpoint:return/self->t/ { trace(arg1); } fbt::trap:return/self->t/ { self->t = 0; }'
This allows to trace kernel function calls that are result of specific userland function calls. In this case we only want to trace calls to trap
that happened from within a nc
program (could as well used a specific PID however on the test system I was sure no one else would run nc
). Next, I wanted to see the arguments of the pr_is_watchpoint()
call that happened from within the call tree of that specific trap
call.
Looking at the output from dtrace the last call to pr_is_watchpoint()
is:
0 17077 trap:entry 0
0 17077 trap:entry ffff4ff0
0 17077 trap:entry 41cbc8
0 33469 pr_is_watchpoint:entry 41cbc8 16
0 33470 pr_is_watchpoint:return 4
But rbx
was 41cba0
, this is 40 bytes away. The 16
in the pr_is_watchpoint:entry
probe is the size of the data in bytes, i.e. the width of the instruction argument. This makes sense given that movdqu
copies 16 byte buffers.
Our buffer is [0x41cba0, 0x41cbd0)
, i.e. 0x41cbc8+0t16
(0t
is prefix used in mdb
for decimal number representation) is 0x41cbd8
, this is indeed beyond the boundary of the buffer. In fact it falls into the red zone (i.e. write protected) of the next buffer. From the AVL tree dump above we know the protected areas:
41cb90 41cba0 41cbb0 41cbc0 41cbc8 41cbd0 41cbd8 41cc10
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| red zone | b u f f e r | red zone |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
xxxxxxxxxxxxxxxxx
The dtrace output tells us that the return value from pr_is_watchpoint()
is 4, i.e.:
155 #define TRAP_WWATCH 4 /* write access watchpoint trap */
Simply, to the watchpoint code this looks like as if there was an instruction trying to write 16 bytes at 0x41cbc8
. Given this matches a protected area, it has to fail. From the disassembly we know that the actual instruction was in fact writing 16 bytes at 41cbc0
.
Where did the VA come from ? It is being passed to the call to trap()
:
461 trap(struct regs *rp, caddr_t addr, processorid_t cpuid)
How is trap()
being called ? From the table defined in uts/intel/ia32/ml/exception.s
:
1054 /*
1055 * #PF
1056 */
1057 ENTRY_NP(pftrap)
1058 TRAP_ERR(T_PGFLT) /* $14 already have error code on stack */
1059 INTR_PUSH
1060#if defined(__xpv)
1061
1062#if defined(__amd64)
1063 movq %gs:CPU_VCPU_INFO, %r15
1064 movq VCPU_INFO_ARCH_CR2(%r15), %r15 /* vcpu[].arch.cr2 */
1065#elif defined(__i386)
1066 movl %gs:CPU_VCPU_INFO, %esi
1067 movl VCPU_INFO_ARCH_CR2(%esi), %esi /* vcpu[].arch.cr2 */
1068#endif /* __i386 */
1069
1070#else /* __xpv */
1071
1072#if defined(__amd64)
1073 movq %cr2, %r15
1074#elif defined(__i386)
1075 movl %cr2, %esi
1076#endif /* __i386 */
1077
1078#endif /* __xpv */
1079 jmp cmntrap_pushed
1080 SET_SIZE(pftrap)
cr2
is copied to r15
which becomes 2nd argument to trap()
(%rsi
). cr2
comes from the HW, containes the VA on which the page fault happened.
We need to dig even deeper, to see how the traps are actually born. There is a nifty tool in Solaris called trap trace. The basic idea of trap-trace is to reprogram IDT on Intel, or the tsb
register on SPARC.
ttrace enable -K
ttrace config -K
Given the amount of traps it is necessary to expand the buffer for saving the data:
ttrace config -K -b 200k
and turn it off right after the operation is done:
ttrace disable -K
One can display the buffer:
ttrace dump
or better
ttrace dump -x
that will output like this:
0 26911539fe9c8e trap e page-fault 0x7ffe60f91064
rdi: 7ffe58614110 rsi: 0 rdx: 0
rcx: 0 r8: 0 r9: ffffc1000b6f0640
rax: 41cba0 rbx: 41cba0 rbp: ffff80ffbffff030
r10: 10 r11: 246 r12: ffff80ffbffff058
r13: ffff80ffbfff5c11 r14: 0 r15: 0
ds: 0 es: 0 fs: 0
gs: 0 trp: e err: 7
rip: 7ffe60f91064 cs: 53 rfl: 10206
rsp: ffff80ffbfffef90 ss: 4b cr2: 41cbc8
^^^^^^^^^^^^^^^^^^^^^
So, the CPU is emitting weird address. How could this happen ?
The third movdqu
fails, the beginning is on 0x41cba0
, the first 2 instructions copy [0x41cba0, 0x41cbb0)
. The third copies [0x41cbb0, 0x41cbc0)
and the trap
was delivered on 0x41cbc8
, i.e. in the half of the 16 byte chunk of the buffer.
In fact, each movdqu
instruction on this setup generates 2 traps. One for the first 8 bytes, one for the last 8 bytes. The first 5 traps when the structure is copied are okay, the last one causes the problem. The following diagram highlights the addresses in these traps (again, time goes downwards):
41cb90 41cba0 41cbb0 41cbc0 41cbc8 41cbd0 41cbd8 41cc10
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| red zone | b u f f e r | red zone |
+--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
^
xxxxxxxxxxxxxxxxxx
^
xxxxxxxxxxxxxxxxxx
^
xxxxxxxxxxxxxxxxxx
^
xxxxxxxxxxxxxxxxxx
^
xxxxxxxxxxxxxxxxxx
^
xxxxxxxxxxxxxxxxxx
The dtrace disassembler sees movdqu
instruction and the address and it knows the instruction copies 16 byte chunks, hence the red zone overlap for the last one.
This means that internally the procesor likely works with 8 bytes and exposes this unintentionally. The CPU should be sending just one trap for each movdqu
instruction with address equal to the destination argument.
All in all, this is a CPU bug. Specifically the CPU in my test system is AMD Opteron 244:
root@va64-v20zl-prg06:~# prtconf -v |grep Opteron
value='AMD Opteron(tm) Processor 244'
In http://support.amd.com/TechDocs/48063_15h_Mod_00h-0Fh_Rev_Guide.pdf the section 675 Instructions Performing Read-Modify-Write May Alter Architectural State Before #PF talks about impact on debugger. This makes sense given that watchmalloc
is primarly debugging tool/library.
One of the updates in the bug claimed the problem also happens on systems with Intel CPUs, specifically Intel(r) Xeon(r) CPU E5345. Intel® Xeon® Processor 5300 Series (which Intel Xeon E5345 is member of) Specification Update from Dec 2010 (http://www.intel.com/Assets/PDF/specupdate/315338.pdf) contains this part of Errata which I think describes the problem:
AJ26.Premature Execution of a Load Operation Prior to Exception Handler Invocation
Problem:If any of the below circumstances occur, it is possible that the load portion of theinstruction will have executed before the exception handler is entered.•If an instruction that performs a memory load causes a code segment limit viola-tion. •If a waiting X87 floating-point (FP) instruction or MMX™ technology (MMX) instruction that performs a memory load has a floating-point exception pending. •If an MMX or SSE/SSE2/SSE3/SSSE3 extensions (SSE) instruction that performs a memory load and has either CR0.EM=1 (Emulation bit set), or a floating-point Top-of-Stack (FP TOS) not equal to 0, or a DNA exception pending.
Implication:In normal code execution where the target of the load operation is to write back memory there is no impact from the load being prematurely executed, or from therestart and subsequent re-execution of that instruction by the exception handler. If the target of the load is to uncached memory that has a system side-effect, restarting theinstruction may cause unexpected system behavior due to the repetition of the side-effect. Particularly, while CR0.TS [bit 3] is set, a MOVD/MOVQ with MMX/XMM register operands may issue a memory load before getting the DNA exception.
Workaround:Code which performs loads from memory that has side-effects can effectively workaround this behavior by using simple integer-based load instructions when accessing side-effect memory and by ensuring that all code is written such that a code segment limit violation cannot occur as a part of reading from side-effect memory.
Why did the problem become visible only in build 69 ? Looking at the previous builds, the disassembly of the __getaddrinfo()
, namely the part when the structure was being copied has changed. This happened because the ON consolidation switched to new compiler version, namely Oracle Developer Studio 12.4 in that build. Before it was using just movq
and was moving 8 bytes at a time, after the switch started using the SSE instructions that move 16 byte at a time using the xmm
registers. This supports the widely accepted theory between SREs (Site Reliability Engineers) that says that actual incident is not a result of a single failure but multiple failures have to conspire to lead to a visible event. In this case the CPU bug met with compiler change (I don't consider the fact that watchmalloc
was used a contributing failure, it is merely a factor).
Here's the piece in build 68 (before the compiler change), using regular instructions:
libc.so.1`__getaddrinfo+0x545: movq -0x80(%rbp),%rax
libc.so.1`__getaddrinfo+0x549: movq -0x78(%rbp),%rcx
libc.so.1`__getaddrinfo+0x54d: movq %rax,(%rbx)
libc.so.1`__getaddrinfo+0x550: movq %rcx,0x8(%rbx)
libc.so.1`__getaddrinfo+0x554: movq -0x70(%rbp),%rax
libc.so.1`__getaddrinfo+0x558: movq -0x68(%rbp),%rcx
libc.so.1`__getaddrinfo+0x55c: movq %rax,0x10(%rbx)
libc.so.1`__getaddrinfo+0x560: movq %rcx,0x18(%rbx)
libc.so.1`__getaddrinfo+0x564: movq -0x60(%rbp),%rax
libc.so.1`__getaddrinfo+0x568: movq -0x58(%rbp),%rcx
libc.so.1`__getaddrinfo+0x56c: movq %rax,0x20(%rbx)
libc.so.1`__getaddrinfo+0x570: movq %rcx,0x28(%rbx)
libc.so.1`__getaddrinfo+0x574: movq $0x20,%rdi
libc.so.1`__getaddrinfo+0x57b: xorq %rax,%rax
libc.so.1`__getaddrinfo+0x57e: call -0x705f3 <PLT=libc.so.1`malloc>
and in build 69 (after the compiler change), using SSE instructions:
libc.so.1`__getaddrinfo+0x50c: movdqu -0x80(%rbp),%xmm0
libc.so.1`__getaddrinfo+0x511: movdqu -0x70(%rbp),%xmm1
libc.so.1`__getaddrinfo+0x516: movdqu %xmm0,(%rbx)
libc.so.1`__getaddrinfo+0x51a: movdqu %xmm1,0x10(%rbx)
libc.so.1`__getaddrinfo+0x51f: movdqu -0x60(%rbp),%xmm0
libc.so.1`__getaddrinfo+0x524: movdqu %xmm0,0x20(%rbx)
libc.so.1`__getaddrinfo+0x529: movq $0x20,%rdi
libc.so.1`__getaddrinfo+0x530: call -0x6e815 <PLT=libc.so.1`malloc>
How to fix the problem ? Upgrade to the next version of the CPU :-) (that likely has another set of bugs, that are potentially more dangerous) Another approach would be to add errata specific solution to the kernel code (the OS has bunch of these already). Something like: detect the buggy CPU revision. When a trap occurs, check if the first argument of movdqu
matches the address reported by the trap by given offset and if yes suppress the trap. Could lead to false suppressions and would make the code less maintainable. In fact, looking at the x86 trap specific code (see the snippet above) right above the call to pr_is_watchpage()
it has a bunch of lines for Workaround for AMD erratum 100, a different AMD CPU bug.
The v20zl
identifier means this is a Sun Fire V20z machine (https://en.wikipedia.org/wiki/Sun_Fire).
Anandtech article from 2005 has some interesting details on how these systems were produced under Andy Bechtolsheim that just happened to return to Sun Microsystems:
Eighteen months ago, Andy's return to Sun marked the company's Renaissance of new ideas, strategies and philosophies. Everything about Sun is getting a new image, from company motto all the way down to the entry level server designs. Mr. Bechtolsheim is personally responsible for the company's drive towards x64 and x86 processors; the V40z, V20z and W2000z were all children of his push for lower cost servers that relied on efficiency and innovation. Given that Andy only had 6 months to get the initial Opteron servers and workstations online, there wasn't much time to develop everything from scratch, which is why the older Sun x64 workstations and servers are mostly based on Newisys designs and components.
And I actually found an image with Scott McNealy (the founder of Sun Microsystems) featured with a V20z machine.
These machines were EOL'ed in 2007 so in 2015 when the problem was identified these were a bit hard to find even internally. Reproducing the problem internally would mean getting a hold of a Sun Fire V20z machine in one the labs. Currently I found just 2 of these, and the one local to me is dead and does not have a responding Service Processor (SP). This is basically embedded computer inside the chasis that allows to remote control the server (including turning the power on/off, accessing the console, etc.). According to Jakub, our lab master, the problem with the SP on V20z is that sometimes it loses its IP configuration settings and one has to go into the lab and configure it physically, yes using the buttons on the chasis. In COVID times when everything is locked down this will have to wait. The problem can also be reproduced on Intel CPUs however I ran out of time.
So, how did the bug actually ended ? Well, it lingered in the initial state until 2021 and seeing that I just closed it as "Product/Platform Obsolete".
Is there a way how to run Netcat (or any program using getaddrinfo()
to setup a TCP wildcard socket listener) with watchmalloc
even on these buggy CPUs ? The dynamic loader ld.so.1
supports tuning hardware capabilities per process and per library (as described in the Exercising a Capability Family chapter of the venerable 'Libraries and linkers guide') using the LD_HWCAP
environment variable howewer turning off SSE instructions does not seem to be supported anymore.