vladak/solaris-layered-debugging.md

## solaris-layered-debugging.md

      
    Raw
  

              solaris-layered-debugging.md
            
          
    Debugging a problem in Solaris through several OS layers

Background

This is a story about an operating system problem that occupied me at work in 2015 for couple of days. The goal is to give you an overview of debugging techniques and associated tools. I'd like to use this opportunity (writing this article for my students of the Unix programming class at the Mathematical-physical faculty of the Charles university in Prague in 2021) to provide some historical points, some background and some details about operating system work (I feel old writing this). As a bit of a spoiler, the cause of the problem turned out to be completely different than what it seemed initially and involved traversing multiple layers in the operating system. I love this kind of problems because it is fun and great learning experience. I vaguely remember a job advertisement in Netfix bunch of years back that demanded that the applicants posses intimate knowledge of the operating system internals down to (and including) the hardware and be able to fix problems spanning these layers within a single day. Given it took me couple of days to root cause this one (with extensive help of a colleague) I guess it would be hard for me to pass :-) I presented this problem at least twice, once at MFF UK (Charles university), once at Technical University Liberec. Actually, the very first time I talked about this problem was internally during what we called "bug of the week" which was an informal meeting where anyone could present any interesting problem he was working on (solved or not) in front of a whiteboard. We've just entered 2021 and I can no longer readily reproduce the problem (more on that later) so I will have to rely on my notes so the information flow might not be fully contiguous; please bear with me.
Before we dive into the meat of the problem, let me give you somewhat broad background. The operating system in question is Solaris. Solaris 2.0 (codename Callisto, based on System V rel. 4, different to BSD OS based SunOS) was released in june 1992. I was in basic school at that point, probably playing with 16-bit computers (Atari ST) and maybe even PCs. Solaris 10 that is still supported today (till 2024 according to the current information) was released in January 2005, around the time I was finishing my master studies at the university. In 2006 I joined Sun Microsystems as a sustaining engineer, to fix bugs in security technologies in Solaris, a dream come true. I remember reading Jeff Bonwick's whitepaper from 1994 about the slab allocator (https://dl.acm.org/doi/10.5555/1267257.1267263) in high school (during lectures) and now I found myself in the same company, not very far organization wise from the engineers working on Solaris internals (that actually still had Bonwick and other "stars" on the team). This was the period of OpenSolaris inception, and in general a period of opening the Solaris operating system development to the wide world.
This article would be quite nice for that period of time (however is untimely both forward and backward wise as you will learn below).
Netcat in Solaris

In 2007 I needed Netcat in Solaris for general use and testing so after some hurdles (mostly related to internal processes and dependencies such as warn/err functions missing in libc etc.) I took the version from OpenBSD (being long time fan of the project) and integrated it into the ON (OS/Net) consolidation of Solaris (the release is comprised from bits from various consolidations). I even wrote a test suite for Netcat from scratch and delivered it into the STC consolidation so it could be run periodically (automatically for the bi-weekly build of Solaris) and/or on demand. It was based on the CTI/TET framework (not very pleasant experience) but it did the job.
In 2009, a periodic run of the test suite uncovered a bug in the in-kernel bcopy()/kcopy() implementation. The fix for this problem was delivered by an engineer from Intel. I called that "collateral salutariness" (as opposed to collateral damage) because it prevented a regression from appearing in the bits delivered to the customers that could be hard to track down and could impact other components of the system in ways that would be hard to debug.
In 2010 Sun was bought by Oracle and the ON source code was closed. The development still continues internally and I believe the Netcat implementation I integrated to ON is still part of the OpenSolaris derived systems such as Illumos. Any source code references here are taken from the OpenSolaris code right at the point before it become closed.
Naming bug... or not

In March 2015 a bug was created against Netcat in Solaris by the ON-PIT team. This is a team that runs the infrastructure to test Solaris development builds. The members of the team also provide baselining and basic root causing. This was the period when Solaris 12 was being developed (later to be reworked and named Solaris 11.4). The problem appereared to be new in build 69 and seemed to impact only x86 based system (Solaris runs on both SPARC and x86 platforms). The test suite has tens of test purposes (in CTI/TET lingo a test cases covers a distinct functionality, while a test purpose covers part of that functionality), there were also tens of failures that seemed to share the same cause so the problem seemed to be external to Netcat. Also, there were no changes to Netcat for couple of builds.
What did I do as a first thing ? Correct, tried to reproduce it !
root@va64-v20zl-prg06:~# MALLOC_DEBUG=RW LD_PRELOAD=watchmalloc.so.1 nc -l 4444
Trace/Breakpoint Trap (core dumped)

root@va64-v20zl-prg06:~# pstack core
core 'core' of 15310:   nc -l 4444
 00007ff1acf91064 __getaddrinfo () + 524
 00007ff1ad000d9f __xnet_getaddrinfo () + f
 0000000000407889 local_listen () + 39
 0000000000406c15 main () + 17c5
 0000000000405204 ???????? ()

What does this do ? Taking the environment variables aside, it runs Netcat in server mode, making it listen on TCP socket on port 4444.
In the above output we can see actually 2 debugging tools in action. The first tool is the watchmalloc library that overrides the default heap allocator in libc (by using the LD_PRELOAD environment variable as a hint to the dynamic linker and the MALLOC_DEBUG environment variable to control the behavior of the library) and that provides error checking capabilities. When the program performs an operation that violates some of the rules, it is forcibly terminated, in this case with the SIGTRAP signal.
To look at the resulting core file, second tool is used, pstack can display the stack(s) of either running process or from a core dump. The __getaddrinfo() call comes from the standard (yes, I mean Single UNIX Specification, in this case POSIX) getaddrinfo() call.
Now, the test suite runs with the watchmalloc in order to enforce more stricness of how the Netcat should deal with memory on the heap.
At this point, it might be tempting to throw the bug over the fence to the Naming team (yes, silo-ing ia a thing in OS development), however it deserves a bit more investigation, especially given that there were no changes in the getaddrinfo() in the last couple of builds.
When I run the program without the watchmalloc library, everything is fine. What does actually happen inside nc and libc when the getaddrinfo() is called ? Let's use another tool, truss, that can trace both system and library calls (like ltrace on Linux):
root@va64-v20zl-prg06:~# truss -u libc::*getaddrinfo -t so_socket,accept,bind -v bind nc -l 4444
/1:     -> libc:__xnet_getaddrinfo(0x0, 0xffff80ffbffffe6e, 0xffff80ffbffff0f0, 0xffff80ffbffff0a8, 0x30, 0x101010101010101)
/1:       -> libc:__getaddrinfo(0x0, 0xffff80ffbffffe6e, 0xffff80ffbffff0f0, 0xffff80ffbffff0a8, 0x1, 0x101010101010101)
/1:       <- libc:__getaddrinfo() = 0
/1:     <- libc:__xnet_getaddrinfo() = 0
/1:     so_socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP, NULL, SOV_XPG4_2) = 3
/1:     bind(3, 0x0041FB80, 32, SOV_XPG4_2)             = 0
/1:             AF_INET6  name = ::  port = 4444
/1:             scope id = 0  source id = 0x0
/1:             flow class = 0x00  flow label = 0x00000
/1:     accept(3, 0xFFFF80FFBFFFF720, 0xFFFF80FFBFFFF17C, SOV_DEFAULT) (sleeping...)

Looks like the result of getaddrinfo() is used to populate the arguments of the socket() and bind() system calls. In this case, a IPv6 TCP wildcard socket is created (that can actually accept IPv4 connections as well).
Disassembly to source matching

From the core file we know that the crash happened in libc's __getaddrinfo+0x524. How to identify the source code line matching this offset ?
Let's try to match the C code to the disassembly. This can be done either with the dis command:
$ dis -F __getaddrinfo /lib/64/libc.so.1

or using the mdb debugger:
$ mdb core << EOF
$c
libc.so.1`__getaddrinfo+0x524::dis
EOF

Note that nc has been converted to 64-bit program some time ago, hence the /64 path component in the above command. Therefore we will be dealing solely with 64-bit environment.
The advantage of using the core file (or live program stopped at the getaddrinfo() function before it crashes) over the dis command is that the relocations have been performed and PLT (Procedure Linkage Table) entries are populated so that we can see the actual function names.
You might ask: given this problem can be reproduced at will, why not just to single step the program and identify the crashing instruction in debugger or use debug version of the libc that has source code information embedded ? These 2 conditions are quite rare in the real world, especially when dealing with kernel crashes (for which you usually obtain core file as well) so engineers are usually quite well versed in mapping disassembly to source code so why not to hone this skill further ? This is actually a critical skill for anyone who ever worked in Solaris sustaining. I used to have my table (sometimes the floor as well) at work covered with prints of disassembly annotated with C source code.
Let's go:
> libc.so.1`__getaddrinfo+0x524::dis
libc.so.1`__getaddrinfo+0x4f4:  movq   $0x30,%rdi
libc.so.1`__getaddrinfo+0x4fb:  call   -0x6e7e0 <PLT=watchmalloc.so.1`malloc>


this is malloc(0x30) (sizeof (struct addrinfo) = 0x30 = 48 bytes)

libc.so.1`__getaddrinfo+0x500:  movq   %rax,%rbx
libc.so.1`__getaddrinfo+0x503:  testq  %rbx,%rbx
libc.so.1`__getaddrinfo+0x506:  je     +0x1f8   <libc.so.1`__getaddrinfo+0x704>


check the malloc() return value, rbx now contains the address of a structure to be later identified as nai.

libc.so.1`__getaddrinfo+0x50c:  movdqu -0x80(%rbp),%xmm0


move 16 bytes at offset 0 of a structure to be later identified as aip to xmm0.

libc.so.1`__getaddrinfo+0x511:  movdqu -0x70(%rbp),%xmm1


move 16 bytes at offset 16 of aip to xmm1

libc.so.1`__getaddrinfo+0x516:  movdqu %xmm0,(%rbx)


move the contents of xmm0 to nai at offset 0

libc.so.1`__getaddrinfo+0x51a:  movdqu %xmm1,0x10(%rbx)


move the contents of xmm1 to nai offset 16

libc.so.1`__getaddrinfo+0x51f:  movdqu -0x60(%rbp),%xmm0


move 16 bytes at offset 32 of aip to xmm0

this is the last piece


libc.so.1`__getaddrinfo+0x524:  movdqu %xmm0,0x20(%rbx)    <<< SIGTRAP


move xmm0 to offset 32 of nai

libc.so.1`__getaddrinfo+0x529:  movq   $0x20,%rdi
libc.so.1`__getaddrinfo+0x530:  call   -0x6e815 <PLT=watchmalloc.so.1`malloc>


call malloc(0x20) (0x20 is sizeof (struct sockaddr_in6))

libc.so.1`__getaddrinfo+0x535:  movq   %rax,0x20(%rbx)
libc.so.1`__getaddrinfo+0x539:  testq  %rax,%rax
libc.so.1`__getaddrinfo+0x53c:  je     +0x1ba   <libc.so.1`__getaddrinfo+0x6fc>
libc.so.1`__getaddrinfo+0x542:  movq   %rax,%rdi
libc.so.1`__getaddrinfo+0x545:  movq   $0x20,%rsi


bzero() works well for the matching since __getaddrinfo() contains just 2 calls to bzero()

libc.so.1`__getaddrinfo+0x54c:  call   -0x6b2e1 <libc.so.1`bzero>
libc.so.1`__getaddrinfo+0x551:  movl   $0x20,0x10(%rbx)
libc.so.1`__getaddrinfo+0x558:  movl   $0x1a,%eax

This matches this part of _getaddrinfo() implementation in libc (Netcat is called in listener more without address specification so the hostname pointer is NULL so this makes sense):
429  	/*
430  	 * hostname is NULL
431  	 * case 1: AI_PASSIVE bit set : anyaddr 0.0.0.0 or ::
432  	 * case 2: AI_PASSIVE bit not set : localhost 127.0.0.1 or ::1
433  	 */
434  
435  	if (hostname == NULL) {
436  		struct addrinfo *nai;
437  		socklen_t addrlen;
438  		char *canonname;
439  
440  		if (aip->ai_family == PF_INET)
441  			goto v4only;
442  		/* create IPv6 addrinfo */
443  		nai = malloc(sizeof (struct addrinfo));      << __getaddrinfo+0x4fb
444  		if (nai == NULL)
445  			goto nomem;
446  		*nai = *aip;                                 << here the process got SIGTRAP !!!
447  		addrlen = sizeof (struct sockaddr_in6);
448  		nai->ai_addr = malloc(addrlen);              << __getaddrinfo+0x530
449  		if (nai->ai_addr == NULL) {
450  			freeaddrinfo(nai);
451  			goto nomem;
452  		}
453  		bzero(nai->ai_addr, addrlen);                << __getaddrinfo+0x54c
454  		nai->ai_addrlen = addrlen;
455  		nai->ai_family = PF_INET6;
456  		nai->ai_canonname = NULL;
The operation on line 446 is structure to structure copy, i.e. the aip structure is copied to the nai structure, so in the end they should match byte by byte. In our case this operation never finished because of the SIGTRAP.
Looking at the source of the copy operation, i.e. the aip structure:
<rbp-0x80=K
                ffff80ffbfffefc0 
> ffff80ffbfffefc0::print struct addrinfo
{
    ai_flags = 0x8
    ai_family = 0
    ai_socktype = 0x2
    ai_protocol = 0x6
    ai_addrlen = 0
    ai_canonname = 0
    ai_addr = 0
    ai_next = 0
}
> ::sizeof struct addrinfo
sizeof (struct addrinfo) = 0x30

The 4 bytes values at the start of the buffer are:

8 is ai_flags, this is AI_PASSIVE
0 is ai_family, not being set yet
2 is ai_socktype, this is SOCK_STREAM
6 is ai_protocol, this is tcp

So this indeed looks like a valid addrinfo structure:
struct addrinfo {
        int ai_flags;           /* AI_PASSIVE, AI_CANONNAME, ... */
        int ai_family;          /* PF_xxx */
        int ai_socktype;        /* SOCK_xxx */
        int ai_protocol;        /* 0 or IPPROTO_xxx for IPv4 and IPv6 */
#ifdef __sparcv9
        int _ai_pad;            /* for backwards compat with old size_t */
#endif /* __sparcv9 */
        socklen_t ai_addrlen;
        char *ai_canonname;     /* canonical name for hostname */
        struct sockaddr *ai_addr;       /* binary address */
        struct addrinfo *ai_next;       /* next structure in linked list */
};

How does the debugger actually know the offsets of the members of the structures ? In this case it uses so called CTF data that is attached to the libc binary during the build (C-Type Format compact representation of the data types used for building the binary, used also for the kernel) however there are other debugging metadata types such as DWARF (in fact in some cases the CTF is built from DWARF on Solaris).
Let's take a look at the rbx register because this is the malloc() return value, i.e. the nai structure:
> <rbx=K
                41cba0   

and display the contents of this buffer (20X means display 20 bytes in 4 byte quantities) at the point of the crash:
> 41cba0/20X
0x41cba0:       8               0               2               6               0               0               0               0               baddcafe        
                baddcafe        baddcafe        baddcafe        30              0               baddcafe        baddcafe        41cc10          0               
                deadbeef        deadbeef        

SIGTRAP was delivered when it was writing 16 bytes on 0x41cba0 + 0x20 = 0x41cbc0 , however the structure does not end there since the allocator returns a buffer of 48 bytes; these 4 baddcafe patterns should be writeable. The watchmalloc(3MALLOC) man page explains this pattern:

The malloc() function returns with the allocated memory  filled with the pattern 0xbaddcafe as a precaution against applications incorrectly expecting to  receive  back  unmodified  memory  from  the  last free().

I.e. the first 32 bytes were already copied. The last 16 bytes were not because the buffer still has these patterns. Here is how the buffer got populated, time going downwards:
 41cba0            41cbb0            41cbc0           41cbd0
   +--------+--------+--------+--------+--------+--------+
   |        |        |        |        |        |        |
   +--------+--------+--------+--------+--------+--------+

   ^^^^^^^^^^^^^^^^^^
   8 0 2 6

                      ^^^^^^^^^^^^^^^^^
	              zeroes	      
		                        ^^^^^^^^^^^^^^^^^^
					unfinished (SIGTRAP)

The 30 0 bytes following these baddcafe words are heap allocator (in this case watchmalloc) matadata. More on that below.
ASLR

Debugging with ever changing pointer is tedious. The heap pointer changes because of ASLR (Address Space Layout Randomization). In fact it changes the location of heapu, stack, anon segment, libraries (text/data stay put).
By default ASLR is turned on only for binaries that contain certain pieces of information recorded by the link-editor that they are to be run with ASLR. The information is stored in the .dynamic section of the ELF file. This can be seen with the elfdump command (another debugging tool, provides similar functionality like readelf/objdump on Linuxu):
$ elfdump -d /usr/bin/nc | grep -i aslr 
     [37]  SUNW_ASLR       0x2        ENABLE

How to disable it ? This functionality is part of the Solaris Security extensions framework that can be controlled using the sxadm command, e.g. this disables this for single run of a program:
sxadm exec -s aslr=disable /usr/bin/pmap self

It is certainly possible to disable it system wide:
    # sxadm disable aslr
    # sxadm status
    EXTENSION           STATUS                        CONFIGURATION
    aslr                disabled                      disabled
    nxstack             enabled (all)                 default (default)
    nxheap              enabled (tagged-files)        default (default)

This is a bit too hard core, is is sufficient to run a shell with ASLR disabled and the setting will be inherited to the child processes:
      sxadm exec -s aslr=disable /bin/bash

It should be also possible to edit the ELF binary itself (perhaps using the elfedit tool, I haven't checked).
How does watchmalloc work ?

watchmalloc is a simple heap allocator. It delivers the same overall functionality as the default heap allocator in libc however provides means to enforce correctness. For example, it crashes the application whenever it tries to write to free'd memory. It also employs red zones that write protect allocator metadata. The way a heap allocator works is that it requests memory from the kernel and then divides and conquers this memory. Each buffer managed by the heap allocator and handed out to the caller (e.g. via malloc()/calloc() etc.) has a header carrying the metadata. The metadata is used for the buffer management (e.g. connecting the buffer to the data structure used by the allocator, for coalescing of the buffers, ...) and misused by heap overflow exploits.
When watchmalloc is run with WATCH or RW flag, it will request kernel to protect the buffer headers carrying the metadata. Let's take a look at how this is done.
The main data structures used:
/*
 * All of our allocations will be aligned on the least multiple of 4,
 * at least, so the two low order bits are guaranteed to be available.
 */
#ifdef _LP64
#define	ALIGN		16
#else
#define	ALIGN		8
#endif

/* the proto-word; size must be ALIGN bytes */
typedef union _w_ {
        size_t          w_i;            /* an unsigned int */
        struct _t_      *w_p[2];        /* two pointers */
} WORD;

/* structure of a node in the free tree */
typedef struct _t_ {
        WORD    t_s;    /* size of this element */
        WORD    t_p;    /* parent node */
        WORD    t_l;    /* left child */
        WORD    t_r;    /* right child */
        WORD    t_n;    /* next in link list */
        WORD    t_d;    /* dummy to reserve space for self-pointer */
} TREE;
The size of WORD in 64-bit program is 16 bytes. The free list is normally represented by self balancing binary tree.
For small buffers (less than MINSIZE which is defined as sizeof (TREE) - WORDSIZE and equals to 96 - 16 = 80 bytes in 64-bit program) malloc() calls smalloc() as a fast path.
I added couple of debug prints to smalloc() (could have traced the watchmalloc library calls with truss however the LD_PRELOAD of watchmalloc would affect the heap allocations performed by the truss itself) to print the addresses in the main path of the function:
/*
 * Keep lists of small blocks, LIFO order.
 */
static	TREE	*List[MINSIZE/WORDSIZE-1];
static	TREE	*Last[MINSIZE/WORDSIZE-1];

...

static void *
smalloc(size_t size)
{
	TREE	*tp;
	size_t	i;
 
 ...
 
 
	/* list to use */
	i = size / WORDSIZE - 1;
 
 ...
 
 /* allocate from the head of the queue */
	tp = List[i];
	unprotect(tp);
	if ((List[i] = AFTER(tp)) == NULL)
		Last[i] = NULL;
	copy_pattern(LIVEPAT, tp);  // LIVEPAT = 0xbaddcafebaddcafeULL
	SETBIT0(SIZE(tp));  // marks the buffer as 'busy', #define	SIZE(b)		(((b)->t_s).w_i)
	protect(tp);
	return (DATA(tp)); // #define	DATA(b)		(((char *)(b)) + WORDSIZE)
}
In our case the debug prints yield:
  UNPROTECT 41cb90 (16)
SMALLOC 41cb90 41cba0 49
  PROTECT 41cb90 16

The allocation requested 48 bytes. The tp is 41cb90, smalloc() returned 41cba0, which is 16 bytes further.
Only the metadata header is therefore protected. This is also mentioned in the wathmalloc(3MALLOC) man page:

   A  header  is  maintained  before  each block of allocated memory. Each
  header is covered with a watched area, thereby  providing  a  red  zone
  before  and  after  each  block of allocated memory (the header for the
  subsequent memory block serves as the trailing red zone for its preced-
  ing  memory  block).  Writing  just before or just after a memory block
  returned by malloc() will trigger a watchpoint trap.


However, how come the smalloc() actually returns the memory of the TREE structure ? This is simply a trick used by the allocator. For these small buffers only the first member of the TREE structure matters. The rest is used for the data of the buffer. That's why the MINSIZE is computed as sizeof (TREE) - sizeof (WORD). Antoher interesting trick is the use of lower bits in the address to store additional metadata like the used buffer indication. This is possible due to the buffers returned from malloc() being aligned and therefore having low bits always zero.
The write protection is something the program cannot do by itself; it needs kernel to do it. How exactly does watchmalloc request the header protection from the kernel ? It uses the procfs pseudo file system in protect(tp).
The protect() call does this:
protect(TREE *tp)

...

    ctl.cmd = PCWATCH;
    ctl.prwatch.pr_vaddr = (uintptr_t)tp;
    ctl.prwatch.pr_size = size + WORDSIZE;
    ctl.prwatch.pr_wflags = wflags;
    (void) write(ctlfd, &ctl, sizeof (ctl));
where ctlfd is file descriptor from open of /proc/<pid>/ctl which can be seen in mdb via the ::pfiles d-command or via the pfiles(1) program (here we can see the former in action):
> ffffc10007005160::pfiles
FD   TYPE            VNODE INFO
   0  CHR ffffc1000aad2740 /dev/pts/1 
   1  CHR ffffc1000aad2740 /dev/pts/1 
   2  CHR ffffc1000aad2740 /dev/pts/1 
   3 PROC ffffc106e5bc5340 /proc/15451/ctl (proc=ffffc10007005160)

What does procfs do with the request in the kernel ? pcontrol.c#pr_watch():
    pwa = kmem_alloc(sizeof (struct watched_area), KM_SLEEP);
    pwa->wa_vaddr = (caddr_t)vaddr;
    pwa->wa_eaddr = (caddr_t)vaddr + size;
    pwa->wa_flags = (ulong_t)wflags;

    error = ((pwa->wa_flags & ~WA_TRAPAFTER) == 0)?
        clear_watched_area(p, pwa) : set_watched_area(p, pwa);
set_watched_area() does this:
    target = pr_find_watched_area(p, pwa, &where);

...

            avl_insert(&p->p_warea, pwa, where);
            error = set_watched_page(p, vaddr, eaddr, flags, 0);

p is proc_t pointer (this is a giant kernel structure representing a process)
pwa is a structure saved into an AVL tree in proc_t

the tree is created in getproc() like so:


    /*
     * Initialize watchpoint structures
     */
    avl_create(&cp->p_warea, wa_compare, sizeof (struct watched_area),
        offsetof(struct watched_area, wa_link));


and wa_compare() is the comparison function based on the start of the protected memory chunk:

    if (pa->wa_vaddr < pb->wa_vaddr)
            return (-1);
    else if (pa->wa_vaddr > pb->wa_vaddr)
            return (1);
    else
            return (0);
The granularity of the protection is a page, i.e. every page that has at least one piece of protected metadata is set as write-protected and whenever the process performs write it will cause a trap. This is one of the reasons why watchmalloc incurs significant performance penalty. Given that there will be most likely other (legitimate) writes to different locations of a page that has at least one watchmalloc metadata buffer and therefore is write protected, the lookup whether the write falls into one of the protected areas within the page needs to be quick. Hence the AVL tree. The watchmalloc man page talks about making the program 10 to 1000 times slower, depending on its allocation pattern.
The trap is handled in trap() as a page-fault (#PF). Here is x86 specific trap handling code:
746  		ASSERT(!(curthread->t_flag & T_WATCHPT));
747  		watchpage = (pr_watch_active(p) && pr_is_watchpage(addr, rw));
748  		vaddr = addr;
749  		if (!watchpage || (sz = instr_size(rp, &vaddr, rw)) <= 0)
750  			fault_type = (errcode & PF_ERR_PROT)? F_PROT: F_INVAL;
751  		else if ((watchcode = pr_is_watchpoint(&vaddr, &ta,
752  		    sz, NULL, rw)) != 0) {
753  			if (ta) {
754  				do_watch_step(vaddr, sz, rw,
755  				    watchcode, rp->r_pc);
756  				fault_type = F_INVAL;
757  			} else {
758  				bzero(&siginfo, sizeof (siginfo));
759  				siginfo.si_signo = SIGTRAP;
760  				siginfo.si_code = watchcode;
761  				siginfo.si_addr = vaddr;
762  				siginfo.si_trapafter = 0;
763  				siginfo.si_pc = (caddr_t)rp->r_pc;
764  				fault = FLTWATCH;
765  				break;
766  			}
767  		} else {
768  			/* XXX pr_watch_emul() never succeeds (for now) */
769  			if (rw != S_EXEC && pr_watch_emul(rp, vaddr, rw))
770  				goto out;
771  			do_watch_step(vaddr, sz, rw, 0, 0);
772  			fault_type = F_INVAL;
773  		}
774  
775  		res = pagefault(addr, fault_type, rw, 0);
It is therefore necessary to find out whether given memory chunk accessed by the movdqu instruction matches one of the nodes in the AVL tree.
Some interesting tidbits:

instr_size() calls the dtrace disassembler to determine how big is the memory chunk is being read/written by the instruction

pr_is_watchpoint() then finds out whether the address on which the trap happened (vaddr) matches (sz) one of the tree nodes


At this point my hypothesis was that this is a bug in watchmalloc or procfs.
To confirm it, we will need to look at the AVL tree nodes to see the list of protected memory chunks.
Again, thanks to ASLR being off, the addresses will remain the same across multiple program executions.
Instead of getting SIGTRAP I will break the program execution so that the in-kernel structures can be inspected:
> ::pgrep nc
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  15451  15418  15451  15305      0 0x4a004000 ffffc10007005160 nc

> ffffc10007005160::print -a proc_t p_warea
ffffc10007005348 p_warea = {
    ffffc10007005348 p_warea.avl_root = 0xffffc106e5bbbb60
    ffffc10007005350 p_warea.avl_compar = wa_compare
    ffffc10007005358 p_warea.avl_offset = 0
    ffffc10007005360 p_warea.avl_numnodes = 0x103
    ffffc10007005368 p_warea.avl_size = 0x30
}
> ffffc10007005160::print -a proc_t p_warea | ::walk avl | ::print struct watched_area wa_vaddr wa_eaddr ! head
wa_vaddr = 0x41cb40
wa_eaddr = 0x41cb50
wa_vaddr = 0x41cb50
wa_eaddr = 0x41cb60
wa_vaddr = 0x41cb90
wa_eaddr = 0x41cba0
wa_vaddr = 0x41cbd0
wa_eaddr = 0x41cc10
wa_vaddr = 0x41cc10
wa_eaddr = 0x41cc50

This is the relevant part of the output:
wa_eaddr = 0x41cb60
wa_vaddr = 0x41cb90
wa_eaddr = 0x41cba0
wa_vaddr = 0x41cbd0
wa_eaddr = 0x41cc10

The protected memory region is [0x41cb90, 0x41cba0) (which matches the debug output of smalloc() above) and the next one is [0x41cbd0, 0x41cc10). This means that data buffer [0x41cba0, 0x41cbd0) which is 48 bytes is not protected.
Here is how it looks like in memory:
  41cb90           41cba0            41cbb0            41cbc0            41cbd0           41cc10
   +--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
   |  red      zone  |                 b  u  f  f  e  r                    |   red     zone  |
   +--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+

This means this is not a fault of either procfs or watchmalloc.
User registers

This was the moment where I engaged my colleague Tomas Jedlicka (who used to be actually one of my students in the Unix programming class at MFF UK; now works at Apple I believe) sitting in same area (these were the times before COVID) who was working as Solaris kernel developer at that point (and before that he was a member of the Solaris naming team responsible for implementations of technologies such as DNS or LDAP and also APIs like getaddrinfo(), so that was perfect fit :-)).
How to verify that at given moment rbx contained indeed 0x41cba0 ? (i.e. to make sure the address did not get garbled on the way) Let's have a look at the kernel thread stack for the LWP (Light Weight Process) in given process:
> ::pgrep nc
S    PID   PPID   PGID    SID    UID      FLAGS             ADDR NAME
R  15451  15418  15451  15305      0 0x4a004000 ffffc10007005160 nc
> ffffc10007005160::walk thread | ::findstack
stack pointer for thread ffffc10005f89240: fffffffc810e7d20
[ fffffffc810e7d20 _resume_from_idle+0x14f() ]
  fffffffc810e7d50 swtch+0x165()
  fffffffc810e7dd0 stop+0x7d4()
  fffffffc810e7e10 stop_on_fault+0xa6()
  fffffffc810e7ef0 trap+0x1369()
  fffffffc810e7f00 0xfffffffffb8001a7()

Need to look into the trap frame. After the page fault the HW saved the registers to stack:
> fffffffc810e7ef0/10K
0xfffffffc810e7ef0:             fffffffc810e7f00 fffffffffb8001a7 0               7ff1acf91064    7ff1a55f4110    0               0               0               
                0               ffffc10005f89240 
> fffffffc810e7f00::print struct regs
{
    r_savfp = 0
    r_savpc = 0x7ff1acf91064
    r_rdi = 0x7ff1a55f4110
    r_rsi = 0
    r_rdx = 0
    r_rcx = 0
    r_r8 = 0
    r_r9 = 0xffffc10005f89240
    r_rax = 0x41cba0
    r_rbx = 0x41cba0                 << yep, this is it !
    r_rbp = 0xffff80ffbffff040
    r_r10 = 0x10
    r_r11 = 0x246
    r_r12 = 0xffff80ffbffff068
    r_r13 = 0xffff80ffbfff5c11
    r_r14 = 0
    r_r15 = 0
    __r_fsbase = 0
    __r_gsbase = 0
    r_ds = 0
    r_es = 0
    r_fs = 0
    r_gs = 0
    r_trapno = 0xe                 << this is $14, it matches the page fault trap (uts/intel/ia32/ml/exceptions.s)
    r_err = 0x7
    r_rip = 0x7ff1acf91064
    r_cs = 0x53
    r_rfl = 0x10206
    r_rsp = 0xffff80ffbfffefa0
    r_ss = 0x4b
}

dtrace

If procfs or the watch point infrastructure itself is not at fault, we need to look further down, at the addresses passed to pr_is_watchpoint(). The ideal tool for this is dtrace, a dynamic instrumentation tool (I sort of hesitate to call it a tool because it opens so many possibilities to observe the system and therefore allows to ask questions that would not be possible to answer with other tools), specifically the FBT (Function Boundary Tracing provider) using this one liner:
$ dtrace -n 'fbt::trap:entry/execname == "nc"/ { printf("%p", arg1); self->t = 1; } fbt::pr_is_watchpoint:entry/execname == "nc" && self->t/ { printf("%p %ld", *((caddr_t *)arg0), arg2); } fbt::pr_is_watchpoint:return/self->t/ { trace(arg1); } fbt::trap:return/self->t/ { self->t = 0; }'

This allows to trace kernel function calls that are result of specific userland function calls. In this case we only want to trace calls to trap that happened from within a nc program (could as well used a specific PID however on the test system I was sure no one else would run nc). Next, I wanted to see the arguments of the pr_is_watchpoint() call that happened from within the call tree of that specific trap call.
Looking at the output from dtrace the last call to pr_is_watchpoint() is:
   0  17077                       trap:entry 0
   0  17077                       trap:entry ffff4ff0
   0  17077                       trap:entry 41cbc8
   0  33469           pr_is_watchpoint:entry 41cbc8 16
   0  33470          pr_is_watchpoint:return                 4

But rbx was 41cba0, this is 40 bytes away. The 16 in the pr_is_watchpoint:entry probe is the size of the data in bytes, i.e. the width of the instruction argument. This makes sense given that movdqu copies 16 byte buffers.
Our buffer is [0x41cba0, 0x41cbd0) , i.e. 0x41cbc8+0t16 (0t is prefix used in mdb for decimal number representation) is 0x41cbd8, this is indeed beyond the boundary of the buffer. In fact it falls into the red zone (i.e. write protected) of the next buffer. From the AVL tree dump above we know the protected areas:
  41cb90           41cba0            41cbb0            41cbc0   41cbc8   41cbd0    41cbd8   41cc10
   +--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
   |  red      zone  |                 b  u  f  f  e  r                    |   red     zone  |
   +--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
                                                                   xxxxxxxxxxxxxxxxx

The dtrace output tells us that the return value from pr_is_watchpoint() is 4, i.e.:
155 #define      TRAP_WWATCH     4       /* write access watchpoint trap */

Simply, to the watchpoint code this looks like as if there was an instruction trying to write 16 bytes at 0x41cbc8. Given this matches a protected area, it has to fail. From the disassembly we know that the actual instruction was in fact writing 16 bytes at 41cbc0.
Where did the VA come from ? It is being passed to the call to trap():
461 trap(struct regs *rp, caddr_t addr, processorid_t cpuid)

How is trap() being called ? From the table defined in uts/intel/ia32/ml/exception.s:
1054    /*
1055     * #PF
1056     */
1057    ENTRY_NP(pftrap)
1058    TRAP_ERR(T_PGFLT)       /* $14 already have error code on stack */
1059    INTR_PUSH
1060#if defined(__xpv)
1061
1062#if defined(__amd64)
1063    movq    %gs:CPU_VCPU_INFO, %r15
1064    movq    VCPU_INFO_ARCH_CR2(%r15), %r15  /* vcpu[].arch.cr2 */
1065#elif defined(__i386)
1066    movl    %gs:CPU_VCPU_INFO, %esi
1067    movl    VCPU_INFO_ARCH_CR2(%esi), %esi  /* vcpu[].arch.cr2 */
1068#endif      /* __i386 */
1069
1070#else       /* __xpv */
1071
1072#if defined(__amd64)
1073    movq    %cr2, %r15
1074#elif defined(__i386)
1075    movl    %cr2, %esi
1076#endif      /* __i386 */
1077
1078#endif      /* __xpv */
1079    jmp     cmntrap_pushed
1080    SET_SIZE(pftrap)
cr2 is copied to r15 which becomes 2nd argument to trap() (%rsi). cr2 comes from the HW, containes the VA on which the page fault happened.
Trap tracing

We need to dig even deeper, to see how the traps are actually born. There is a nifty tool in Solaris called trap trace. The basic idea of trap-trace is to reprogram IDT on Intel, or the tsb register on SPARC.
  ttrace enable -K
  ttrace config -K

Given the amount of traps it is necessary to expand the buffer for saving the data:
  ttrace config -K -b 200k

and turn it off right after the operation is done:
  ttrace disable -K

One can display the buffer:
  ttrace dump

or better
    ttrace dump -x

that will output like this:
  0  26911539fe9c8e trap e   page-fault        0x7ffe60f91064
         rdi:     7ffe58614110 rsi:                0 rdx:                0
         rcx:                0  r8:                0  r9: ffffc1000b6f0640
         rax:           41cba0 rbx:           41cba0 rbp: ffff80ffbffff030
         r10:               10 r11:              246 r12: ffff80ffbffff058
         r13: ffff80ffbfff5c11 r14:                0 r15:                0
          ds:                0  es:                0  fs:                0
          gs:                0 trp:                e err:                7
         rip:     7ffe60f91064  cs:               53 rfl:            10206
         rsp: ffff80ffbfffef90  ss:               4b cr2:           41cbc8
                                                     ^^^^^^^^^^^^^^^^^^^^^

So, the CPU is emitting weird address. How could this happen ?
The third movdqu fails, the beginning is on 0x41cba0, the first 2 instructions copy [0x41cba0, 0x41cbb0). The third copies [0x41cbb0, 0x41cbc0) and the trap was delivered on 0x41cbc8, i.e. in the half of the 16 byte chunk of the buffer.
In fact, each movdqu instruction on this setup generates 2 traps. One for the first 8 bytes, one for the last 8 bytes. The first 5 traps when the structure is copied are okay, the last one causes the problem. The following diagram highlights the addresses in these traps (again, time goes downwards):
  41cb90           41cba0            41cbb0            41cbc0   41cbc8   41cbd0    41cbd8   41cc10
   +--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
   |  red      zone  |                 b  u  f  f  e  r                    |   red     zone  |
   +--------+--------+--------+--------+--------+--------+--------+--------+--------+--------+
                     ^                 
		     xxxxxxxxxxxxxxxxxx
		              ^
			      xxxxxxxxxxxxxxxxxx
			               ^
				       xxxxxxxxxxxxxxxxxx
				                 ^
						 xxxxxxxxxxxxxxxxxx
						          ^
							  xxxxxxxxxxxxxxxxxx
							           ^
								   xxxxxxxxxxxxxxxxxx

The dtrace disassembler sees movdqu instruction and the address and it knows the instruction copies 16 byte chunks, hence the red zone overlap for the last one.
This means that internally the procesor likely works with 8 bytes and exposes this unintentionally. The CPU should be sending just one trap for each movdqu instruction with address equal to the destination argument.
CPU bug

All in all, this is a CPU bug. Specifically the CPU in my test system is AMD Opteron 244:
root@va64-v20zl-prg06:~# prtconf -v |grep Opteron
                    value='AMD Opteron(tm) Processor 244'

In http://support.amd.com/TechDocs/48063_15h_Mod_00h-0Fh_Rev_Guide.pdf the section 675 Instructions Performing Read-Modify-Write May Alter Architectural State Before #PF talks about impact on debugger. This makes sense given that watchmalloc is primarly debugging tool/library.
One of the updates in the bug claimed the problem also happens on systems with Intel CPUs, specifically Intel(r) Xeon(r) CPU           E5345. Intel® Xeon® Processor 5300 Series (which Intel Xeon E5345 is member of) Specification Update from Dec 2010 (http://www.intel.com/Assets/PDF/specupdate/315338.pdf) contains this part of Errata which I think describes the problem:

AJ26.Premature Execution of a Load Operation Prior to Exception Handler Invocation


Problem:If  any  of  the  below  circumstances  occur,  it  is  possible  that  the  load  portion  of  theinstruction will have executed before the exception handler is entered.•If an instruction that performs a memory load causes a code segment limit viola-tion. •If a waiting X87 floating-point (FP) instruction or MMX™ technology (MMX) instruction that performs a memory load has a floating-point exception pending. •If an MMX or SSE/SSE2/SSE3/SSSE3 extensions (SSE) instruction that performs a memory load and has either CR0.EM=1 (Emulation bit set), or a floating-point Top-of-Stack (FP TOS) not equal to 0, or a DNA exception pending.


Implication:In  normal  code  execution  where  the  target  of  the  load  operation  is  to  write  back memory  there  is  no  impact  from  the  load  being  prematurely  executed,  or  from  therestart and subsequent re-execution of that instruction by the exception handler. If the target of the load is to uncached memory that has a system side-effect, restarting theinstruction  may  cause  unexpected  system  behavior  due  to  the  repetition  of  the  side-effect. Particularly, while CR0.TS [bit 3] is set, a MOVD/MOVQ with MMX/XMM register operands may issue a memory load before getting the DNA exception.


Workaround:Code which performs loads from memory that has side-effects can effectively workaround this behavior by using simple integer-based load instructions when accessing side-effect memory and by ensuring that all code is written such that a code segment limit violation cannot occur as a part of reading from side-effect memory.

Closing remarks

Why did the problem become visible only in build 69 ? Looking at the previous builds, the disassembly of the __getaddrinfo(), namely the part when the structure was being copied has changed. This happened because the ON consolidation switched to new compiler version, namely Oracle Developer Studio 12.4 in that build. Before it was using just movq and was moving 8 bytes at a time, after the switch started using the SSE instructions that move 16 byte at a time using the xmm registers. This supports the widely accepted theory between SREs (Site Reliability Engineers) that says that actual incident is not a result of a single failure but multiple failures have to conspire to lead to a visible event. In this case the CPU bug met with compiler change (I don't consider the fact that watchmalloc was used a contributing failure, it is merely a factor).
Here's the piece in build 68 (before the compiler change), using regular instructions:
libc.so.1`__getaddrinfo+0x545:  movq   -0x80(%rbp),%rax
libc.so.1`__getaddrinfo+0x549:  movq   -0x78(%rbp),%rcx
libc.so.1`__getaddrinfo+0x54d:  movq   %rax,(%rbx)
libc.so.1`__getaddrinfo+0x550:  movq   %rcx,0x8(%rbx)
libc.so.1`__getaddrinfo+0x554:  movq   -0x70(%rbp),%rax
libc.so.1`__getaddrinfo+0x558:  movq   -0x68(%rbp),%rcx
libc.so.1`__getaddrinfo+0x55c:  movq   %rax,0x10(%rbx)
libc.so.1`__getaddrinfo+0x560:  movq   %rcx,0x18(%rbx)
libc.so.1`__getaddrinfo+0x564:  movq   -0x60(%rbp),%rax
libc.so.1`__getaddrinfo+0x568:  movq   -0x58(%rbp),%rcx
libc.so.1`__getaddrinfo+0x56c:  movq   %rax,0x20(%rbx)
libc.so.1`__getaddrinfo+0x570:  movq   %rcx,0x28(%rbx)
libc.so.1`__getaddrinfo+0x574:  movq   $0x20,%rdi
libc.so.1`__getaddrinfo+0x57b:  xorq   %rax,%rax
libc.so.1`__getaddrinfo+0x57e:  call   -0x705f3 <PLT=libc.so.1`malloc>

and in build 69 (after the compiler change), using SSE instructions:
libc.so.1`__getaddrinfo+0x50c:  movdqu -0x80(%rbp),%xmm0
libc.so.1`__getaddrinfo+0x511:  movdqu -0x70(%rbp),%xmm1
libc.so.1`__getaddrinfo+0x516:  movdqu %xmm0,(%rbx)
libc.so.1`__getaddrinfo+0x51a:  movdqu %xmm1,0x10(%rbx)
libc.so.1`__getaddrinfo+0x51f:  movdqu -0x60(%rbp),%xmm0
libc.so.1`__getaddrinfo+0x524:  movdqu %xmm0,0x20(%rbx)
libc.so.1`__getaddrinfo+0x529:  movq   $0x20,%rdi
libc.so.1`__getaddrinfo+0x530:  call   -0x6e815 <PLT=libc.so.1`malloc>

How to fix the problem ? Upgrade to the next version of the CPU :-) (that likely has another set of bugs, that are potentially more dangerous) Another approach would be to add errata specific solution to the kernel code (the OS has bunch of these already). Something like: detect the buggy CPU revision. When a trap occurs, check if the first argument of movdqu matches the address reported by the trap by given offset and if yes suppress the trap. Could lead to false suppressions and would make the code less maintainable. In fact, looking at the x86 trap specific code (see the snippet above) right above the call to pr_is_watchpage() it has a bunch of lines for Workaround for AMD erratum 100, a different AMD CPU bug.
The v20zl identifier means this is a Sun Fire V20z machine (https://en.wikipedia.org/wiki/Sun_Fire).
Anandtech article from 2005 has some interesting details on how these systems were produced under Andy Bechtolsheim that just happened to return to Sun Microsystems:

Eighteen months ago, Andy's return to Sun marked the company's Renaissance of new ideas, strategies and philosophies. Everything about Sun is getting a new image, from company motto all the way down to the entry level server designs.
Mr. Bechtolsheim is personally responsible for the company's drive towards x64 and x86 processors; the V40z, V20z and W2000z were all children of his push for lower cost servers that relied on efficiency and innovation.
Given that Andy only had 6 months to get the initial Opteron servers and workstations online, there wasn't much time to develop everything from scratch, which is why the older Sun x64 workstations and servers are mostly based on Newisys designs and components.

And I actually found an image with Scott McNealy (the founder of Sun Microsystems) featured with a V20z machine.
These machines were EOL'ed in 2007 so in 2015 when the problem was identified these were a bit hard to find even internally. Reproducing the problem internally would mean getting a hold of a Sun Fire V20z machine in one the labs. Currently I found just 2 of these, and the one local to me is dead and does not have a responding Service Processor (SP). This is basically embedded computer inside the chasis that allows to remote control the server (including turning the power on/off, accessing the console, etc.). According to Jakub, our lab master, the problem with the SP on V20z is that sometimes it loses its IP configuration settings and one has to go into the lab and configure it physically, yes using the buttons on the chasis. In COVID times when everything is locked down this will have to wait. The problem can also be reproduced on Intel CPUs however I ran out of time.
So, how did the bug actually ended ? Well, it lingered in the initial state until 2021 and seeing that I just closed it as "Product/Platform Obsolete".
Is there a way how to run Netcat (or any program using getaddrinfo() to setup a TCP wildcard socket listener) with watchmalloc even on these buggy CPUs ? The dynamic loader ld.so.1 supports tuning hardware capabilities per process and per library (as described in the Exercising a Capability Family chapter of the venerable 'Libraries and linkers guide') using the LD_HWCAP environment variable howewer turning off SSE instructions does not seem to be supported anymore.