The life of an XNU unix syscall on amd64

XNU syscall path

Chart

             +------------------+
             |These push their  |                                  +-----------------------+
             |respective syscall|                                  |This overwrites the    |
             |dispatch functions|                                  |saved dispatch function|
             |onto the stack    |                                  |with hndl_alltraps     |
             +--------+---------+                                  +-----------+-----------+
                      |                                                        |
                      v                                                        v

+--------+    +----------------+            +-------------------+    +--------------------+
|int 0x80|--->|idt64_unix_scall|-+-+-+----->|L_32bit_entry_check|--->|L_64bit_entry_reject|-+
+--------+    +----------------+ ^ ^ ^      +-------------------+    +--------------------+ |
                                 | | |                                                      |
+--------+    +----------------+ | | |   +--------------------------------------------------+
|int 0x81|--->|idt64_mach_scall|-+ | |   |
+--------+    +----------------+   | |   |  +--------------+  +----------------+  +-----------------+
                                   | |   +->|L_dispatch_U64|->|L_dispatch_64bit|->|L_common_dispatch|-+
+--------+    +----------------+   | |      +--------------+  +----------------+  +-----------------+ |
|int 0x82|--->|idt64_mdep_scall|---+ |                                                                |
+--------+    +----------------+     |   +------------------------------------------------------------+
                                     |   |
+--------+    +-------------+        |   |  +-------------+   +---------+
|sysenter|--->|hi64_sysenter|--------+   +->|hndl_alltraps|-->|user_trap|
+--------+    +-------------+               +-------------+   +---------+


+--------+    +-------------+    +--------------+    +----------------+    +-----------------+
|syscall |--->|hi64_syscall |--->|L_dispatch_U64|--->|L_dispatch_64bit|--->|L_common_dispatch|-+
+--------+    +-------------+    +--------------+    +----------------+    +-----------------+ |
                                                                                               |
                                                                       +-----------------------+
                                                                       |
                                                                 +-----v------+
                                                                 |hndl_syscall|
                                                                 +-+--+--+--+-+
                                                                   |  |  |  |
                       +-------------------------------------------+  |  |  |
                       |                                              |  |  |
                       |                      +-----------------------+  |  +---------------+
                       |                      |                          |                  |
               +-------v---------+    +-------v---------+    +-----------v-----+    +-------v---------+
               |hndl_unix_scall64|    |hndl_mach_scall64|    |hndl_mdep_scall64|    |hndl_diag_scall64|
               +-------+---------+    +-------+---------+    +-------+---------+    +-------+---------+
                       |                      |                      |                      |
                       |                      |                      |                      |
               +-------v------+       +-------v----------+   +-------v---------+       +----v-----+
               |unix_syscall64|       |mach_call_munger64|   |machdep_syscall64|       |diagCall64|
               +--------------+       +------------------+   +-----------------+       +----------+

The question

A while ago, when I started auditing xnu syscalls, I noticed something kind of funny and wanted to track it down. To preface, everything here is specific to xnu 10.11.2 on amd64, though it may apply to other architectures. Additionally, my kernel debugger is currently broken (thanks VMware!), so take this with a grain of salt as it has not been verified. Please let me know if anything is mistaken or unclear.

Let's use the exit() syscall as an example. Exit is defined in xnu/bsd/kern/kern_exit.c as:

void exit(proc_t p, struct exit_args *uap, int *retval)

Where p is the process executing the syscall, uap is a pointer to a struct containing the user args, and retval is a pointer that will contain the result of the syscall. However, this seems kind of odd -- OS X uses the SystemV ABI everywhere, including for syscalls, which means the syscall arguments are passed in registers (rdi, rsi, rdx, r10, r8, r9; r10 stands in for rcx because the syscall instruction clobbers rcx) with the syscall number in rax. This raises an obvious question: where do these values get moved from registers to memory, and where is that memory located (userspace vs kernelspace)?

Background on syscall numbers

Starting in xnu/osfmk/x86_64/idt64.s we find the interrupt and subsequent syscall handling code. Specifically, we find something kind of interesting: xnu is well known for having two "types" of syscalls: traditional unix syscalls and mach traps. Going back to the old nemo articles, we see him discuss three types of syscalls: mach traps (negative syscall no.), unix syscalls (positive syscall no. under 0x6000), and PPC syscalls (positive syscall no. over 0x6000) [uninformed 4.3]. Today the layout is conceptually the same, but with more types of syscalls and with different constants. The syscall number is stored in rax, with the syscall class in bits 24-31 and the per-class number in the low 24 bits. The following defines how syscalls are dispatched based on the value in rax:

  • Mach Traps: ((rax >> 24) & 0xff) == 0x01
  • Unix Syscall: ((rax >> 24) & 0xff) == 0x02
  • Machine Dependent: ((rax >> 24) & 0xff) == 0x03
  • Diagnostics: ((rax >> 24) & 0xff) == 0x04
  • Mach IPC (unused?): ((rax >> 24) & 0xff) == 0x05

For example: if we wanted unix syscall 1 (exit()), rax would need to be equal to 0x02 << 24 | 1, or 0x2000001. If we wanted mach trap 31 (mach_msg()), rax would need to be 0x100001f.
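The arithmetic in that example can be written out as a minimal C sketch. The class values match the list above, but the macro and function names here are my own for illustration; they are not the identifiers from syscall_sw.h.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; only the numeric values come from the text above. */
#define CLASS_SHIFT  24
#define CLASS_MASK   (0xFFu << CLASS_SHIFT)   /* bits 24-31: syscall class  */
#define NUMBER_MASK  (~CLASS_MASK)            /* bits 0-23: per-class number */

enum syscall_class {
    CLASS_MACH = 1,
    CLASS_UNIX = 2,
    CLASS_MDEP = 3,
    CLASS_DIAG = 4,
    CLASS_IPC  = 5,
};

/* Combine a class and a per-class number into the value expected in rax. */
static uint32_t make_syscall_number(enum syscall_class cls, uint32_t number)
{
    assert((number & CLASS_MASK) == 0);  /* number must fit in the low 24 bits */
    return ((uint32_t)cls << CLASS_SHIFT) | (number & NUMBER_MASK);
}

/* Split a value from rax back into its two parts. */
static enum syscall_class class_of(uint32_t rax)
{
    return (enum syscall_class)((rax & CLASS_MASK) >> CLASS_SHIFT);
}

static uint32_t number_of(uint32_t rax)
{
    return rax & NUMBER_MASK;
}
```

With these helpers, make_syscall_number(CLASS_UNIX, 1) gives 0x2000001 and make_syscall_number(CLASS_MACH, 31) gives 0x100001f, matching the two examples above.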

These come from a combination of the constants defined in xnu/osfmk/mach/i386/syscall_sw.h and hndl_syscall from xnu/osfmk/x86_64/idt64.s. Reading hndl_syscall will explain why when shellcoding for xnu you must add 0x2000000 to your syscall numbers -- otherwise they won't be appropriately dispatched to the right handler.
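To make the dispatch concrete, here is a rough C model of the decision hndl_syscall makes. The real thing is assembly in idt64.s; this only mirrors the class-to-handler mapping from the chart. The enum and the function name are hypothetical, and what the kernel actually does with an unrecognised class is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/* Handlers reachable from hndl_syscall according to the chart.
 * The enum itself is hypothetical; only the handler names come from xnu. */
enum handler {
    HANDLER_NONE = 0,      /* class not recognised in this sketch */
    HNDL_MACH_SCALL64,
    HNDL_UNIX_SCALL64,
    HNDL_MDEP_SCALL64,
    HNDL_DIAG_SCALL64,
};

/* Pick a handler from the class stored in bits 24-31 of rax. */
static enum handler pick_handler(uint64_t rax)
{
    switch ((rax >> 24) & 0xFF) {
    case 1:  return HNDL_MACH_SCALL64;
    case 2:  return HNDL_UNIX_SCALL64;
    case 3:  return HNDL_MDEP_SCALL64;
    case 4:  return HNDL_DIAG_SCALL64;
    default: return HANDLER_NONE;
    }
}

/* Usage: exit() with and without the 0x2000000 class prefix. */
static void pick_handler_examples(void)
{
    assert(pick_handler(0x2000001) == HNDL_UNIX_SCALL64);
    assert(pick_handler(0x0000001) == HANDLER_NONE);  /* bare "1" has class 0 */
}
```

The second assertion is the shellcoding gotcha: a bare syscall number has class 0, so in this model it never reaches the unix handler.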

Tracing execution flow

To figure out where these values are moved from registers to memory, we're going to trace execution from userland to the respective kernel function. There are three common ways (coming from a linux background) to transition from usermode to kernelmode:

  1. int 0x80
  2. sysenter
  3. syscall

int 0x80

We'll start at the definition of the interrupt handler in xnu/osfmk/x86_64/idt_table.h. Here we see a few interesting things:

USER_TRAP_SPC(0x80, idt64_unix_scall)
USER_TRAP_SPC(0x81, idt64_mach_scall)
USER_TRAP_SPC(0x82, idt64_mdep_scall)

This is kinda cool -- we can directly jump into the dispatch functions by varying our interrupt number. Traditionally on x86 machines, int 0x80 was the syscall interrupt. However, this indicates we can actually call into the kernel from usermode with any of these three interrupts (as long as we want the appropriate type of call). In fact, we must use the correct interrupt number when attempting to call into the kernel in this fashion (e.g. a unix syscall with int 0x81 will fail).

sysenter

According to OSDev, sysenter is not an interrupt; rather, it's an instruction which transitions us from userspace to kernelspace. Specifically, when the sysenter instruction is executed, the value of rip is loaded from a model specific register (MSR), amongst other things. A bit of grepping leads us to osfmk/i386/mp_desc.c:

   wrmsr64(MSR_IA32_SYSENTER_EIP, (uintptr_t)hi64_sysenter);

This means when sysenter executes, the value of rip is set to hi64_sysenter, which is defined in xnu/osfmk/x86_64/idt64.s.

Interestingly, neither int 0x80 nor sysenter will work from a 64bit process on amd64. If we trace the code out, we always end up in the 32bit entry check, which rejects 64bit callers: the saved dispatch function is overwritten and we end up in hndl_alltraps, which calls user_trap from xnu/osfmk/i386/trap.c. This does not link us to any of the syscall dispatching that we need, and thus will not execute our system calls. As far as I can tell, from a 64bit binary you must enter the kernel through syscall.

syscall

syscall is very similar to sysenter, only with a different MSR. Again in osfmk/i386/mp_desc.c we find the relevant code:

   wrmsr64(MSR_IA32_LSTAR, (uintptr_t)hi64_syscall);

From this we can take away that when syscall executes, rip will be set to hi64_syscall, which is another function defined in our old friend xnu/osfmk/x86_64/idt64.s. From here, we'll see that we're loading hndl_syscall onto the stack, at the offset ISF64_TRAPFN (it's a macro which corresponds to a structure offset).

leaq	HNDL_SYSCALL(%rip), %r11;
movq	%r11, ISF64_TRAPFN(%rsp)

From here we branch to L_dispatch_U64, where rsp gets copied to r15, and then into L_dispatch_64bit, which saves our user register state to the kernel stack. This means r15 is a pointer to an x86_saved_state_t, which is defined as an x86_saved_state64 (in xnu/osfmk/mach/i386/thread_status.h). We then load the earlier saved value from ISF64_TRAPFN (which was hndl_syscall) into rdx and jump to L_common_dispatch, which finally calls the function stored in rdx.
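The "stash a handler in the frame now, call it from a shared tail later" pattern is easier to see in C than in assembly. The following is only a toy model: every struct and function name is made up (prefixed with mock_), and the real code keeps the handler at the ISF64_TRAPFN offset of the interrupt stack frame rather than in a struct field like this.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-in for the saved user state. */
struct mock_saved_state {
    uint64_t rax;                                /* syscall number from userland */
    void (*trapfn)(struct mock_saved_state *);   /* plays the role of ISF64_TRAPFN */
    int reached_syscall_handler;                 /* set by the mock handler */
};

/* Stand-in for hndl_syscall: just records that it ran. */
static void mock_hndl_syscall(struct mock_saved_state *state)
{
    state->reached_syscall_handler = 1;
}

/* Stand-in for hi64_syscall: record which handler should run later. */
static void mock_hi64_syscall(struct mock_saved_state *state)
{
    state->trapfn = mock_hndl_syscall;
}

/* Stand-in for L_common_dispatch: every entry path ends up here and
 * calls whatever handler the entry stub recorded. */
static void mock_common_dispatch(struct mock_saved_state *state)
{
    assert(state->trapfn != 0);
    state->trapfn(state);
}
```

The point of the design is that each entry stub (int 0x80, sysenter, syscall) only has to record its own handler, while saving the register state and the final call are shared.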

Following the unix syscall path in hndl_syscall we jump to hndl_unix_scall64 which in turn calls unix_syscall64 with a single argument of r15 (still our saved state). This function is defined in xnu/bsd/dev/i386/systemcalls.c. From here, it's easiest to just snip the relevant code to our question:

  thread = current_thread();
  uthread = get_bsdthread_info(thread);
  // regs is derived from r15 ...
  code = regs->rax & SYSCALL_NUMBER_MASK;
  callp = (code >= NUM_SYSENT) ? &sysent[63] : &sysent[code];
  // ...
  vt = (void *)uthread->uu_arg;
  // ...
  memcpy(vt, args_start_at_rdi ? &regs->rdi : &regs->rsi,
        args_in_regs * sizeof(syscall_arg_t));
  // ...
  error = (*(callp->sy_call))((void *)p, vt, &(uthread->uu_rval[0]));

To briefly explain this code: first, we get the current thread struct and its BSD uthread. Second, we get the system call entry out of the syscall table (sysent); this includes the number of arguments the syscall expects, as well as the function pointer (sy_call). Third, we get a chunk of memory out of the uthread (uu_arg). Finally, we copy the arguments from the saved register state into that memory in the kernel's thread struct.
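Here is a small userland model of that copy, as a sketch only. The mock_ structs are hypothetical and reduced to what the example needs. The register order (rdi, rsi, rdx, r10, r8, r9) is my assumption about how the saved state lays the argument registers out contiguously, and the handler is a stand-in with the same three-argument shape as a sy_call function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t syscall_arg_t;

/* Hypothetical saved state: argument registers contiguous, in argument order. */
struct mock_saved_state {
    uint64_t rdi, rsi, rdx, r10, r8, r9;
    uint64_t rax;
};

/* Hypothetical per-thread struct holding the argument scratch area. */
struct mock_uthread {
    syscall_arg_t uu_arg[8];
    int uu_rval[2];
};

/* Copy the first nargs register arguments into the thread's scratch area. */
static void copy_args(const struct mock_saved_state *regs,
                      struct mock_uthread *ut, unsigned nargs)
{
    const unsigned char *src =
        (const unsigned char *)regs + offsetof(struct mock_saved_state, rdi);

    assert(nargs <= 6);  /* only six arguments travel in registers */
    memcpy(ut->uu_arg, src, nargs * sizeof(syscall_arg_t));
}

/* Stand-in syscall function: reads its first argument through uap. */
static int mock_handler(void *p, void *uap, int *retval)
{
    const syscall_arg_t *args = uap;

    (void)p;  /* the process pointer is unused in this sketch */
    *retval = (int)args[0];
    return 0;
}
```

After copy_args(&regs, &ut, 1), calling mock_handler(NULL, ut.uu_arg, &ut.uu_rval[0]) sees the original rdi value through uap, which is the same shape as the sy_call invocation in the snippet above.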

This pretty much solves our original mystery: the syscall entry code saves all the registers onto the kernel stack, and the argument registers from that saved state are in turn copied into the thread's struct. The address of that memory inside the thread struct is passed to our syscall function as uap, which uses it to reference all of its arguments. Both copies live in kernelspace.

Notes

As we've listed quite a few functions, below is a sequential list of every function or label a standard unix syscall should hit in xnu, between the syscall instruction and the start of the syscall function:

hi64_syscall
L_dispatch_U64
L_dispatch_64bit
L_common_dispatch
hndl_syscall // rdx, pushed in hi64_syscall
hndl_unix_scall64
unix_syscall64
error = (*(callp->sy_call))((void *)p, vt, &(uthread->uu_rval[0])); // now we're there