@Scherso
Last active November 8, 2023 02:01

Writing C Software without the Standard Library
FOR UN*X


There are many tutorials on the web that explain how to build a simple "Hello, World" in C without the use of libc on AMD64, but most of them stop there.

This guide aims to provide a more complete explanation that will allow you to build yourself a small framework for writing more complex programs. The code will support both AMD64 and i386.


Section I.

Writing a simple "Hello, World" program, and debugging it.

We will compile with -g for debug information and with -O0 (no optimization) so that we can see as much as possible in the debugger. The steps below show how to do this.

  • Firstly, run the following.
$ cat > hello.c << "EOF"
#include <stdio.h>

int main(int argc, char* argv[])
{
    printf("Hello, World\n");
    return 0;
}
EOF
  • To compile and run this program, we'll use the following commands.
$ gcc -O0 -g hello.c # After running, continue to the next line.
$ ./a.out
  • This outputs a simple "Hello, World", followed by a line feed in our console.

  • To debug this program, we'll use GNU's debugger, gdb, on the output file a.out.

$ gdb a.out
(gdb) break main
(gdb) run
(gdb) backtrace
  • This will output
#0  main (argc=1, argv=0x7fffffffda08) at hello.c:5

Although we retrieve some useful information from this, everything that runs before main is still hidden from us. We need to tell gdb that we want the backtrace to continue past main and past the program's entry point, into libc's startup functions.

$ gdb a.out
(gdb) break main
(gdb) run
(gdb) backtrace
(gdb) set backtrace past-main on
(gdb) set backtrace past-entry on
(gdb) bt
  • Our new output
#0  main (argc=1, argv=0x7fffffffda08) at hello.c:5
#1  0x00007ffff7df52ca in ?? () from /lib64/libc.so.6
#2  0x00007ffff7df5385 in __libc_start_main () from /lib64/libc.so.6
#3  0x0000555555555071 in _start ()

That is much better. We can see that the first function actually called is _start, which then calls __libc_start_main, clearly a standard library initialization function that invokes main.

You can take a look at _start and __libc_start_main in the glibc source if you're interested. They are not that interesting for us, as they set up the dynamic linker and other things that we will never use, since we want a static executable.

Let's try recompiling our "Hello, World" program with optimization flags this time (-O2), without debug information and with stripping (-s) to see how large it is.

$ gcc -s -O2 hello.c
$ wc -c a.out
6208 a.out

6 KiB for a simple Hello World? That's a lot.

Even if I add more size optimization flags, such as -Wl,--gc-sections -fno-unwind-tables -fno-asynchronous-unwind-tables -Os, it stays at 6 KiB.

We will now progressively strip this program down by first getting rid of the standard library, then learning how to invoke syscalls without the necessity of headers.

So how do we get rid of the standard library? If we try to compile our current code with -nostdlib, we will run into linker errors. So first, let's troubleshoot those linker errors.

$ gcc -s -O2 -nostdlib hello.c
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001020
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccAZZZwG.o: in function `main':
hello.c:(.text.startup+0xc): undefined reference to `puts'
collect2: error: ld returned 1 exit status

The linker is complaining about _start missing, which is what we would expect from our previous debugging.

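We also have a linker error on puts, which is to be expected since it is a libc function. (Our source calls printf, but gcc replaces a printf of a constant string ending in a newline with an equivalent puts call.) But how do we print "Hello, World" without puts?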

The Linux Kernel exposes a bunch of syscalls, which are functions that user-space programs can enter to interact with the Operating System. You can see a list of syscalls by running man syscalls, or you can visit man7's syscalls webpage.

So, how do we find out which syscall puts uses? We can either look through the syscall list, or simply install strace to trace syscalls and write a simple program that uses puts.

The strace method is extremely useful to us. If you don't know how to do something with syscalls, do it with libc, then strace it to find out which syscalls it uses on the target architecture.

Let's try this out.

  • Our simple program which uses puts from stdio.h.
#include <stdio.h>

int main(int argc, char* argv[])
{
    puts("Hello, World");
    return 0;
}
  • Using strace to decipher the syscall we want.
$ gcc puts.c
$ strace ./a.out > /dev/null
write(1, "Hello, World\n", 13)          = 13
exit_group(0)                           = ?
+++ exited with 0 +++

Note that stdout is redirected to /dev/null here. That's because strace writes its own output to stderr, and we don't want it mixed with a.out's output. The trace above is abridged to the lines that matter to us.

So we can derive from this that puts uses the write syscall.

Let's check the manpage for write.

$ man 2 write
NAME

       write - write to a file descriptor
       
SYNOPSIS

       #include <unistd.h>

       ssize_t write(int fd, const void *buf, size_t count);

DESCRIPTION

       write() writes up to count bytes from the buffer starting at buf
       to the file referred to by the file descriptor fd.

In Linux, there are three standard file descriptors:

  • stdin Used to pipe data into the program or read user input.
  • stdout Used to output information.
  • stderr Used as an alternate output for error messages.

If we read man stdout, we see that these are simply defined as 0, 1, and 2.

So all we have to do is replace our puts() with a write() to file descriptor 1, which is stdout.

So let's try that.

#include <unistd.h>

int main(int argc, char* argv[])
{
    write(1, "Hello, World\n", 13);
    return 0;
}

Let's try to compile that again.

$ gcc -s -O2 -nostdlib hello.c
hello.c: In function 'main':
hello.c:5:5: warning: ignoring return value of 'write' declared with attribute 'warn_unused_result' [-Wunused-result]
    5 |     write(1, "Hello, World\n", 13);
      |     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001020
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccqJWxSf.o: in function `main':
hello.c:(.text.startup+0x16): undefined reference to `write'
collect2: error: ld returned 1 exit status

It seems that write() is also a part of the standard library. How do we invoke syscalls without having to link the standard library?


Section II.

Using AMD64 Calling Conventions, ASM, Putting things together.

Let's take a look at section A.2.1 Calling Conventions in the AMD64 ABI Specification.

If you're completely clueless about assembly, you should still be able to understand once you see an example.

  1. User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.

  2. A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.

  3. The number of the syscall has to be passed in register %rax.

  4. System-calls are limited to six arguments, no argument is passed directly on the stack.

  5. Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.

  6. Only values of class INTEGER or class MEMORY are passed to the kernel.

System V Application Binary Interface, Appendix A § 2.1, Calling Conventions.
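
Point 5 of the list above can be sketched in C as follows. This is only an illustration: the helper names syscall_failed and syscall_errno are hypothetical and are not used anywhere else in this guide. The only fact taken from the specification is the -4095 to -1 error range.

```c
/* Hypothetical helpers: classify the raw value a syscall leaves in %rax. */

/* Returns 1 when the raw return value lies in the error range -4095..-1. */
int syscall_failed(long raw_result)
{
    return raw_result >= -4095 && raw_result <= -1;
}

/* Returns the positive errno value for a failed call, or 0 on success. */
int syscall_errno(long raw_result)
{
    return syscall_failed(raw_result) ? (int) -raw_result : 0;
}
```

For example, a raw result of -9 would be reported as errno 9, while a raw result of 13 (such as the byte count returned by our write) is a success.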

In simpler terms, all we need to do is write an assembly wrapper that will

  • Take the syscall number followed by either pointers or integers as parameters.
  • Set %rax to the syscall number.
  • Set %rdi, %rsi, %rdx, %r10, %r8, and %r9 to the parameters. Calls that take fewer than six arguments will ignore the excess ones.
  • Execute syscall.
  • Return the content of %rax.

If we read section 3.4 of the specification or the quick cheatsheet on osdev.org, we will see that on AMD64, the registers used to pass parameters to regular functions are almost the same as the syscalls, except for %r10 which is replaced with %rcx. The return register is also the same (%rax).

This means that our syscall wrapper will only be able to accept and forward a maximum of five parameters, because the first of the six register parameters is already used to pass the syscall number.

We could use the stack to take a seventh argument, but let's not make our lives more complicated when we don't yet need to call syscalls with more than five parameters.

The Application Binary Interface also states that:

Registers %rbp, %rbx, and %r12 through %r15 “belong” to the calling function and the called function is required to preserve their values. In other words, a called function must preserve these registers’ values for its caller. Remaining registers “belong” to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame.

This means that we don't have to worry about saving and restoring the values of %rdi, %rsi, %rdx, %r10, %r8, and %r9 inside our syscall wrapper, nor %rcx and %r11, which the kernel destroys. It's up to the caller to save them, and gcc will take care of that because we are calling from C code.

By putting this all together, it will become our syscall wrapper.

mov    %rdi, %rax /* %rax (syscall number)  = func param 1 (%rdi)    */
mov    %rsi, %rdi /* %rdi (syscall param 1) = func param 2 (%rsi)    */
mov    %rdx, %rsi /* %rsi (syscall param 2) = func param 3 (%rdx)    */
mov    %rcx, %rdx /* %rdx (syscall param 3) = func param 4 (%rcx)    */
mov    %r8,  %r10 /* %r10 (syscall param 4) = func param 5 (%r8)     */
mov    %r9,  %r8  /* %r8  (syscall param 5) = func param 6 (%r9)     */
syscall           /* Enter a syscall (return value in %rax)          */
ret               /* Return value is already in %rax, we can return. */

How do we embed our arbitrary assembly into our program, though? One way is via the gcc inline assembler. However, the syntax is ugly.
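
For comparison, a minimal sketch of the inline-assembler route might look like the following. It is not the approach this guide takes: the function name raw_syscall3 is hypothetical, it forwards only three parameters, and it assumes gcc or clang targeting AMD64 Linux.

```c
/*
 * Hypothetical inline-assembly syscall wrapper (AMD64 Linux only).
 * The constraint letters pin each value to a register:
 *   "a" = %rax, "D" = %rdi, "S" = %rsi, "d" = %rdx.
 */
long raw_syscall3(long number, long arg1, long arg2, long arg3)
{
    long result;

    __asm__ volatile (
        "syscall"
        : "=a"(result)                     /* return value comes back in %rax */
        : "a"(number), "D"(arg1), "S"(arg2), "d"(arg3)
        : "rcx", "r11", "memory"           /* the kernel destroys %rcx and %r11 */
    );

    return result;
}
```

With this sketch, raw_syscall3(1, 1, (long) "Hello, World\n", 13) would issue the same write call that strace showed earlier. The constraint and clobber lists are exactly the part that many people find hard to read.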

We're going to write a .S file for the GNU Assembler, and compile and link it to our hello.c program with gcc.

.global syscall5 /* Exporting syscall5 to other compilation units. */
.text            /* Switch to the .text section, which holds executable code. */

syscall5:
        mov    %rdi, %rax
        mov    %rsi, %rdi
        mov    %rdx, %rsi
        mov    %rcx, %rdx
        mov    %r8,  %r10
        mov    %r9,  %r8
        syscall
        ret

To find any syscall numbers, refer to filippo.io/linux-syscall-table/.

Additionally, you can simply have the C preprocessor print it for you.

$ printf "#include <sys/syscall.h>\n SYS_write" | gcc -E - | sed "/^#.*/d"
1
  • -E Runs the preprocessor on the file, expanding all macros and therefore replacing #define constants with their corresponding value.
  • - Means that we use stdin as input, which we pipe here with printf.
  • We simply use sed to remove lines we don't want, I would assume you know what sed is.
  • Optionally, you can use the -m32 flag for 32-bit calls.

Syscall numbers are usually prefixed by SYS_.

Back to our prototype from earlier,

ssize_t write(int fd, const void *buf, size_t count);
  • ssize_t and size_t are types defined by unistd.h. A quick inspection of the header reveals that they are 64-bit integers, and that the extra s in ssize_t means it is a signed value.
$ printf "#include <unistd.h>" | gcc -E - | grep size_t
typedef long int __blksize_t;
typedef long int __ssize_t;
typedef __ssize_t ssize_t;
typedef long unsigned int size_t;

If we try with the -m32 flag, we see that they become 32-bit integers. This means that ssize_t and size_t are the same size as the architecture's pointers.
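
This assumption can be checked at compile time. The sketch below assumes a C11 compiler (for _Static_assert) and uses the typedef names uintptr and intptr, which are the stand-ins for size_t and ssize_t that this guide's C code defines for itself.

```c
typedef unsigned long int uintptr;  /* stands in for size_t  */
typedef long int          intptr;   /* stands in for ssize_t */

/* Compilation fails if either typedef is not pointer-sized on the target. */
_Static_assert(sizeof(uintptr) == sizeof(void *), "uintptr must be pointer-sized");
_Static_assert(sizeof(intptr)  == sizeof(void *), "intptr must be pointer-sized");
```

On Linux these checks should pass both with and without -m32, since long follows the pointer size there. Other platforms may differ, which is exactly what the assertions would catch.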

We can now import syscall5 from hello.s into our hello.c program and write a write function that calls it, as demonstrated below.

void* syscall5(
        void* number,
        void* arg1,
        void* arg2,
        void* arg3,
        void* arg4,
        void* arg5
);

typedef unsigned long int uintptr;  /* size_t */
typedef long int intptr;            /* ssize_t */

static intptr write(int fd, void const* data, uintptr nbytes)
{
    return (intptr)
    syscall5(
        (void*) 1,            /* SYS_write, call number 1 */
        (void*) (intptr) fd,
        (void*) data,
        (void*) nbytes,
        0,                    /* Ignored */
        0                     /* Ignored */
    );
}

int main(int argc, char* argv[])
{
    write(1, "Hello, World\n", 13);
    return 0;
}

See that (void*)(intptr) double cast on fd? Since fd is 32-bit and void* is 64-bit, casting it directly would give us a warning that we are converting an integer to a pointer of a different size, so we first widen it explicitly with the intptr cast.

This should be done every time you cast to and from pointers when the destination type is not guaranteed to be the same size as pointers. Especially when targeting multiple architectures.

Note how we cast the const qualifier away from data to avoid a warning.
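
The double cast described above can be isolated into a pair of small helpers. This is only a sketch: the names int_to_pointer and pointer_to_int are hypothetical, and intptr is the same pointer-sized typedef used in hello.c.

```c
typedef long int intptr;  /* pointer-sized signed integer, as in hello.c */

/* Hypothetical helpers that isolate the double cast. */

void *int_to_pointer(int value)
{
    /* Widen to a pointer-sized integer first, then convert to a pointer. */
    return (void *) (intptr) value;
}

int pointer_to_int(void *pointer)
{
    /* Go back in the opposite order: pointer, pointer-sized integer, int. */
    return (int) (intptr) pointer;
}
```

Because intptr is signed, a negative value such as -1 is sign-extended on the way in and survives the round trip back to int.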

Back to the AMD64 ABI documentation. In figure 3.11, we can see the initial state of the stack.

argc is a non-negative argument count

argv is an array of argument strings, with argv[argc] == 0

Figure 3.11: Initial Process Stack

Purpose | Start Address | Length
Unspecified | High Address |
Information block, including argument strings, environment strings, auxiliary information ... | | varies
Unspecified | |
Null auxiliary vector entry | | 1 eightbyte
Auxiliary vector entries ... | | 2 eightbytes each
0 | | eightbyte
Environment pointers ... | | 1 eightbyte each
0 | 8 + 8 * argc + %rsp | eightbyte
Argument pointers | 8 + %rsp | argc eightbytes
Argument count | %rsp | eightbyte
Undefined | Low Address |
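
Read from C, the lower part of this layout could be decoded roughly as follows. This is a sketch under assumptions: the struct and function names are hypothetical, and it treats every stack slot as one pointer-sized value, which matches the "eightbyte" entries on AMD64.

```c
/* Hypothetical names, for illustration only. */
struct initial_stack {
    long   argc;  /* argument count, stored at %rsp             */
    char **argv;  /* argument pointers, starting at 8 + %rsp    */
    char **envp;  /* environment pointers, after argv's 0 entry */
};

/* stack_pointer is the value %rsp holds at process entry. */
struct initial_stack read_initial_stack(char **stack_pointer)
{
    struct initial_stack stack;

    stack.argc = (long) stack_pointer[0];     /* first slot holds the count   */
    stack.argv = stack_pointer + 1;           /* next argc slots are argv     */
    stack.envp = stack.argv + stack.argc + 1; /* skip argv and its terminator */

    return stack;
}
```

The auxiliary vector would follow the 0 entry that terminates envp, but it is not needed for what we are doing here.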

Although we don't care about this much, right beneath this figure, we have the initial state of the registers, which is very important to us.

%rbp The content of this register is unspecified at process initialization time, but the user code should mark the deepest stack frame by setting the frame pointer to zero.

%rsp The stack pointer holds the address of the byte with lowest address which is part of the stack. It is guaranteed to be 16-byte aligned at process entry.

%rdx A function pointer that the application should register with atexit (BA_OS).

So now we know that %rbp must be zeroed, and that %rsp points to the top of the stack. We don't need to worry about %rdx.

If you don't understand how the stack works, it's just a chunk of memory where data is appended to the end and later retrieved from the end. This is done through a push and a pop.

In AMD64's convention, we're actually prepending and removing data at the beginning of the memory sequence, since the stack is said to "grow downwards", which means that when we push something onto the stack, the stack pointer gets lower.

The ABI expects the stack pointer to be 16-byte aligned whenever we call a function, so we must remember to always push data whose total size is a multiple of 16. For example, two 64-bit integers are 16 bytes. It's often necessary to either push useless data or simply align the stack pointer when the pushed values don't happen to be aligned.
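
Aligning the stack pointer boils down to clearing the low four bits of an address. A minimal C sketch of that arithmetic, with the hypothetical function name align_down_16 and the uintptr typedef this guide uses for pointer-sized unsigned integers:

```c
typedef unsigned long int uintptr;  /* pointer-sized unsigned integer */

/* Hypothetical helper: round an address down to a multiple of 16. */
uintptr align_down_16(uintptr address)
{
    /*
     * Converting -16 to an unsigned type yields a mask with every bit set
     * except the low four, so the AND clears exactly those four bits.
     * For example, 0x7ffc1238 becomes 0x7ffc1230.
     */
    return address & (uintptr) -16;
}
```

Rounding down is the safe direction for a stack that grows downwards: the aligned pointer never moves back over data that is already on the stack.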

To put it all together, our _start function needs to do the following.

  • Zero %rbp.
  • Put argc into %rdi (first parameter for main).
  • Put the stack address of argv[0] into %rsi (second parameter for main), which will be interpreted as an array of char pointers.
  • Align the stack to 16-bytes.
  • Call main.

So, let's do that.

  • Our new hello.s should look something like this.
.global _start, syscall5   /* Exporting _start and syscall5 to other compilation units. */
.text                      /* Switch to the .text section, which holds executable code. */

_start:
        xor    %rbp, %rbp       /* XOR-ing a register with itself sets it to 0. */
        pop    %rdi             /* %rdi = argc, adds 8 to %rsp as well. */
        mov    %rsp, %rsi       /* %rsi = argv, the rest of the stack read as an array of char pointers. */

        /**
         * Zero the last four bits of %rsp, aligning it to 16 bytes. This is
         * the same as AND-ing %rsp with 0xFFFFFFFFFFFFFFF0, because in two's
         * complement -16 is represented as 2^64 - 16.
         */
        and    $-16, %rsp
        call   main
        ret
     
syscall5:
        mov    %rdi, %rax
        mov    %rsi, %rdi
        mov    %rdx, %rsi
        mov    %rcx, %rdx
        mov    %r8,  %r10
        mov    %r9,  %r8
        syscall
        ret

Unfortunately, upon exit of this program, it throws a segmentation fault.

$ gcc -s -O2 -nostdlib hello.s hello.c
$ ./a.out
Hello, World
Segmentation fault

But why?

When we execute a call instruction, the return address (the address of the instruction to jump to after the function returns) is implicitly pushed onto the stack, and the ret instruction implicitly pops it and jumps to it.

The _start procedure is special: the kernel jumps to it directly instead of calling it, so no return address was ever pushed for it. This is our issue. The ret instruction in _start pops whatever happens to be on top of the stack and jumps to it, and that value is not the address of valid code in our program, which triggers an access violation.

We need to tell the OS to kill our process so that we never reach the ret in _start. The _exit() syscall is just what we need:

  • First, let's look at its man page.
$ man 2 _exit
NAME

       _exit, _Exit - terminate the calling process
       
SYNOPSIS

       #include <unistd.h>

       noreturn void _exit(int status);

       #include <stdlib.h>

       noreturn void _Exit(int status);
  • Now, let's use the preprocessor to locate the syscall number.
$ printf "#include <sys/syscall.h>\n SYS_exit" | gcc -E - | sed "/^#.*/d"
60

The status code will simply be the return value of main, which is stored in %rax as we know.

With this information, let's write a new hello.s.

.global _start, syscall5         /* Exporting _start and syscall5 to other compilation units. */
.text                            /* Switch to the .text section, which holds executable code. */

_start:
        xor    %rbp,  %rbp       /* XOR-ing a register with itself sets it to 0. */
        pop    %rdi              /* %rdi = argc, adds 8 to %rsp as well. */
        mov    %rsp,  %rsi       /* %rsi = argv, the rest of the stack read as an array of char pointers. */

        /**
         * Zero the last four bits of %rsp, aligning it to 16 bytes. This is
         * the same as AND-ing %rsp with 0xFFFFFFFFFFFFFFF0, because in two's
         * complement -16 is represented as 2^64 - 16.
         */
        and    $-16,  %rsp       /* Decimal -16 reads better than the hex mask. */
        call   main

        /**
         * Our new syscall to SYS_exit.
         */
        mov    %rax,  %rdi       /* syscall param 1 = %rax (ret value of main) */
        mov    $0x3C, %rax       /* 0x3C -> 60 in decimal, syscall for SYS_exit. */
        syscall

        ret                      /* This should now never be reached. */
     
syscall5:
        mov    %rdi,  %rax
        mov    %rsi,  %rdi
        mov    %rdx,  %rsi
        mov    %rcx,  %rdx
        mov    %r8,   %r10
        mov    %r9,   %r8
        syscall
        ret

Our program seems to finally terminate correctly!

$ gcc -s -O2 -nostdlib hello.s hello.c 
$ ./a.out
Hello, World

We can shrink our executable further by removing the unwind tables, which we don't need. We can do this by running the following.

$ gcc -s -O2 -nostdlib -fno-unwind-tables -fno-asynchronous-unwind-tables hello.s hello.c