There are many tutorials on the web that explain how to build a simple "Hello, World" in C without libc on AMD64, but most of them stop there.
This guide hopes to provide a more complete explanation that will allow you to build yourself a small framework for writing more complex programs. The code will support both AMD64 and i386.
We will compile with the flag -g for debug information, as well as with no optimization (-O0), to be able to see as much as possible in the debugger. Follow the next steps to see how to do this.
- Firstly, run the following.
$ cat > hello.c << "EOF"
#include <stdio.h>

int main(int argc, char* argv[])
{
    printf("Hello, World\n");
    return 0;
}
EOF
- To compile and run this program, we'll run the following commands.
$ gcc -O0 -g hello.c # After running, continue to the next line.
$ ./a.out
- This outputs a simple "Hello, World", followed by a line feed, in our console.
- To debug this program, we'll use GNU's debugger, gdb, on the output file a.out.
$ gdb a.out
(gdb) break main
(gdb) run
(gdb) backtrace
- This will output
#0 main (argc=1, argv=0x7fffffffda08) at hello.c:5
Although we retrieve some useful information from this, everything that happens before main is still hidden from us. We need to tell gdb that we want the backtrace to continue past main and past the entry point.
$ gdb a.out
(gdb) break main
(gdb) run
(gdb) backtrace
(gdb) set backtrace past-main on
(gdb) set backtrace past-entry on
(gdb) bt
- Our new output
#0 main (argc=1, argv=0x7fffffffda08) at hello.c:5
#1 0x00007ffff7df52ca in ?? () from /lib64/libc.so.6
#2 0x00007ffff7df5385 in __libc_start_main () from /lib64/libc.so.6
#3 0x0000555555555071 in _start ()
That is definitely much better. As we can see, the first function that's actually called is _start, which then calls __libc_start_main, which is clearly a standard library initialization function that invokes main.
You can take a look at _start and __libc_start_main in the glibc source if you're interested. It's not that interesting for us, as it sets up the dynamic linker and other things that we will never use, since we want a static executable.
Let's try recompiling our "Hello, World" program with optimization flags this time (-O2), without debug information and with stripping (-s), to see how large it is.
$ gcc -s -O2 hello.c
$ wc -c a.out
6208 a.out
6 KiB for a simple Hello World? That's a lot.
Even if I add more size optimization flags, such as -Wl,--gc-sections -fno-unwind-tables -fno-asynchronous-unwind-tables -Os, it stays at about 6 KiB.
We will now progressively strip this program down by first getting rid of the standard library, then learning how to invoke syscalls without the necessity of headers.
So how do we get rid of the standard library? Of course, if we try to compile our current code with -nostdlib, we will run into linker errors. So first, let's troubleshoot our linker errors.
$ gcc -s -O2 -nostdlib hello.c
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001020
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccAZZZwG.o: in function `main':
hello.c:(.text.startup+0xc): undefined reference to `puts'
collect2: error: ld returned 1 exit status
The linker is complaining about _start missing, which is what we would expect from our previous debugging.
We also have a linker error on puts, which is to be expected since it is a function provided by libc. (gcc replaced our printf call with puts because the format string is a plain string ending in a newline.) But how do we print "Hello, World" without puts?
The Linux kernel exposes a bunch of syscalls, which are functions that user-space programs can enter to interact with the operating system. You can see a list of syscalls by running man syscalls, or you can visit man7's syscalls webpage.
So, how do we find out which syscall puts uses? We can either look through the syscall list, or simply install strace to trace syscalls and write a simple program that uses puts.
The strace method is extremely useful to us. If you don't know how to do something with syscalls, do it with libc, then strace it to see which syscalls it uses on the target architecture.
Let's try this out.
- Our simple program, which uses puts from stdio.h.
#include <stdio.h>

int main(int argc, char* argv[])
{
    puts("Hello, World");
    return 0;
}
- Using strace to decipher the syscall we want.
$ gcc puts.c
$ strace ./a.out > /dev/null
write(1, "Hello, World\n", 13) = 13
exit_group(0) = ?
+++ exited with 0 +++
Note that stdout is redirected to /dev/null in the strace command. That's because strace writes its own output to stderr, and we don't want it mixed with a.out's output.
So we can derive from this that puts uses the write syscall.
Let's check the manpage for write.
$ man 2 write
NAME
write - write to a file descriptor
SYNOPSIS
#include <unistd.h>
ssize_t write(int fd, const void *buf, size_t count);
DESCRIPTION
write() writes up to count bytes from the buffer starting at buf
to the file referred to by the file descriptor fd.
In Linux, there are three standard file descriptors:
- stdin: used to pipe data into the program or read user input.
- stdout: used to output information.
- stderr: used as an alternate output for error messages.
If we read man stdout, we see that these are simply defined as 0, 1, and 2.
So all we have to do is replace our puts() with a write() to file descriptor 1, which is stdout.
So let's try that.
#include <unistd.h>

int main(int argc, char* argv[])
{
    write(1, "Hello, World\n", 13);
    return 0;
}
Let's try to compile that again.
$ gcc -s -O2 -nostdlib hello.c
hello.c: In function 'main':
hello.c:5:5: warning: ignoring return value of 'write' declared with attribute 'warn_unused_result' [-Wunused-result]
5 | write(1, "Hello, World\n", 13);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: warning: cannot find entry symbol _start; defaulting to 0000000000001020
/usr/lib/gcc/x86_64-pc-linux-gnu/12/../../../../x86_64-pc-linux-gnu/bin/ld: /tmp/ccqJWxSf.o: in function `main':
hello.c:(.text.startup+0x16): undefined reference to `write'
collect2: error: ld returned 1 exit status
It seems our write() function is also a part of the standard library. How do we invoke syscalls without having to link the standard lib?
Let's take a look at section A.2.1 Calling Conventions in the AMD64 ABI Specification.
If you're completely clueless about assembly, you should still be able to understand once you see an example.
- User-level applications use as integer registers for passing the sequence %rdi, %rsi, %rdx, %rcx, %r8 and %r9. The kernel interface uses %rdi, %rsi, %rdx, %r10, %r8 and %r9.
- A system-call is done via the syscall instruction. The kernel destroys registers %rcx and %r11.
- The number of the syscall has to be passed in register %rax.
- System-calls are limited to six arguments, no argument is passed directly on the stack.
- Returning from the syscall, register %rax contains the result of the system-call. A value in the range between -4095 and -1 indicates an error, it is -errno.
- Only values of class INTEGER or class MEMORY are passed to the kernel.
System V Application Binary Interface, Appendix A § 2.1, Calling Conventions.
In plain words, all we need to do is write an assembly wrapper that will:
- Take the syscall number followed by either pointers or integers as parameters.
- Set %rax to the syscall number.
- Set %rdi, %rsi, %rdx, %r10, %r8, and %r9 to the parameters. Calls that take fewer than 6 arguments will ignore the excess ones.
- Execute syscall.
- Return the content of %rax.
If we read section 3.4 of the specification or the quick cheatsheet on osdev.org, we will see that on AMD64, the registers used to pass parameters to regular functions are almost the same as those used for syscalls, except that the fourth parameter is passed in %rcx instead of %r10. The return register is also the same (%rax).
This means that our syscall wrapper will only be able to accept and forward a maximum of five parameters, because the first of its six register parameters is already used to pass the syscall number.
We could use the stack to take a seventh argument, but let's not make our lives more complicated when we don't even need to call syscalls with more than five parameters yet.
The Application Binary Interface also states that:
Registers %rbp, %rbx, and %r12 through %r15 “belong” to the calling function and the called function is required to preserve their values. In other words, a called function must preserve these registers’ values for its caller. Remaining registers “belong” to the called function. If a calling function wants to preserve such a register value across a function call, it must save the value in its local stack frame.
This means that we don't have to worry about saving and restoring the values of %rdi, %rsi, %rdx, %r10, %r8, and %r9 inside of our syscall wrapper, because it's up to the caller to save them, and gcc will take care of that because we are calling from C code. The same goes for %rcx and %r11, which the kernel destroys.
Putting this all together gives us our syscall wrapper.
mov %rdi, %rax /* %rax (syscall number) = func param 1 (%rdi) */
mov %rsi, %rdi /* %rdi (syscall param 1) = func param 2 (%rsi) */
mov %rdx, %rsi /* %rsi (syscall param 2) = func param 3 (%rdx) */
mov %rcx, %rdx /* %rdx (syscall param 3) = func param 4 (%rcx) */
mov %r8, %r10 /* %r10 (syscall param 4) = func param 5 (%r8) */
mov %r9, %r8 /* %r8 (syscall param 5) = func param 6 (%r9) */
syscall /* Enter a syscall (return value in %rax) */
ret /* Return value is already in %rax, we can return. */
How do we embed our arbitrary assembly into our program though? One way is via the gcc inline assembler. However, the syntax is ugly.
We're going to write a .S file for the GNU Assembler, and compile and link it to our hello.c program with gcc.
.global syscall5 /* Export syscall5 to other compilation units. */
.text            /* Switch to the .text section, which holds executable code. */
syscall5:
    mov %rdi, %rax
    mov %rsi, %rdi
    mov %rdx, %rsi
    mov %rcx, %rdx
    mov %r8, %r10
    mov %r9, %r8
    syscall
    ret
To find any syscall numbers, refer to filippo.io/linux-syscall-table/.
Additionally, you can simply have the C preprocessor print it for you.
$ printf "#include <sys/syscall.h>\n SYS_write" | gcc -E - | sed "/^#.*/d"
1
- -E runs the preprocessor on the file, expanding all macros and therefore replacing #define constants with their corresponding values.
- The lone - means that we use stdin as input, which we pipe here with printf.
- We simply use sed to remove the lines we don't want; I would assume you know what sed is.
- Optionally, you can use the -m32 flag for 32-bit calls.
Syscall numbers are usually prefixed by SYS_.
Back to our prototype from earlier,
ssize_t write(int fd, const void *buf, size_t count);
ssize_t and size_t are types made available by unistd.h. A quick inspection of the header reveals that they are 64-bit integers, and that the extra s in ssize_t means it is a signed value.
$ printf "#include <unistd.h>" | gcc -E - | grep size_t
typedef long int __blksize_t;
typedef long int __ssize_t;
typedef __ssize_t ssize_t;
typedef long unsigned int size_t;
If we add the -m32 flag, we see that they become 32-bit integers. This means that ssize_t and size_t are the same size as the architecture's pointers.
We can now import syscall5 from hello.s into our hello.c program and make a write function that calls it, as demonstrated below.
void* syscall5(
    void* number,
    void* arg1,
    void* arg2,
    void* arg3,
    void* arg4,
    void* arg5
);

typedef unsigned long int uintptr; /* size_t */
typedef long int intptr; /* ssize_t */

static intptr write(int fd, void const* data, uintptr nbytes)
{
    return (intptr)
        syscall5(
            (void*) 1, /* SYS_write, call number 1 */
            (void*) (intptr) fd,
            (void*) data,
            (void*) nbytes,
            0, /* Ignored */
            0 /* Ignored */
        );
}

int main(int argc, char* argv[])
{
    write(1, "Hello, World\n", 13);
    return 0;
}
See that (void*)(intptr) double cast on fd? If fd is 32-bit and void* is 64-bit, we would get a warning that we are casting an integer to a pointer of a different size, so we need to explicitly specify that we want that conversion by adding the intptr cast.
This should be done every time you cast to and from pointers when the other type is not guaranteed to be the same size as pointers, especially when targeting multiple architectures.
Note how we cast the const qualifier away from data to avoid a warning.
Back to the AMD64 ABI documentation. In figure 3.11, we can see the initial state of the stack.
argc is a non-negative argument count
argv is an array of argument strings, with argv[argc] == 0
Figure 3.11: Initial Process Stack
Purpose                                                                                          Start Address          Length
Unspecified                                                                                      High Address
Information block, including argument strings, environment strings, auxiliary information ...                          varies
Unspecified
Null auxiliary vector entry                                                                                             1 eightbyte
Auxiliary vector entries ...                                                                                            2 eightbytes each
0                                                                                                                       eightbyte
Environment pointers ...                                                                                                1 eightbyte each
0                                                                                                8 + 8 * argc + %rsp    eightbyte
Argument pointers                                                                                8 + %rsp               argc eightbytes
Argument count                                                                                   %rsp                   eightbyte
Undefined                                                                                        Low Address
Although we don't care about this much, right beneath this figure, we have the initial state of the registers, which is very important to us.
- %rbp: The content of this register is unspecified at process initialization time, but the user code should mark the deepest stack frame by setting the frame pointer to zero.
- %rsp: The stack pointer holds the address of the byte with lowest address which is part of the stack. It is guaranteed to be 16-byte aligned at process entry.
- %rdx: A function pointer that the application should register with atexit (BA_OS).
So now we know that %rbp must be zeroed, and that %rsp points to the top of the stack. We don't need to worry about %rdx.
If you don't understand how the stack works, it's just a chunk of memory where data is appended and retrieved at the end. This is done through a push and a pop.
In AMD64's convention, we're actually prepending and removing data at the beginning of the memory sequence, since the stack is said to "grow downwards", which means that when we push something onto the stack, the stack pointer gets lower.
Since the ABI requires the stack pointer to be 16-byte aligned whenever a function is called, we must remember to keep the total size of what we push a multiple of 16. For example, two 64-bit integers are 16 bytes. It's often necessary to either push padding or simply realign the stack pointer when the pushed values don't happen to add up to a multiple of 16.
To put it all together, our _start function needs to do the following.
- Zero %rbp.
- Put argc into %rdi (first parameter for main).
- Put the stack address of argv[0] into %rsi (second parameter for main), which will be interpreted as an array of char pointers.
- Align the stack to 16 bytes.
- Call main.
So, let's do that.
- Our new hello.s should look something like this.
.global _start, syscall5 /* Export both symbols to other compilation units. */
.text                    /* Switch to the .text section, which holds executable code. */
_start:
    xor %rbp, %rbp /* XOR-ing a register with itself sets it to 0. */
    pop %rdi       /* %rdi = argc, adds 8 to %rsp as well. */
    mov %rsp, %rsi /* %rsi = argv, the rest of the stack viewed as an array of char pointers. */
    /**
     * Zero the last four bits of %rsp, aligning it to 16 bytes. Same
     * as "and %rsp, 0xFFFFFFFFFFFFFFF0" because negative numbers are
     * stored in two's complement, so -16 is 2^64 - 16.
     */
    and $-16, %rsp
    call main
    ret
syscall5:
    mov %rdi, %rax
    mov %rsi, %rdi
    mov %rdx, %rsi
    mov %rcx, %rdx
    mov %r8, %r10
    mov %r9, %r8
    syscall
    ret
Unfortunately, upon exit of this program, it throws a segmentation fault.
$ gcc -s -O2 -nostdlib hello.s hello.c
$ ./a.out
Hello, World
Segmentation fault
But why?
When we execute a call instruction, the return address[1] is pushed onto the stack implicitly, and the ret instruction implicitly pops it and jumps to it.
The _start routine is very special: the kernel jumps to it directly instead of calling it, so nobody ever pushed a return address for it. This is our issue. Our ret instruction in _start pops whatever happens to be on top of the stack and jumps to it, which is a memory address that doesn't exist or doesn't contain code relevant to our program, and that triggers an access violation.
We need to tell the OS to kill our process and never reach the ret in _start. The _exit() syscall is just what we need.
[1] The address of the instruction to jump to after a function returns.
- First, let's look at its man page.
$ man 2 _exit
NAME
_exit, _Exit - terminate the calling process
SYNOPSIS
#include <unistd.h>
noreturn void _exit(int status);
#include <stdlib.h>
noreturn void _Exit(int status);
- Now, let's use a preprocessor to locate the syscall number.
$ printf "#include <sys/syscall.h>\n SYS_exit" | gcc -E - | sed "/^#.*/d"
60
The status code will simply be the return value of main, which is stored in %rax as we know.
With this information, let's write a new hello.s.
.global _start, syscall5 /* Export both symbols to other compilation units. */
.text                    /* Switch to the .text section, which holds executable code. */
_start:
    xor %rbp, %rbp /* XOR-ing a register with itself sets it to 0. */
    pop %rdi       /* %rdi = argc, adds 8 to %rsp as well. */
    mov %rsp, %rsi /* %rsi = argv, the rest of the stack viewed as an array of char pointers. */
    /**
     * Zero the last four bits of %rsp, aligning it to 16 bytes. Same
     * as "and %rsp, 0xFFFFFFFFFFFFFFF0" because negative numbers are
     * stored in two's complement, so -16 is 2^64 - 16.
     */
    and $-16, %rsp /* Not using hex, to better show the negative decimal. */
    call main
    /**
     * Our new syscall to SYS_exit.
     */
    mov %rax, %rdi  /* syscall param 1 = %rax (ret value of main) */
    mov $0x3C, %rax /* 0x3C -> 60 in decimal, syscall number for SYS_exit. */
    syscall
    ret /* This should now never be reached. */
syscall5:
    mov %rdi, %rax
    mov %rsi, %rdi
    mov %rdx, %rsi
    mov %rcx, %rdx
    mov %r8, %r10
    mov %r9, %r8
    syscall
    ret
Our program seems to finally terminate correctly!
$ gcc -s -O2 -nostdlib hello.s hello.c
$ ./a.out
Hello, World
We can shrink our executable size by removing unneeded unwind tables. We can do this by running the following.
$ gcc -s -O2 -nostdlib -fno-unwind-tables -fno-asynchronous-unwind-tables hello.s hello.c