I used to think that executables were totally impenetrable. I'd compile a C program, and then that was it! I had a Magical Binary Executable that I could no longer read.
It is not so! Executable file formats are regular file formats that you can understand. I'll explain some simple tools to start! We'll be working on Linux, with ELF binaries. (Binaries are kind of the definition of platform-specific, so this is all platform-specific.)
Let's write a simple C program, hello.c:
#include <stdio.h>

int main() {
    printf("Hello!\n");
}
Then we compile it (gcc -o hello hello.c), and we have a binary called hello. At first this seems impenetrable (how do we even binary?!), but let's see how we can investigate it! We're going to learn what symbols, sections, and segments are. At a high level:
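- symbols are like function names, and are used to answer "If I call printf and it's defined somewhere else, how do I find it?"
- sections are the named chunks of the binary (like .text and .rodata) that the linker works with
- segments are how those chunks get grouped for loading into memory when the program runs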
So, let's dive into our binary!
The most naive possible way to view a binary is to open it in a text editor. If I do that with my binary, it looks something like this:
@8#T@T 1t@t$D���o�@Nhello
ELF>@@H@8 @@@@�@�V@=^���oV@k���o`@`z�@��@�0Pd@@←←QRd((◆(◆/┌☃b64/┌d
There's text here, though! This was not a total failure. In particular it says "hello" and "ELF". ELF is the name of the binary format. So that's something! Then there are a bunch of unprintable characters, which isn't a huge surprise because this is a binary.
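That "ELF" near the start isn't an accident: every ELF file begins with the same four bytes, 0x7f followed by the letters E, L and F. Here's a minimal sketch of how a C program could check for them. The function name is_elf is made up for this example.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file at `path` starts with the 4-byte ELF magic
 * (0x7f 'E' 'L' 'F'), 0 if it does not, and -1 if it cannot be opened. */
int is_elf(const char *path) {
    static const unsigned char magic[4] = {0x7f, 'E', 'L', 'F'};
    unsigned char buf[4];

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof buf, f);
    fclose(f);

    if (n != sizeof buf)
        return 0;               /* too short to be an ELF file */
    return memcmp(buf, magic, sizeof magic) == 0;
}
```

Calling is_elf("hello") on the binary we just compiled should return 1, and is_elf("hello.c") should return 0.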
Throughout we're going to use a tool called readelf to explore our binary. Let's start by running readelf --symbols on it. (Another popular tool to do this is nm.)
$ readelf --symbols hello
Num: Value Size Type Bind Vis Ndx Name
48: 0000000000000000 0 FUNC GLOBAL DEFAULT UND puts@@GLIBC_2.2.5
59: 0000000000400410 0 FUNC GLOBAL DEFAULT 13 _start
61: 00000000004004f4 16 FUNC GLOBAL DEFAULT 13 main
Here we see three symbols: main is the address of my main() function. puts looks like a reference to the printf function I called in it (the compiler changed it to puts as an optimization, since there was nothing to format). _start is pretty important.
When the program starts running, you might think it starts at main. It doesn't! It actually goes to _start. This does a bunch of Very Important Things that I don't understand very well, including calling main. So I won't explain them.
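One thing that is easy to check is where the starting address is recorded. The ELF header at the very beginning of the file has a field, e_entry, that holds the address where the program's own code starts running. Here's a sketch that reads it using the structs from <elf.h>. The name read_entry_point is made up for this example, and it assumes a 64-bit ELF file with the same byte order as the machine running the code.

```c
#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reads the ELF header of a 64-bit ELF file and stores its entry point
 * address in *entry. Returns 0 on success, -1 on any error. */
int read_entry_point(const char *path, uint64_t *entry) {
    Elf64_Ehdr ehdr;

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t n = fread(&ehdr, 1, sizeof ehdr, f);
    fclose(f);

    if (n != sizeof ehdr)
        return -1;                        /* file shorter than a header */
    if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
        return -1;                        /* not an ELF file */
    if (ehdr.e_ident[EI_CLASS] != ELFCLASS64)
        return -1;                        /* this sketch is 64-bit only */

    *entry = ehdr.e_entry;                /* address where execution begins */
    return 0;
}
```

For the hello binary above I'd expect this to report 0x400410, the address readelf printed for _start, though that depends on how the binary was linked.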
So, what's a symbol?
When you compile a program, you might write a function called hello. This results in a symbol called hello. This is what allows linking to work! If I call a function (like printf) from a library, the linker can find printf by looking it up in the library's symbol table. printf is in libc.
If I run nm on libc, it tells me "no symbols". But the internet tells me I can use objdump -tT instead! This works! objdump -tT /lib/x86_64-linux-gnu/libc-2.15.so gives me this output.
If you look at it, you'll see sprintf, strlen, fork, exec, and everything you might expect libc to have. From here we can start to imagine how dynamic linking works: we see that hello calls puts, and then we can look up the location of puts in libc's symbol table.
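To make "look it up in the symbol table" a bit more concrete, here's a rough sketch of that lookup in C. A symbol table is an array of fixed-size records, and each record stores its name as an offset into a separate string table. The names find_symbol, slurp and in_file are made up for this example, it only handles 64-bit ELF files in the machine's own byte order, and real tools handle plenty of cases this skips (symbol versions, for one).

```c
#include <elf.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Reads a whole file into a malloc'd buffer; returns NULL on failure. */
static unsigned char *slurp(const char *path, size_t *size) {
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;
    unsigned char *buf = NULL;
    long len = -1;
    if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) > 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)len);
        if (buf != NULL && fread(buf, 1, (size_t)len, f) != (size_t)len) {
            free(buf);
            buf = NULL;
        }
    }
    fclose(f);
    if (buf != NULL)
        *size = (size_t)len;
    return buf;
}

/* True if the section's bytes lie entirely inside a file of `size` bytes. */
static int in_file(const Elf64_Shdr *s, size_t size) {
    return s->sh_offset <= size && s->sh_size <= size - s->sh_offset;
}

/* Looks `name` up in the symbol tables of type `table_type` (SHT_SYMTAB or
 * SHT_DYNSYM). Returns 0 and sets *value on success, -1 otherwise. */
int find_symbol(const char *path, uint32_t table_type, const char *name,
                uint64_t *value) {
    size_t size = 0;
    unsigned char *file = slurp(path, &size);
    if (file == NULL)
        return -1;

    int result = -1;
    Elf64_Ehdr eh;
    if (size < sizeof eh)
        goto done;
    memcpy(&eh, file, sizeof eh);
    if (memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_shentsize != sizeof(Elf64_Shdr) || eh.e_shoff > size ||
        (size - eh.e_shoff) / sizeof(Elf64_Shdr) < eh.e_shnum)
        goto done;

    for (size_t i = 0; i < eh.e_shnum && result != 0; i++) {
        Elf64_Shdr tab, str;
        memcpy(&tab, file + eh.e_shoff + i * sizeof tab, sizeof tab);
        if (tab.sh_type != table_type || tab.sh_link >= eh.e_shnum)
            continue;
        /* A symbol table's sh_link is the index of its string table. */
        memcpy(&str, file + eh.e_shoff + tab.sh_link * sizeof str, sizeof str);
        if (!in_file(&tab, size) || !in_file(&str, size) || str.sh_size == 0 ||
            file[str.sh_offset + str.sh_size - 1] != '\0')
            continue;
        const char *names = (const char *)file + str.sh_offset;
        for (size_t j = 0; j < tab.sh_size / sizeof(Elf64_Sym); j++) {
            Elf64_Sym sym;
            memcpy(&sym, file + tab.sh_offset + j * sizeof sym, sizeof sym);
            if (sym.st_name < str.sh_size &&
                strcmp(names + sym.st_name, name) == 0) {
                *value = sym.st_value;   /* the symbol's address */
                result = 0;
                break;
            }
        }
    }
done:
    free(file);
    return result;
}
```

Something like find_symbol("/lib/x86_64-linux-gnu/libc-2.15.so", SHT_DYNSYM, "puts", &addr) should find puts in libc's dynamic symbol table, which is the table the -T in objdump -tT asks for. Passing SHT_SYMTAB instead looks in the regular symbol table, the one nm reads by default.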
Opening our binary in a text editor was a bad way to open it. objdump is a better way. Here's an excerpt:
$ objdump -s hello
Contents of section .text:
400410 31ed4989 d15e4889 e24883e4 f0505449 1.I..^H..H...PTI
400420 c7c0a005 400048c7 c1100540 0048c7c7 ....@.H....@.H..
400430 f4044000 e8c7ffff fff49090 4883ec08 ..@.........H...
Contents of section .interp:
400238 2f6c6962 36342f6c 642d6c69 6e75782d /lib64/ld-linux-
400248 7838362d 36342e73 6f2e3200 x86-64.so.2.
Contents of section .rodata:
4005f8 01000200 48656c6c 6f2100 ....Hello!.
There are a whole bunch of sections here (see this gist for the whole thing). This shows you all the bytes in your binary! Some sections we care about:
- .text is the program's actual code (the compiled machine code). _start and main are both part of the .text section.
- .rodata is where some read-only data is stored (in this case, our string "Hello!")
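The right-hand column of the objdump -s output is just the same bytes shown as text, with a dot for anything unprintable. Here's a small sketch that rebuilds that column for the .rodata bytes in the excerpt. ascii_column is a made-up name for this example, not something objdump exposes.

```c
#include <ctype.h>
#include <stddef.h>

/* Fills `out` with the text column objdump -s prints beside the hex:
 * printable bytes as themselves, everything else as '.'.
 * `out` must have room for n + 1 characters. */
void ascii_column(const unsigned char *bytes, size_t n, char *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = isprint(bytes[i]) ? (char)bytes[i] : '.';
    out[n] = '\0';
}

/* The 11 bytes of .rodata from the excerpt: 01000200 48656c6c 6f2100 */
static const unsigned char rodata[] = {
    0x01, 0x00, 0x02, 0x00, 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x21, 0x00
};
```

Running ascii_column over those 11 bytes gives ....Hello!. and the string itself starts 4 bytes in, at address 0x4005fc, ending with the zero byte that terminates a C string.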
The major difference between sections and segments is that sections are used at link time (by ld) and segments are used at execution time. objdump shows us the contents of the sections, which is nice, but doesn't give us as much metadata about the sections as I'd like. Let's try readelf instead:
$ readelf --sections hello
Section Headers:
[Nr] Name Type Address Offset
Size EntSize Flags Link Info Align
[13] .text PROGBITS 0000000000400410 00000410
00000000000001d8 0000000000000000 AX 0 0 16
[15] .rodata PROGBITS 00000000004005f8 000005f8
000000000000000b 0000000000000000 A 0 0 4
[24] .data PROGBITS 0000000000601010 00001010
0000000000000010 0000000000000000 WA 0 0 8
[25] .bss NOBITS 0000000000601020 00001020
0000000000000010 0000000000000000 WA 0 0 8
[26] .comment PROGBITS 0000000000000000 00001020
000000000000002a 0000000000000001 MS 0 0 1
Key to Flags:
W (write), A (alloc), X (execute), M (merge), S (strings), l (large)
I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
O (extra OS processing required) o (OS specific), p (processor specific)
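readelf gets all of this from the section header table, an array of fixed-size records whose position is stored in the ELF header. Here's a sketch of how you could look up one section yourself. find_section, read_at and flag_letters are names invented for this example, and it assumes a 64-bit ELF file in the machine's own byte order.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Reads `len` bytes at file offset `off`; returns 0 on success. */
static int read_at(FILE *f, long off, void *buf, size_t len) {
    return (fseek(f, off, SEEK_SET) == 0 &&
            fread(buf, 1, len, f) == len) ? 0 : -1;
}

/* Finds the section header called `name` in a 64-bit ELF file.
 * Returns 0 and fills *out on success, -1 if missing or on error. */
int find_section(const char *path, const char *name, Elf64_Shdr *out) {
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    int result = -1;
    Elf64_Ehdr eh;
    Elf64_Shdr names;   /* header of the section-name string table */
    if (read_at(f, 0, &eh, sizeof eh) == 0 &&
        memcmp(eh.e_ident, ELFMAG, SELFMAG) == 0 &&
        eh.e_ident[EI_CLASS] == ELFCLASS64 &&
        eh.e_shentsize == sizeof(Elf64_Shdr) &&
        eh.e_shstrndx < eh.e_shnum &&
        read_at(f, (long)(eh.e_shoff + eh.e_shstrndx * sizeof names),
                &names, sizeof names) == 0) {
        for (size_t i = 0; i < eh.e_shnum; i++) {
            Elf64_Shdr sh;
            char buf[64] = {0};   /* names longer than 63 bytes won't match */
            if (read_at(f, (long)(eh.e_shoff + i * sizeof sh),
                        &sh, sizeof sh) != 0)
                break;
            if (sh.sh_name >= names.sh_size)
                continue;
            /* sh_name is an offset into the section-name string table. */
            size_t avail = names.sh_size - sh.sh_name;
            size_t want = avail < sizeof buf - 1 ? avail : sizeof buf - 1;
            if (read_at(f, (long)(names.sh_offset + sh.sh_name),
                        buf, want) != 0)
                break;
            if (strcmp(buf, name) == 0) {
                *out = sh;
                result = 0;
                break;
            }
        }
    }
    fclose(f);
    return result;
}

/* Writes the W/A/X letters readelf shows in its Flags column. */
void flag_letters(Elf64_Xword flags, char out[4]) {
    size_t n = 0;
    if (flags & SHF_WRITE)     out[n++] = 'W';
    if (flags & SHF_ALLOC)     out[n++] = 'A';
    if (flags & SHF_EXECINSTR) out[n++] = 'X';
    out[n] = '\0';
}
```

Calling find_section("hello", ".text", &sh) and then flag_letters(sh.sh_flags, buf) should give AX for the binary above, matching the Flags column, and .data should give WA.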
Finally, a program is organized into segments, which are described by its program headers. Let's look at the segments for our program using readelf --segments hello.
Program Headers:
[... removed ...]
INTERP 0x0000000000000238 0x0000000000400238 0x0000000000400238
0x000000000000001c 0x000000000000001c R 1
[Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
LOAD 0x0000000000000000 0x0000000000400000 0x0000000000400000
0x00000000000006d4 0x00000000000006d4 R E 200000
LOAD 0x0000000000000e28 0x0000000000600e28 0x0000000000600e28
0x00000000000001f8 0x0000000000000208 RW 200000
[... removed ...]
Section to Segment mapping:
Segment Sections...
00
01 .interp
02 .interp .note.ABI-tag .note.gnu.build-id .gnu.hash .dynsym .dynstr .gnu.version .gnu.version_r .rela.dyn .rela.plt .init .plt .text .fini .rodata .eh_frame_hdr .eh_frame
03 .ctors .dtors .jcr .dynamic .got .got.plt .data .bss
04 .dynamic
05 .note.ABI-tag .note.gnu.build-id
06 .eh_frame_hdr
07
08 .ctors .dtors .jcr .dynamic .got
Segments are used to determine how to separate different parts of the program into memory. The first LOAD segment is marked R E (read/execute) and the second is RW (read/write). .text is in the first segment (we want to read and execute it but never write to it), and .data and .bss are in the second (we need to write to them, but not execute them).
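The program headers can be read the same way as everything else: the ELF header says where the table is (e_phoff) and how many entries it has (e_phnum). Here's a sketch that collects the LOAD segments and formats their permissions the way readelf does. load_segments and segment_perms are made-up names, and it assumes a 64-bit ELF file in the machine's own byte order.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Copies up to `max` PT_LOAD program headers from a 64-bit ELF file into
 * `out`. Returns how many were copied, or -1 on error. */
int load_segments(const char *path, Elf64_Phdr *out, int max) {
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;

    int count = -1;
    Elf64_Ehdr eh;
    if (fread(&eh, 1, sizeof eh, f) == sizeof eh &&
        memcmp(eh.e_ident, ELFMAG, SELFMAG) == 0 &&
        eh.e_ident[EI_CLASS] == ELFCLASS64 &&
        eh.e_phentsize == sizeof(Elf64_Phdr) &&
        fseek(f, (long)eh.e_phoff, SEEK_SET) == 0) {
        count = 0;
        for (size_t i = 0; i < eh.e_phnum && count < max; i++) {
            Elf64_Phdr ph;
            if (fread(&ph, 1, sizeof ph, f) != sizeof ph) {
                count = -1;              /* table runs past end of file */
                break;
            }
            if (ph.p_type == PT_LOAD)    /* keep only the LOAD segments */
                out[count++] = ph;
        }
    }
    fclose(f);
    return count;
}

/* Formats segment permissions as three characters, e.g. "R E" or "RW ". */
void segment_perms(Elf64_Word flags, char out[4]) {
    out[0] = (flags & PF_R) ? 'R' : ' ';
    out[1] = (flags & PF_W) ? 'W' : ' ';
    out[2] = (flags & PF_X) ? 'E' : ' ';
    out[3] = '\0';
}
```

For hello, I'd expect this to find the two LOAD segments shown in the readelf output, with permissions R E and RW.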
Executables aren't magic. ELF is a file format like any other! You can use readelf, nm, and objdump to inspect your Linux binaries. Try it out! Have fun.
Other resources:
- I found this introduction to ELF helpful for explaining sections and segments
- There's a wonderful graphic showing the structure of an ELF binary.
- For learning more about how linkers work, there's a wonderful 20 part series about linkers, which I wrote about here and here.