Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

@amitsaha
Last active December 29, 2023 19:00
Show Gist options
  • Star 64 You must be signed in to star a gist
  • Fork 15 You must be signed in to fork a gist
  • Save amitsaha/8169242 to your computer and use it in GitHub Desktop.
Save amitsaha/8169242 to your computer and use it in GitHub Desktop.
How does `ls` work?

How does ls work?

I wanted to be really able to explain to a fair amount of detail how does the program ls actually work right from the moment you type the command name and hit ENTER. What goes on in user space and and in kernel space? This is my attempt and what I have learned so far on Linux (Fedora 19, 3.x kernel).

How does the shell find the location of 'ls' ?

In user space, the location of ls is first searched in the locations in PATH. We can see the search by placing break points at the key locations in BASH's source code:

(gdb) b search_for_command
Breakpoint 1 at 0x4661d0: file findcmd.c, line 303.
(gdb) run -c ls
Starting program: /home/gene/work/linux_system_experiments/how-does-ls-work/sources/bash/./bash -c ls

Breakpoint 1, search_for_command (pathname=0x6f42b8 "ls") at findcmd.c:303
303     {
Missing separate debuginfos, use: debuginfo-install ncurses-libs-5.9-11.20130511.fc19.x86_64
(gdb) bt
#0  search_for_command (pathname=0x6f42b8 "ls") at findcmd.c:303
#1  0x0000000000430c68 in execute_disk_command (cmdflags=64, fds_to_close=0x6fe6a8, async=0, pipe_out=-1, pipe_in=-1, command_line=0x6f42c8 "ls", 
redirects=0x0, words=<optimized out>) at execute_cmd.c:4668
#2  execute_simple_command (simple_command=<optimized out>, pipe_in=-1, pipe_out=-1, async=<optimized out>, fds_to_close=0x6fe6a8) at execute_cmd.c:3990
#3  0x000000000043247e in execute_command_internal (command=0x6fe588, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0x6fe6a8) at execute_cmd.c:735
#4  0x000000000046fe8b in parse_and_execute (string=<optimized out>, from_file=from_file@entry=0x4a7799 "-c", flags=flags@entry=4) at evalstring.c:319
#5  0x000000000041d72a in run_one_command (command=<optimized out>) at shell.c:1315
#6  0x000000000041c506 in main (argc=3, argv=0x7fffffffdc38, env=0x7fffffffdc58) at shell.c:688

As you can see from the line #0 above, the argument, pathname refers to the string ls which now has to be searched in the locations specified by the user's PATH variable:

(gdb) b find_user_command_in_path
Breakpoint 2 at 0x465ed0: file findcmd.c, line 552.
(gdb) cont
Continuing.

Breakpoint 2, find_user_command_in_path (name=0x6f42b8 "ls", 
path_list=0x6f7008 "/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/usr/sbin:/home/gene/.local/bin:/home/gene/bin", flags=36) at findcmd.c:552
552     {
(gdb) bt
#0  find_user_command_in_path (name=0x6f42b8 "ls", 
path_list=0x6f7008 "/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/usr/sbin:/home/gene/.local/bin:/home/gene/bin", flags=36) at findcmd.c:552
#1  0x0000000000466220 in find_user_command_internal (flags=36, name=0x6f42b8 "ls") at findcmd.c:268
#2  find_user_command (name=0x6f42b8 "ls") at findcmd.c:213
#3  search_for_command (pathname=0x6f42b8 "ls") at findcmd.c:354
#4  0x0000000000430c68 in execute_disk_command (cmdflags=64, fds_to_close=0x6fe6a8, async=0, pipe_out=-1, pipe_in=-1, command_line=0x6f42c8 "ls", 
redirects=0x0, words=<optimized out>) at execute_cmd.c:4668
#5  execute_simple_command (simple_command=<optimized out>, pipe_in=-1, pipe_out=-1, async=<optimized out>, fds_to_close=0x6fe6a8) at execute_cmd.c:3990
#6  0x000000000043247e in execute_command_internal (command=0x6fe588, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0x6fe6a8) at execute_cmd.c:735
#7  0x000000000046fe8b in parse_and_execute (string=<optimized out>, from_file=from_file@entry=0x4a7799 "-c", flags=flags@entry=4) at evalstring.c:319
#8  0x000000000041d72a in run_one_command (command=<optimized out>) at shell.c:1315
#9  0x000000000041c506 in main (argc=3, argv=0x7fffffffdc38, env=0x7fffffffdc58) at shell.c:688

From the above backtrace we can see that search_for_command() calls find_user_command() and eventually find_user_command_in_path is called (most likely, because the call to find_user_command_internal() fails, since ls is an external command). The find_user_command_in_path() function calls the find_in_path_element() function for each of the locations specifiied in the PATH variable, unless it finds the correct location for the executable file corresponding to the command ls. The following gdb session illustrates this:

(gdb) b find_in_path_element
Breakpoint 3 at 0x465d10: file findcmd.c, line 467.
(gdb) cont
Continuing.

Breakpoint 3, find_in_path_element (name=name@entry=0x6f42b8 "ls", path=path@entry=0x6fea08 "/usr/lib64/qt-3.3/bin", flags=flags@entry=36, 
dotinfop=dotinfop@entry=0x7fffffffd750, name_len=<optimized out>) at findcmd.c:467
467     find_in_path_element (name, path, flags, name_len, dotinfop)
(gdb) bt
#0  find_in_path_element (name=name@entry=0x6f42b8 "ls", path=path@entry=0x6fea08 "/usr/lib64/qt-3.3/bin", flags=flags@entry=36, 
dotinfop=dotinfop@entry=0x7fffffffd750, name_len=<optimized out>) at findcmd.c:467
#1  0x0000000000465fd0 in find_user_command_in_path (name=0x6f42b8 "ls", 
     path_list=0x6f7008 "/usr/lib64/qt-3.3/bin:/usr/lib64/ccache:/usr/local/bin:/usr/bin:/bin:/usr/games:/usr/local/sbin:/usr/sbin:/home/gene/.local/bin:/home/gene/bin", flags=36) at findcmd.c:586
#2  0x0000000000466220 in find_user_command_internal (flags=36, name=0x6f42b8 "ls") at findcmd.c:268
#3  find_user_command (name=0x6f42b8 "ls") at findcmd.c:213
#4  search_for_command (pathname=0x6f42b8 "ls") at findcmd.c:354
#5  0x0000000000430c68 in execute_disk_command (cmdflags=64, fds_to_close=0x6fe6a8, async=0, pipe_out=-1, pipe_in=-1, command_line=0x6f42c8 "ls", 
redirects=0x0, words=<optimized out>) at execute_cmd.c:4668
#6  execute_simple_command (simple_command=<optimized out>, pipe_in=-1, pipe_out=-1, async=<optimized out>, fds_to_close=0x6fe6a8) at execute_cmd.c:3990
#7  0x000000000043247e in execute_command_internal (command=0x6fe588, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0x6fe6a8) at execute_cmd.c:735
#8  0x000000000046fe8b in parse_and_execute (string=<optimized out>, from_file=from_file@entry=0x4a7799 "-c", flags=flags@entry=4) at evalstring.c:319
#9  0x000000000041d72a in run_one_command (command=<optimized out>) at shell.c:1315
#10 0x000000000041c506 in main (argc=3, argv=0x7fffffffdc38, env=0x7fffffffdc58) at shell.c:688

(gdb) cont
Continuing.

Breakpoint 3, find_in_path_element (name=name@entry=0x6f42b8 "ls", path=path@entry=0x6fe948 "/usr/lib64/ccache", flags=flags@entry=36, 
       dotinfop=dotinfop@entry=0x7fffffffd750, name_len=<optimized out>) at findcmd.c:467
467     find_in_path_element (name, path, flags, name_len, dotinfop)
(gdb) cont
Continuing.

Breakpoint 3, find_in_path_element (name=name@entry=0x6f42b8 "ls", path=path@entry=0x6fe948 "/usr/local/bin", flags=flags@entry=36, 
dotinfop=dotinfop@entry=0x7fffffffd750, name_len=<optimized out>) at findcmd.c:467
467     find_in_path_element (name, path, flags, name_len, dotinfop)
(gdb) cont
Continuing.

Breakpoint 3, find_in_path_element (name=name@entry=0x6f42b8 "ls", path=path@entry=0x6fe948 "/usr/bin", flags=flags@entry=36, 
dotinfop=dotinfop@entry=0x7fffffffd750, name_len=<optimized out>) at findcmd.c:467
467     find_in_path_element (name, path, flags, name_len, dotinfop)
(gdb) 

Continuing.
process 15253 is executing new program: /usr/bin/ls

And finally it, finds /usr/bin/ls and calls execve() with to execute the command (More on this in the next section). The stat() system call is used to check the existence of an executable ls in the above locations. This is a snippet of the calls to stat():

stat("/usr/lib64/qt-3.3/bin/ls", 0x7fff8c535c40) = -1 ENOENT (No such
file or directory)
stat("/usr/lib64/ccache/ls", 0x7fff8c535c40) = -1 ENOENT (No such file
or directory)
stat("/usr/local/bin/ls", 0x7fff8c535c40) = -1 ENOENT (No such file or
directory)
stat("/usr/bin/ls", {st_mode=S_IFREG|0755, st_size=120232, ...}) = 0
stat("/usr/bin/ls", {st_mode=S_IFREG|0755, st_size=120232, ...}) = 0

So far, BASH has found out the location of the executable corresponding to the command ls. To get to this point, the filesystem had to be traversed at the locations in PATH. Let us dive into the kernel space to see how this is being done. We will use the following SystemTap script to trace the call and return from the function vfs_fstatat() in the file fs/stat.c:

probe kernel.function("vfs_fstatat@fs/stat.c").call
{
    # we are only interested in calls to vfs_fstatat() from "bash"
    # assuming that only one user is using "bash" to execute "ls"
    if(execname() == "bash")
        printf("%s -> %s %s\n", thread_indent(-1), probefunc(), kernel_string($filename)); 
}

probe kernel.function("vfs_fstatat@fs/stat.c").return
{
    # same as above
    if(execname() == "bash")
        printf("%s <- %s\n", thread_indent(-1), probefunc());
}

probe timer.ms(300000)
{
    exit() 
}

The stat() system call is defined as follows:

SYSCALL_DEFINE2(stat, const char __user *, filename, struct __old_kernel_stat __user *, statbuf)
{
    struct kstat stat;
    int error;

    error = vfs_stat(filename, &stat);
    if (error)
            return error;

    return cp_old_stat(&stat, statbuf);

}

The vfs_stat() function in turn is defined as follows:

int vfs_stat(const char __user *name, struct kstat *stat)
{
    return vfs_fstatat(AT_FDCWD, name, stat, 0);
}

Hence, we trace the call and return from the vfs_fstatat() function which has the following prototype:

int vfs_fstatat(int dfd, const char __user *filename, struct kstat
*stat, int flag)

The parameter, filename is what we are interested in here. When you run the SystemTap script, you will see the following lines:

# stap find_ls.stp

Pass 1: parsed user script and 110 library script(s) using
221520virt/38768res/3072shr/36596data kb, in 120usr/20sys/142real ms.
Pass 2: analyzed script: 3 probe(s), 16 function(s), 4 embed(s), 2
global(s) using 432420virt/92908res/4344shr/91140data kb, in
340usr/410sys/752real ms.
Pass 3: translated to C into
"/tmp/stap10Um7s/stap_3647789271d0793b6962cabaec032633_5961_src.c"
using 429960virt/95976res/7532shr/91140data kb, in
110usr/170sys/284real ms.
Pass 4: compiled C into
"stap_3647789271d0793b6962cabaec032633_5961.ko" in
2510usr/340sys/2707real ms.
Pass 5: starting run.

Now, execute the ls command in another terminal window, you should see these lines in the SystemTap window:

741 bash(18259): -> vfs_fstatat /usr/lib64/qt-3.3/bin/ls
754 bash(18259): <- SYSC_newstat
762 bash(18259): -> vfs_fstatat /usr/lib64/ccache/ls
772 bash(18259): <- SYSC_newstat
780 bash(18259): -> vfs_fstatat /usr/local/bin/ls
790 bash(18259): <- SYSC_newstat
797 bash(18259): -> vfs_fstatat /usr/bin/ls

Note that each of these vfs_fstatat() calls also results in accessing the underlying filesystem (traversing the directories, and files - a topic which we explore in some detail in the third section, since that is similar to how the kernel enables ls to do what it does).

At this stage, we have a fairly reasonable idea of what happens in user space and kernel space so that the location of the program that the ls corresponds to is found. Now, we are ready to see how the executable binary is executed.

How does the shell execute the command?

A call to execve() is made, which is a system call, defined as follows:

SYSCALL_DEFINE3(execve,
               const char __user *, filename,
               const char __user *const __user *, argv,
               const char __user *const __user *, envp)
{
    struct filename *path = getname(filename);
    int error = PTR_ERR(path);
    if (!IS_ERR(path)) {
            error = do_execve(path->name, argv, envp);
            putname(path);
    }
    return error; 
}

Effectively, it is the do_execve() function which does the work. It has the following prototype:

static int do_execve_common(const char *filename,
                            struct user_arg_ptr argv,
                            struct user_arg_ptr envp)

Using the following SystemTap script, we place a probe at this function:

# TODO: find a way to get access to the argv and argp
probe kernel.function("do_execve@fs/exec.c")
{
    if(execname() == "bash")
    printf("%s -> %s %s\n", thread_indent(1), probefunc(), kernel_string($filename));
}

probe timer.ms(30000)
{
    exit()
}

Run this script and execute ls in another terminal window and you will see:

0 bash(26013): -> SyS_execve /usr/bin/ls

There is a bunch of other things that needs to be done before the binary /usr/bin/ls is executed - the program has to be read from the disk, it's binary format needs to be found and the appropriate handling code needs to be invoked which will read the binary into memory. The following SystemTap script probes at some of the key functions which allows us to observe how the /usr/bin/ls binary is loaded into memory:

# TODO: find a way to get access to the argv and argp
probe kernel.function("do_execve_common@fs/exec.c")
{
    if(execname() == "bash")
        printf("%s %s\n", probefunc(), kernel_string($filename)); 
}

probe kernel.function("search_binary_handler@fs/exec.c").call
{
    if(execname() == "bash")
        printf("%s -> %s Executable: %s Interpreter: %s\n", thread_indent(1), probefunc(), kernel_string($bprm->filename),  kernel_string($bprm->interp));
}

probe kernel.function("search_binary_handler@fs/exec.c").return
{
    if(execname() == "bash")
        printf("%s <- %s \n", thread_indent(-1), probefunc());
}

probe kernel.function("open_exec@fs/exec.c").call
{
    if(execname() == "bash")
        printf("%s -> %s %s\n", thread_indent(1), probefunc(), kernel_string($name));
}

probe kernel.function("open_exec@fs/exec.c").return
{
    if(execname() == "bash")
        printf("%s <- %s \n", thread_indent(-1), probefunc());
}

probe kernel.function("load_elf_binary@fs/binfmt_elf.c").call
{
    if(execname() == "bash")
        printf("%s %s: Executable: %s Interpreter: %s\n", thread_indent(1), probefunc(), kernel_string($bprm->filename),  kernel_string($bprm->interp));
}

probe timer.ms(30000)
{
    exit();
}

Run the SystemTap script and execute ls in another window and you will see the following in the SystemTap window:

do_execve_common.isra.24 /bin/ls
     0 bash(7988): -> open_exec /bin/ls
     25 bash(7988): <- do_execve_common.isra.24 
     0 bash(7988): -> search_binary_handler Executable: /bin/ls Interpreter: /bin/ls
     14 bash(7988):  load_elf_binary: Executable: /bin/ls Interpreter: /bin/ls
     27 bash(7988):   -> open_exec /lib64/ld-linux-x86-64.so.2
     54 bash(7988):   <- load_elf_binary 

The search_binary_handler() function iterates the list of currently supported binary formats and once it finds that the executable is a supported format, proceeds to call the appropriate function to load the binary which is the function load_elf_binary() in this case. This becomes quite interesting when we execute a script with a #! (I have some experiments which I hope to share in a next article). Also, note how the glibc loader is also being opened, since /usr/bin/ls dynamically loads glibc into memory. To see how things are different when you compile a program statically, let's consider this C program:

# include <stdio.h>

int main(int argc, char **argv)
{
  printf("Hello World\n");
  return 0;
}

Compile it with gcc -o simple simple.c and execute it while keeping the above SystemTap script running. You will see the following:

do_execve_common.isra.24 ./simple
     0 bash(8102): -> open_exec ./simple
     20 bash(8102): <- do_execve_common.isra.24 
     0 bash(8102): -> search_binary_handler Executable: ./simple Interpreter: ./simple
     13 bash(8102):  load_elf_binary: Executable: ./simple Interpreter:./simple
     26 bash(8102):   -> open_exec /lib64/ld-linux-x86-64.so.2
     56 bash(8102):   <- load_elf_binary 

Now, compile the program, passin the -static flag to gcc as gcc -o simple_static simple.c -static and execute the program. You will see the following output in SystemTap's window:

do_execve_common.isra.24 ./simple_static
    0 bash(8119): -> open_exec ./simple_static
    23 bash(8119): <- do_execve_common.isra.24 
    0 bash(8119): -> search_binary_handler Executable:./simple_static Interpreter: ./simple_static
    20 bash(8119):  load_elf_binary: Executable: ./simple_static Interpreter: ./simple_static

In this case, you can see that the loader is not being opened any more. After this, bunch of things such as setting up the memory areas, copying over the arguments, etc need to happen. I haven't yet gained sufficient clarity here to explain, so I will skip over it now.

How does ls do what it does?

At this stage, the program is in memory and is ready to execute when it gets a chance. So, how does ls read the directories and files from disk and what happens in the kernel space to make that happen? This is what we focus on now. ls uses readdir(3) function to read the directory contents which in turn invokes the getdents() system call defined as follows in fs/readdir.c:

SYSCALL_DEFINE3(getdents, unsigned int, fd, struct linux_dirent __user *, dirent, unsigned int,count)   

To summarize what we are looking to observe here is that the directory entries are ready from the block device (the underlying filesystem), which involves look up in the filesystem's inode table. My filesystem is formatted with btrfs which is compiled as a kernel module and I trace two functions from this module - btrfs_lookup_dentry() which I am fairly certain is doing the look up and the btrfs_real_readdir() function which is reading the directory from the underlying block device.

Once a directory entry is read from the block device, the function filldir() in fs/readdir.c is called (from the likes of it as a callback function) where I believe the in-memory directory entry is being created (This is not specific to the underlying filesystem being btrfs - i have observed this with ext4 too).

The following SystemTap script traces the above functions in addition to the vfs_write() function which finally writes the output to the terminal:

probe module("btrfs").function("btrfs_lookup_dentry").call
{
    if(execname() == "ls")
       printf("%s -> %s\n", thread_indent(1), probefunc());
}

probe module("btrfs").function("btrfs_lookup_dentry").return
{
    if(execname() == "ls")
        printf("%s <- %s\n", thread_indent(-1), probefunc());
}

probe module("btrfs").function("btrfs_real_readdir").call
{
    if(execname() == "ls")    printf("%s -> %s\n", thread_indent(1), probefunc());
} 

probe module("btrfs").function("btrfs_real_readdir").return
{
    if(execname() == "ls")
        printf("%s <- %s\n", thread_indent(-1), probefunc());
}

# for ext4
probe kernel.function("ext4_readdir@fs/ext4/dir.c").call
{
    if(execname() == "ls")
        printf("%s -> %s\n", thread_indent(1), probefunc());
}

probe kernel.function("ext4_readdir@fs/ext4/dir.c").return
{
    if(execname() == "ls")
        printf("%s <- %s\n", thread_indent(-1), probefunc());
}

probe kernel.function("filldir@fs/readdir.c").call
{
    if(execname() == "ls")
        printf("%s -> %s : %s\n", thread_indent(1), probefunc(), kernel_string($name));
}

probe kernel.function("filldir@fs/readdir.c").return
{
    if(execname() == "ls")
        printf("%s <- %s\n", thread_indent(-1), probefunc());
}

probe kernel.function("vfs_write@fs/read_write.c").call
{
    if(execname() == "ls")
        printf("%s -> %s\n", thread_indent(1), probefunc());
}

probe kernel.function("vfs_write@fs/read_write.c").return
{
    if(execname() == "ls")
        printf("%s <- %s\n", thread_indent(-1), probefunc());
}

Run this SystemTap script and execute ls in another terminal window and you should see in the SystemTap window lines such as these:

0 ls(11381): -> btrfs_lookup_dentry
32 ls(11381): <- btrfs_lookup
 0 ls(11381): -> btrfs_lookup_dentry
15 ls(11381): <- btrfs_lookup
 0 ls(11381): -> btrfs_real_readdir
 5 ls(11381):  -> filldir : .
11 ls(11381):  <- btrfs_real_readdir
15 ls(11381):  -> filldir : ..
18 ls(11381):  <- btrfs_real_readdir
37 ls(11381):  -> filldir : sources�
41 ls(11381):  <- btrfs_real_readdir
46 ls(11381):  -> filldir : notes.rst
50 ls(11381):  <- btrfs_real_readdir
54 ls(11381):  -> filldir : ftrace_demo.c
58 ls(11381):  <- btrfs_real_readdir
63 ls(11381):  -> filldir : ftrace_demo.c
.. 
..

0 ls(11381): -> vfs_write
31 ls(11381): <- sys_write
 0 ls(11381): -> vfs_write
17 ls(11381): <- sys_write

The last few lines indicate the output of ls being written to the terminal.

See also

Some random links I came across while researching:

@vishsahu
Copy link

Hi, Nice work ! I'm just not able to get why there was no fork(2) called because if shell as a process internally calls execev(2), if succeeds, it'll never return to shell prompt, worst where subsequent vfs_write output will go? Since you've tested this entire flow, there are 2 options: either there was fork() called where everything happened including execev() in child or I am missing something important here.

@runderwo
Copy link

If the shell is invoked with a single executable as the argument it will directly exec (probably because forking would serve no purpose). You can see the forking behavior by creating a shell script with the sole contents 'ls' and strace-ing it. Look for the clone() syscall. You can also invoke bash through gdb, break on fork and enter 'ls' into the interactive shell yourself to see the caught fork()

@sahilshekhawat
Copy link

sahilshekhawat commented Jun 10, 2016

If shell directly exec then its address space is replaced by the new process. Thats the way fork-exec were designed. I know forking doesn't make any sense when it is directly doing exec but kernel is actually aware of this and does COW (Copy on write) and does checks if an exec is being called right after fork BUT shell does run fork, otherwise, It cannot return back.

@Bhushan-Jagtap-2013
Copy link

Nicely Explained !!

@khireddine10
Copy link

great explanation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment