@shumbo
Created May 29, 2023 07:40

Project 5: Container

GOAL: This project sums up the knowledge you have learned about operating systems: you will create a small container runtime that can run a containerized process on Linux.

Introduction to Containers

What is a Container?

Throughout the course, we have been using Docker and Dev Container to run our code. But what exactly is a container?

A container is a lightweight and isolated execution environment that encapsulates an application and its dependencies. It provides a consistent and reproducible environment across different systems, allowing applications to run reliably regardless of the underlying infrastructure.

Container Image

Containers are created from container images, which are self-contained packages that include the application code, runtime, system tools, libraries, and configuration files required for the application to run.

Container Runtime

A container runtime is responsible for managing the lifecycle of containers. It provides an interface between the container and the host operating system, orchestrating the necessary resources and ensuring isolation and security within the container environment.

Providing Isolation

One of the most important features of a container runtime is isolation: whatever happens inside a container does not affect the host system or other containers.

Modern container runtimes provide isolation for many OS abstractions, such as CPU, memory, network, etc. In this project, we focus on two essential abstractions: processes and files.

Isolating Processes: PID Namespace

Our container runtime should provide isolation for processes. For example, the processes running on the host should not be visible to the processes inside the container.

Let's check how Docker isolates processes. We use the ps command to see the information about running processes.

If you execute ps -A, it lists all processes that are running on the system.

    PID TTY          TIME CMD
      1 ?        06:30:08 systemd
      2 ?        00:00:44 kthreadd
      3 ?        00:00:00 rcu_gp
      4 ?        00:00:00 rcu_par_gp
      6 ?        00:00:00 kworker/0:0H-kblockd
      ...

This is the result of ps -A on my server. We can see many processes are running.

Let us create a new Docker container and execute the same command inside the container:

$ docker run --rm -it alpine
$ ps -A
PID   USER     TIME  COMMAND
    1 root      0:00 /bin/sh
    7 root      0:00 ps -A

The first command creates a new Docker container using the alpine image. It opens a shell inside the container, so we run ps -A. Even though many processes are running on the host system, they are not visible inside the container.

In Linux, the isolation of processes is achieved using the PID namespace. A process ID (PID) is a unique number that identifies a running process. Every process belongs to a PID namespace and it can only see processes in the same PID namespace.

Isolating Filesystem: Overlay Filesystem

Our container runtime also supports isolating the filesystem. Each container has its own filesystem and cannot affect the host or other containers.

Let's check how Docker provides filesystem isolation. We use the same alpine image.

In one terminal, let us create a container and create a file foo in its root directory.

$ docker run --rm -it alpine
$ echo foo > foo
$ ls /
bin    etc    home   media  opt    root   sbin   sys    usr
dev    foo    lib    mnt    proc   run    srv    tmp    var

Open another terminal, create a new container, and run ls.

$ docker run --rm -it alpine
$ ls /
bin    etc    lib    mnt    proc   run    srv    tmp    var
dev    home   media  opt    root   sbin   sys    usr

Even though both containers are created from the same image, foo is not visible in the new container.

One way to achieve this is to use the overlay filesystem. The overlay filesystem is a type of union filesystem that allows multiple directories to be mounted together, presenting a single unified view. It overlays a read-write filesystem on top of a read-only filesystem, creating a combined view that appears as a single coherent filesystem.

When a file or directory is accessed, the Overlay filesystem looks for it in the topmost layer first. If the file is found, it is returned. If not, the filesystem searches the lower layers in a specific order until it locates the file. This allows modifications to be made to the topmost layer, while the lower layers remain unchanged. Changes made to the topmost layer are stored separately, without modifying the underlying read-only layers.

Let's see how the overlay filesystem works with a simple example.

To use the overlay filesystem, we need three directories: lower, upper, and work. lower will be a read-only directory that provides an "image", upper will store all changes on top of lower, and work is used by the overlay filesystem as a workspace.

First, we use the following commands to create the directories. merged is the directory at which we will mount the overlay filesystem.

mkdir lower upper work merged

Then, we create a read-only file in the lower directory.

echo "this is in lower" > lower/foo

Now, we are ready to create a new overlay filesystem. Use the following command to create an overlay filesystem and mount it as merged.

mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged

The overlay filesystem is now mounted at merged and provides a unified view of lower and upper.

$ ls merged
foo
$ cat merged/foo
this is in lower

Let's make a change to the file in the overlay filesystem.

echo "new foo" > merged/foo

Because lower/foo is read-only, it cannot be modified. Instead, the overlay filesystem creates upper/foo with the updated content. Because upper/foo now exists, merged/foo refers to the updated content at upper/foo.

$ cat merged/foo
new foo
$ cat upper/foo
new foo
$ cat lower/foo
this is in lower

To provide an isolated filesystem for each container, we make an overlay filesystem. The lower directory is the container image that stores all files and directories needed for that container. Each container gets a unique upper directory, so changes by the container do not affect other containers.

Implementing container.c

The goal of this project is to complete container.c. It is capable of creating a container from an image and executing a command inside the container.

Creating an image directory

In container.c, an image is a directory under ./images that stores all the files and directories required for the system. You can think of an image directory as a snapshot of the system root directory.

The easiest way to create an image is to use the docker export command. Let us create an image directory from the alpine docker image.

First, we create a Docker container using docker run --rm -it alpine sh. This opens a shell inside the newly created container.

Second, we need to get the ID of the container. Open a new terminal and run docker ps to see the list of running Docker containers.

$ docker ps
CONTAINER ID   IMAGE     COMMAND   CREATED         STATUS         PORTS     NAMES
f1cf18783484   alpine    "sh"      2 minutes ago   Up 2 minutes             boring_sanderson

Find the container and copy the container ID (f1cf18783484 in this case).

Then, we run docker export {container ID} > alpine.tar to create a tarball of the image.

docker export f1cf18783484 > alpine.tar

Finally, we extract files in the tarball to ./images/{image name} using the following commands.

$ mkdir images/alpine
$ tar -xf alpine.tar -C images/alpine
$ ls images/alpine
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

Now, we have the image directory called alpine. We use this identifier to specify the image.

Command-line Interface

container takes three or more arguments.

$ ./container
Usage: ./container [ID] [IMAGE] [CMD]...
  • The first argument (ID) specifies the unique ID of the container.
    • Docker assigns this at random, but we require the user to provide one.
    • The ID can be at most 16 characters (CONTAINER_ID_MAX).
  • The second argument (IMAGE) specifies the image to create a container from.
    • ./images/{IMAGE} must exist and store all the files required for this container.
  • The rest of the arguments specify the commands to run inside the container.
    • It can be more than one as the user might provide options.

For example, ./container my-container alpine echo "hello world" will

  1. Create a container with ID my-container
  2. Use the image located at ./images/alpine
  3. Execute echo "hello world" inside the container
sudo ./container my-container alpine echo "hello world"
hello world

main

You will need to complete two functions in container.c: main and container_exec.

main is the entry point of the command-line interface. It needs to parse the command-line arguments (argv) and create a child process by calling clone with appropriate parameters.

int clone_flags = SIGCHLD | CLONE_NEWNS | CLONE_NEWPID;
int pid = clone(container_exec, &child_stack, clone_flags, &container);

clone works similarly to the fork system call and creates a child process. The child process executes the container_exec function and takes container as an argument, just like how we passed arguments to a new thread in pthread_create. By passing these three flags, the child process gets separate PID and mount namespaces that provide isolation.

Add fields to the container struct and fill values in main so container_exec will have enough information to create a container from an image and execute the command.

container_exec

main executes container_exec in a child process with separate PID and mount namespaces. container_exec needs to

  1. create and mount an overlay filesystem
  2. call change_root
  3. use execvp to run the command

Creating an overlay filesystem

container_exec needs to create an overlay filesystem. The merged directory will have everything inside the image directory plus the changes made inside the container, and will be used as the root of the filesystem inside the container.

To create an overlay filesystem, use the mount function.

int mount(const char *source, const char *target,
                 const char *filesystemtype, unsigned long mountflags,
                 const void *data);
  • source is often a path referring to a device. Because we are not mounting a device, use the dummy string "overlay"
  • target specifies the directory at which to create the mount point. Use the merged directory path: /tmp/container/{id}/merged.
  • filesystemtype specifies the type of the filesystem. Use "overlay".
  • mountflags provides options. Use MS_RELATIME.
  • data provides options specific to the filesystem. The overlay filesystem takes the three arguments (lowerdir, upperdir, workdir) in the format: lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}. Construct a string of this format and pass the pointer.

lowerdir should be the image directory. In principle, upperdir and workdir can be any directory, but in order for the overlay filesystem to work inside the Dev Container, those directories must be inside /tmp/container. main creates this directory.

Use /tmp/container/{id}/upper and /tmp/container/{id}/work for upperdir and workdir, respectively. In order for mount to work, those directories must exist. Use mkdir to create a directory if it does not exist.

For example, if the current directory is /workspaces/project5-container, the container ID is my-container, and image name is alpine, you need to call

mount(
    "overlay",
    "/tmp/container/my-container/merged",
    "overlay",
    MS_RELATIME,
    "lowerdir=/workspaces/project5-container/images/alpine,upperdir=/tmp/container/my-container/upper,workdir=/tmp/container/my-container/work"
);

Change the root mount

Now, the overlay filesystem is mounted at /tmp/container/{id}/merged. We want the child process to treat this as the root directory.

pivot_root is the system call to achieve this. Because calling pivot_root is complex and tedious, we provide a helper function to do so.

void change_root(const char* path)

Provide the path to the "merged" directory to the change_root function. It will call pivot_root to change the root directory to the "merged" directory and ensure it cannot access outside directories.

change_root also does a couple more things to ensure the container works properly, such as setting the PATH environment variable.

Execute the command

At this point, the child process has its own PID namespace and the overlay filesystem as its root directory. The last step is to execute the specified command.

Use execvp(3) so it can execute commands without specifying the full path to the executable.

int execvp(const char *file, char *const argv[]);

file specifies the name of the command and argv specifies the entire arguments. argv needs to be null-terminated.

For example, if the command is echo "hello world", you should call

char *argument_list[] = {"echo", "hello world", NULL};
execvp(argument_list[0], argument_list);

Testing

Testing with the alpine image

You can test that your container runtime works properly using the alpine image described earlier. In particular, we want to make sure that processes and the filesystem are isolated.

To check the process isolation, we can use the ps command.

$ sudo ./container my-container alpine sh
--- inside container ---
$ ps -A
PID   USER     TIME  COMMAND
    1 root      0:00 sh
    2 root      0:00 ps -A

ps -A must not print the processes running on host. The command used to create the container (sh in the above example) should have PID 1.

To check the file system isolation, you can use cd inside the container to try to get out of the file system. If change_root is called properly, you should not be able to get out of the overlay filesystem.

$ sudo ./container my-container alpine sh
# inside container
$ cd /../../
$ ls
bin    dev    etc    home   lib    media  mnt    opt    proc   root   run    sbin   srv    sys    tmp    usr    var

Any changes made inside the container should be visible at the upper directory.

$ sudo ./container my-container alpine sh
--- inside container ---
$ echo hello from container > hello.txt
$ exit
--- returned to host ---
$ sudo cat /tmp/container/my-container/upper/hello.txt
hello from container

Testing with other images

While our container runtime is minimal, it is capable of running a variety of images. Once you are done testing with the alpine image, try using your container runtime to execute your favorite image.

For example, here is how to execute JavaScript (Node.js) using the node:18-alpine image.

# follow similar steps to create an image directory
$ docker pull node:18-alpine
$ docker run --rm -it node:18-alpine sh
# in a different terminal
$ docker ps # copy the container ID
$ docker export {container-id} > node.tar
$ mkdir images/node
$ tar -xf node.tar -C images/node

$ sudo ./container node-container node node
Welcome to Node.js v18.16.0.
Type ".help" for more information.
> 
Error: Could not open history file.
REPL session history will not be persisted.
> console.log("hello, world!")
hello, world!
undefined
>

Try running your favorite programming language with your container runtime!

Notes

  • make must create the container executable.
  • All source files must be formatted using clang-format. Run make format to format .c and .h files.
  • The filesystem can end up in a bad state. If filesystems behave strangely, try running the "Dev Containers: Rebuild Container" command in VS Code. This recreates the Dev Container and will likely resolve the issue. You can also try restarting Docker Desktop.

Acknowledgements
