@paulproteus
Last active November 1, 2023 16:39
nsjail within Docker (aarch64)

Overview

This document explains some risks of server-side image processing and describes a technique that makes it much safer. I recommend using this technique.

Strategy

For a web app that is running in Docker, it's helpful to delegate work such as image conversion to a subprocess. Using Linux security features, we can confine such subprocesses so they can only access non-sensitive data, while still running them in the same Docker container as the full web app. This allows for complete mitigation of security issues in the subprocesses with maximum convenience and minimal slowdown.

Every few years, complex packages like ImageMagick have critical security bugs. People find about one issue per month, although most are not security-critical. One recent critical bug was found in 2019.

Three techniques are required to accomplish good isolation. All three are provided by Google's nsjail, and all of them work within a Docker container in my tests.

  • Filtering the Linux syscalls that the dangerous program can make. This is also known as launching oblivious external processes in an environment confined by seccomp.
  • Clearing environment variables. This is important because app secrets are often stored in environment variables.
  • Using a non-root user ID. This is important so we can use UNIX file permissions to deny access to sensitive paths (e.g. /app if your app code should stay private).

You can also call it a 'sensitive compartmented imagemagick facility' or 'scif'. The example here is ImageMagick, but I would recommend applying it to any program that processes user-submitted data, such as a PDF tool.
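
To make the second technique concrete, here is a minimal illustration of what clearing the environment does to a child process. It uses env -i from coreutils as a stand-in for the clearing that nsjail performs, so it shows the effect rather than how nsjail does it. DATABASE_PASSWORD and its value are hypothetical.

```shell
#!/bin/sh
# Illustration only: env -i stands in for nsjail's environment clearing.
# DATABASE_PASSWORD is a hypothetical secret name, not something from this setup.

export DATABASE_PASSWORD='hunter2'

# A tiny probe that reports what the child process can see.
probe='echo "${DATABASE_PASSWORD:-unset}"'

# An ordinary subprocess inherits the secret from its parent.
inherited=$(/bin/sh -c "$probe")

# A subprocess started with an empty environment does not.
cleared=$(env -i /bin/sh -c "$probe")

echo "inherited environment: $inherited"
echo "cleared environment:   $cleared"
```

If the confined program is ever compromised, anything it could read from its environment should be assumed leaked, which is why the cleared case is the one we want.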

Demonstrations

When a confined process tries to access the network, it gets an error message. Here, we run curl through nsjail attempting to visit 1.1.1.1 on the web; curl cannot connect to the server.

root@9b15d7674d1d:/tmp/nsjail# ./nsjail --seccomp_string 'ALLOW { exit, socketpair, pipe2, futex, statfs, getdents64, getrandom, prlimit64, readlinkat, execve, write, execve, brk, mmap, openat, newfstat, newfstatat, faccessat, close, read, mprotect, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, clone, wait4, rt_sigreturn, exit_group, set_tid_address, set_robust_list, getgroups, lseek, getpgid, pselect6, fcntl } DEFAULT ERRNO(11)'  --user 99 --gid 99 --disable_proc --no_pivotroot --mode e --disable_clone_newuser --disable_clone_newuser --disable_clone_newpid --disable_clone_newnet --disable_clone_newns --disable_clone_newipc --disable_clone_newcgroup  --disable_clone_newuts  -- /usr/bin/curl http://1.1.1.1/
[I][2022-12-29T18:27:49+0000] Mode: STANDALONE_EXECVE
[I][2022-12-29T18:27:49+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/usr/bin/curl', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:false, clone_newuser:false, clone_newns:false, clone_newpid:false, clone_newipc:false, clone_newuts:false, clone_newcgroup:false, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-12-29T18:27:49+0000] Mount: '/' flags:MS_RDONLY type:'tmpfs' options:'' dir:true
[I][2022-12-29T18:27:49+0000] Uid map: inside_uid:99 outside_uid:0 count:1 newuidmap:false
[I][2022-12-29T18:27:49+0000] Gid map: inside_gid:99 outside_gid:0 count:1 newgidmap:true
[I][2022-12-29T18:27:49+0000] Executing '/usr/bin/curl' for '[STANDALONE MODE]'
curl: (7) Couldn't connect to server

Image conversion still works properly. On this system, we have a PNG file in /tmp/png.png and are trying to convert it to JPG.

root@9b15d7674d1d:/tmp/nsjail# ./nsjail --seccomp_string 'ALLOW { exit, socketpair, pipe2, futex, statfs, getdents64, getrandom, prlimit64, readlinkat, execve, write, execve, brk, mmap, openat, newfstat, newfstatat, faccessat, close, read, mprotect, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, clone, wait4, rt_sigreturn, exit_group, set_tid_address, set_robust_list, getgroups, lseek, getpgid, pselect6, fcntl } DEFAULT ERRNO(11)'  --user 99 --gid 99 --disable_proc --no_pivotroot --mode e --disable_clone_newuser --disable_clone_newuser --disable_clone_newpid --disable_clone_newnet --disable_clone_newns --disable_clone_newipc --disable_clone_newcgroup  --disable_clone_newuts  -- /usr/bin/convert -verbose /tmp/png.png /tmp/jpg.jpg 
[I][2022-12-29T18:30:54+0000] Mode: STANDALONE_EXECVE
[I][2022-12-29T18:30:54+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/usr/bin/convert', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:false, clone_newuser:false, clone_newns:false, clone_newpid:false, clone_newipc:false, clone_newuts:false, clone_newcgroup:false, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-12-29T18:30:54+0000] Mount: '/' flags:MS_RDONLY type:'tmpfs' options:'' dir:true
[I][2022-12-29T18:30:54+0000] Uid map: inside_uid:99 outside_uid:0 count:1 newuidmap:false
[I][2022-12-29T18:30:54+0000] Gid map: inside_gid:99 outside_gid:0 count:1 newgidmap:true
[I][2022-12-29T18:30:54+0000] Executing '/usr/bin/convert' for '[STANDALONE MODE]'
/tmp/png.png PNG 800x600 800x600+0+0 8-bit sRGB 226933B 76859.160u 0:00.007
/tmp/png.png=>/tmp/jpg.jpg PNG 800x600 800x600+0+0 8-bit sRGB 74131B 73486.250u 0:00.007

The final line of output shows the successful creation of an 800x600 JPG file.

Implementation advice

If you want to follow this approach, you'll need to do a few things.

  • If your Docker containers are running on x86_64, you may have to change the --seccomp_string parameter, because the list shown here was developed on aarch64. See the appendix for one approach to figuring out a good list.
  • You'll have to compile nsjail as part of your app's Dockerfile. You can git clone it from https://github.com/google/nsjail ; to cover its dependencies, run apt -y install build-essential pkg-config flex bison libnl-route-3-dev. You can use a multi-stage Docker build to compile the nsjail program and put it in /usr/local/bin.
  • Pick a user ID and group ID to run the confined program. I suggest 65534 because that's the UID and GID that id nobody reports on Ubuntu/Debian.
  • In your Dockerfile, use chmod 0700 to limit read access to directories with sensitive data, which may include /app depending on your situation.
  • This approach does not limit RAM usage; it would be smart to do that with ulimit.
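
The advice above could be tied together in a small wrapper, sketched here under a few assumptions. The flags and the allow list are copied from the aarch64 demonstrations (with duplicate entries dropped). The names scif_run, NSJAIL, SCIF_ID and SCIF_SECCOMP are hypothetical, and running as 65534 instead of the 99 used in the demonstrations is untested here.

```shell
#!/bin/sh
# Sketch of a reusable wrapper around the nsjail command line shown in the
# demonstrations. All names below are hypothetical.

NSJAIL="${NSJAIL:-/usr/local/bin/nsjail}"   # install location suggested above
SCIF_ID="${SCIF_ID:-65534}"                 # "nobody" on Ubuntu/Debian (assumption: untested)

# aarch64 allow list from the demonstrations; x86_64 may need a different list.
SCIF_SECCOMP='ALLOW { exit, socketpair, pipe2, futex, statfs, getdents64, getrandom, prlimit64, readlinkat, execve, write, brk, mmap, openat, newfstat, newfstatat, faccessat, close, read, mprotect, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, clone, wait4, rt_sigreturn, exit_group, set_tid_address, set_robust_list, getgroups, lseek, getpgid, pselect6 } DEFAULT ERRNO(11)'

# Run "$@" under the same confinement flags as the demonstrations.
scif_run() {
    "$NSJAIL" --seccomp_string "$SCIF_SECCOMP" \
        --user "$SCIF_ID" --gid "$SCIF_ID" \
        --disable_proc --no_pivotroot --mode e \
        --disable_clone_newuser --disable_clone_newpid --disable_clone_newnet \
        --disable_clone_newns --disable_clone_newipc --disable_clone_newcgroup \
        --disable_clone_newuts \
        -- "$@"
}

# Example (not executed here):
#   scif_run /usr/bin/convert -verbose /tmp/png.png /tmp/jpg.jpg
```

Keeping the flags in one function means the web app only ever calls scif_run, so a change to the allow list happens in a single place.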

You may deviate from this plan; here's a roadmap of other options.

  • You can use other tools/techniques to set up the seccomp-bpf sandbox. Cloudflare has a seccomp sandbox tool, and you can do it yourself in C. I prefer nsjail because it receives active maintenance. nsjail's primary purpose is to create Linux namespaces for isolation; those are usually forbidden within Docker, but it also allows us to easily use seccomp-bpf, which does what we need.
  • You can use other tools/techniques to isolate the program. I'm excited about using WebAssembly as a sandbox someday.
  • You can launch the dangerous program in its own Docker container. I find that approach slow, cumbersome, and risky. nsjail launches a subprocess in 20ms, but launching Docker containers is usually slower than that. It's cumbersome because now you have to think about deploying a different service. It's risky because even if you put convert in its own Docker container, you will still need some way to filter out network connectivity, environment variable data, and sensitive files.

Keep in mind that if you reformat images in-process with something like libvips-ruby, confinement becomes vastly more challenging. nsjail can only apply confinement rules to subprocesses.

Appendix: Writing a good seccomp-bpf profile

There are a few important aspects of a seccomp-bpf profile.

  • It should block dangerous features like networking. This includes direct access via the socket or connect syscalls, and indirect access via io_uring.
  • It should allow your desired program to function correctly.

This is an allow list, which is safer than a deny list, but it creates the risk of blocking legitimate behavior that wasn't analyzed when the allow list was written. It may be useful to add a test to the app test suite that shows that the desired programs work properly within the confinement.
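
Such a test might look roughly like the sketch below. The helpers expect_success and expect_failure are hypothetical, as is the scif_run wrapper named in the usage comment. The sketch also assumes that the wrapper's exit status reflects the confined program's exit status.

```shell
#!/bin/sh
# Sketch of confinement checks for an app test suite. All names are hypothetical.

failures=0

# Pass when the given command exits with status 0.
expect_success() {
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok: $desc"
    else
        echo "FAIL: $desc"
        failures=$((failures + 1))
    fi
}

# Pass when the given command exits with a non-zero status.
expect_failure() {
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "FAIL: $desc"
        failures=$((failures + 1))
    else
        echo "ok: $desc"
    fi
}

# Example usage, assuming a wrapper named scif_run that applies the nsjail flags:
#   expect_success "convert works when confined" \
#       scif_run /usr/bin/convert /tmp/png.png /tmp/jpg.jpg
#   expect_failure "network is blocked when confined" \
#       scif_run /usr/bin/curl http://1.1.1.1/
#   [ "$failures" -eq 0 ]
```

Checking both directions matters: the first check catches an allow list that is too tight, and the second catches one that is too loose.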

Here's the command I use to see what syscalls might be needed by a subprocess, in this case curl:

strace -f ./nsjail --seccomp_string 'ALLOW { exit, socketpair, pipe2, futex, statfs, getdents64, getrandom, prlimit64, readlinkat, execve, write, execve, brk, mmap, openat, newfstat, newfstatat, faccessat, close, read, mprotect, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, clone, wait4, rt_sigreturn, exit_group, set_tid_address, set_robust_list, getgroups, lseek, getpgid, pselect6, fcntl } DEFAULT ERRNO(11)'  --user 99 --gid 99 --disable_proc --no_pivotroot --mode e --disable_clone_newuser --disable_clone_newuser --disable_clone_newpid --disable_clone_newnet --disable_clone_newns --disable_clone_newipc --disable_clone_newcgroup  --disable_clone_newuts  -- /usr/bin/curl http://checkip.dyndns.org/ 2>&1 | grep -i again

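The filter ends with DEFAULT ERRNO(11), and errno 11 is EAGAIN on Linux (aarch64 and x86_64), so every rejected syscall shows up in the strace output as an EAGAIN failure. The grep -i again allows us to see just those lines.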
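
To turn that output into a list of candidate syscall names, a small filter like the following could replace the grep. This is a sketch: rejected_syscalls is a hypothetical name, and it assumes the usual strace line shape of name(args) = -1 EAGAIN (...), optionally prefixed by [pid N] when -f is used. Calls that strace splits into "unfinished" and "resumed" lines are not matched.

```shell
#!/bin/sh
# Sketch: reduce strace output to the unique names of syscalls that failed
# with EAGAIN. rejected_syscalls is a hypothetical helper name.

rejected_syscalls() {
    # Capture the syscall name before "(", skipping an optional "[pid N] " prefix,
    # and keep only lines whose result is "-1 EAGAIN".
    sed -n 's/^\(\[pid *[0-9]*] \)\{0,1\}\([a-z0-9_]*\)(.*= -1 EAGAIN .*$/\2/p' | sort -u
}

# Example (not executed here), using the same strace command as above:
#   strace -f ./nsjail ... -- /usr/bin/curl http://checkip.dyndns.org/ 2>&1 | rejected_syscalls
```

An allowed syscall can also return a genuine EAGAIN (for example a non-blocking read), so treat the result as a list of candidates to review rather than a list to paste into the allow list.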
