This document explains some risks of server-side image processing and describes a technique that makes it much safer. I recommend using this technique.
For a web app that is running in Docker, it's helpful to delegate work such as image conversion to a subprocess. We can confine subprocesses so they can only access non-sensitive data by using Linux security features while running them in the same Docker container as the full web app. This allows for complete mitigation of security issues in the subprocesses with maximum convenience and minimal slowdown.
Every few years, a complex package like ImageMagick has a critical security bug; people find about one issue per month, although most are not security-critical. One recent critical issue was found in 2019.
Three techniques are required to accomplish good isolation. All are provided by Google's nsjail, and all of them work within a Docker container in my tests.
- Filtering the Linux syscalls that the dangerous program can make. This is also known as launching oblivious external processes in an environment confined by seccomp.
- Clearing environment variables. This is important because app secrets are often stored in environment variables.
- Using a non-root user ID. This is important so we can use UNIX file permissions to deny access to sensitive paths (e.g. /app if your app code should stay private).
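To make the environment-variable point concrete, here is a minimal sketch that uses plain env -i as a stand-in for the jail. It is not nsjail, and APP_SECRET and show_secret are made-up names. A normal child process inherits every exported variable; a child started with an empty environment sees none of them.

```shell
# Hypothetical secret, exported the way many apps receive credentials.
export APP_SECRET='s3cr3t'

# show_secret LAUNCHER...: start a child shell through LAUNCHER and print
# the secret as the child sees it, or "<unset>" if it is not visible.
show_secret() {
    "$@" /bin/sh -c 'echo "${APP_SECRET:-<unset>}"'
}

inherited=$(show_secret env)      # ordinary child: inherits the variable
cleared=$(show_secret env -i)     # empty environment: the variable is gone

echo "inherited: $inherited"
echo "cleared:   $cleared"
```

As far as I can tell from its help text, nsjail starts the confined program with an empty environment unless --keep_env is given, so this part should need no extra flag; that is worth confirming against nsjail --help for your version.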
You can also call it a 'sensitive compartmented imagemagick facility' or 'scif'. The example here is ImageMagick, but I would recommend applying it to any program that processes user-submitted data, such as a PDF tool.
When a confined process tries to access the network, it gets an error message. Here, we run curl through nsjail attempting to visit 1.1.1.1 on the web; curl cannot connect to the server.
root@9b15d7674d1d:/tmp/nsjail# ./nsjail --seccomp_string 'ALLOW { exit, socketpair, pipe2, futex, statfs, getdents64, getrandom, prlimit64, readlinkat, execve, write, execve, brk, mmap, openat, newfstat, newfstatat, faccessat, close, read, mprotect, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, clone, wait4, rt_sigreturn, exit_group, set_tid_address, set_robust_list, getgroups, lseek, getpgid, pselect6, fcntl } DEFAULT ERRNO(11)' --user 99 --gid 99 --disable_proc --no_pivotroot --mode e --disable_clone_newuser --disable_clone_newuser --disable_clone_newpid --disable_clone_newnet --disable_clone_newns --disable_clone_newipc --disable_clone_newcgroup --disable_clone_newuts -- /usr/bin/curl http://1.1.1.1/
[I][2022-12-29T18:27:49+0000] Mode: STANDALONE_EXECVE
[I][2022-12-29T18:27:49+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/usr/bin/curl', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:false, clone_newuser:false, clone_newns:false, clone_newpid:false, clone_newipc:false, clone_newuts:false, clone_newcgroup:false, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-12-29T18:27:49+0000] Mount: '/' flags:MS_RDONLY type:'tmpfs' options:'' dir:true
[I][2022-12-29T18:27:49+0000] Uid map: inside_uid:99 outside_uid:0 count:1 newuidmap:false
[I][2022-12-29T18:27:49+0000] Gid map: inside_gid:99 outside_gid:0 count:1 newgidmap:true
[I][2022-12-29T18:27:49+0000] Executing '/usr/bin/curl' for '[STANDALONE MODE]'
curl: (7) Couldn't connect to server
Image conversion still works properly. On this system, we have a PNG file in /tmp/png.png and are trying to convert it to JPG.
root@9b15d7674d1d:/tmp/nsjail# ./nsjail --seccomp_string 'ALLOW { exit, socketpair, pipe2, futex, statfs, getdents64, getrandom, prlimit64, readlinkat, execve, write, execve, brk, mmap, openat, newfstat, newfstatat, faccessat, close, read, mprotect, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, clone, wait4, rt_sigreturn, exit_group, set_tid_address, set_robust_list, getgroups, lseek, getpgid, pselect6, fcntl } DEFAULT ERRNO(11)' --user 99 --gid 99 --disable_proc --no_pivotroot --mode e --disable_clone_newuser --disable_clone_newuser --disable_clone_newpid --disable_clone_newnet --disable_clone_newns --disable_clone_newipc --disable_clone_newcgroup --disable_clone_newuts -- /usr/bin/convert -verbose /tmp/png.png /tmp/jpg.jpg
[I][2022-12-29T18:30:54+0000] Mode: STANDALONE_EXECVE
[I][2022-12-29T18:30:54+0000] Jail parameters: hostname:'NSJAIL', chroot:'', process:'/usr/bin/convert', bind:[::]:0, max_conns:0, max_conns_per_ip:0, time_limit:0, personality:0, daemonize:false, clone_newnet:false, clone_newuser:false, clone_newns:false, clone_newpid:false, clone_newipc:false, clone_newuts:false, clone_newcgroup:false, clone_newtime:false, keep_caps:false, disable_no_new_privs:false, max_cpus:0
[I][2022-12-29T18:30:54+0000] Mount: '/' flags:MS_RDONLY type:'tmpfs' options:'' dir:true
[I][2022-12-29T18:30:54+0000] Uid map: inside_uid:99 outside_uid:0 count:1 newuidmap:false
[I][2022-12-29T18:30:54+0000] Gid map: inside_gid:99 outside_gid:0 count:1 newgidmap:true
[I][2022-12-29T18:30:54+0000] Executing '/usr/bin/convert' for '[STANDALONE MODE]'
/tmp/png.png PNG 800x600 800x600+0+0 8-bit sRGB 226933B 76859.160u 0:00.007
/tmp/png.png=>/tmp/jpg.jpg PNG 800x600 800x600+0+0 8-bit sRGB 74131B 73486.250u 0:00.007
The final line of output shows the successful creation of an 800x600 JPG file.
If you want to follow this approach, you'll need to do a few things.
- If your Docker containers are running on x86_64, you may have to change the --seccomp_string parameter. See the appendix on one approach to figuring out a good list.
- You'll have to compile nsjail as part of your app's Dockerfile. You can git clone it from https://github.com/google/nsjail ; to cover its dependencies, run apt -y install build-essential pkg-config flex bison libnl-route-3-dev . You can use a multi-stage Docker build to compile the nsjail program and put it in /usr/local/bin.
- Pick a user ID and group ID to run the confined program. I suggest 65534 because that's the output of id nobody in Ubuntu/Debian.
- In your Dockerfile, use chmod 0700 to limit read access to directories with sensitive data, which may include /app depending on your situation.
- This approach does not limit RAM usage; it would be smart to do that with ulimit.
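The checklist above can be gathered into one wrapper function. This is a sketch under assumptions, not a definitive setup: run_confined, NSJAIL, SECCOMP_POLICY, CONFINED_ID and MAX_AS_MB are names invented here, the nsjail flags are copied from the commands shown earlier, and the memory cap uses nsjail's own --rlimit_as flag (address space in MB) as a substitute for a separate ulimit call. Check that flag against nsjail --help before relying on it.

```shell
# Sketch only: every variable and function name here is illustrative.
NSJAIL=${NSJAIL:-/usr/local/bin/nsjail}
CONFINED_ID=65534      # uid and gid of "nobody" on Ubuntu/Debian
MAX_AS_MB=1024         # address-space cap in MB; an arbitrary starting value

# run_confined CMD [ARGS...]: run CMD under nsjail as the unprivileged user.
# SECCOMP_POLICY must already hold your syscall allow list.
run_confined() {
    if [ -z "${SECCOMP_POLICY:-}" ]; then
        echo "run_confined: SECCOMP_POLICY is not set" >&2
        return 2
    fi
    "$NSJAIL" --seccomp_string "$SECCOMP_POLICY" \
        --user "$CONFINED_ID" --gid "$CONFINED_ID" \
        --rlimit_as "$MAX_AS_MB" \
        --disable_proc --no_pivotroot --mode e \
        --disable_clone_newuser --disable_clone_newpid --disable_clone_newnet \
        --disable_clone_newns --disable_clone_newipc --disable_clone_newcgroup \
        --disable_clone_newuts \
        -- "$@"
}
```

With SECCOMP_POLICY set to the ALLOW { ... } DEFAULT ERRNO(11) string that works for your architecture, a conversion would then be launched as run_confined /usr/bin/convert /tmp/png.png /tmp/jpg.jpg. Refusing to run without a policy is deliberate: a missing allow list should fail loudly rather than silently run unfiltered.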
You may deviate from this plan; here's a roadmap of other options.
- You can use other tools/techniques to set up the seccomp-bpf sandbox. Cloudflare has a seccomp sandbox tool, and you can do it yourself in C. I prefer nsjail because it receives active maintenance. nsjail's primary purpose is to create Linux namespaces for isolation; those are usually forbidden within Docker, but it also allows us to easily use seccomp-bpf, which does what we need.
- You can use other tools/techniques to isolate the program. I'm excited about using WebAssembly as a sandbox someday.
- You can launch the dangerous program in its own Docker container. I find that approach slow, cumbersome, and risky. nsjail launches a subprocess in 20ms, but launching Docker containers is usually slower than that. It's cumbersome because now you have to think about deploying a different service. It's risky because even if you put convert in its own Docker container, you will still need some way to filter out network connectivity, environment variable data, and sensitive files.
Keep in mind that if you reformat images in-process with something like libvips-ruby, confinement becomes vastly more challenging. nsjail can only apply confinement rules to subprocesses.
There are a few important aspects of a seccomp-bpf profile.
- It should block dangerous features like networking. This includes networking done directly via the socket or connect syscalls, and indirectly via io_uring.
- It should allow your desired program to function correctly.
This is an allow list, which is safer than a deny list, but creates the risk of blocking legitimate behavior that wasn't analyzed when creating the allow-list. It may be useful to add a test to the app test suite that shows that the desired programs work properly within the confinement.
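One way such a test could look is a pair of small helpers, one asserting that a command succeeds and one asserting that it fails. This is a hedged sketch: expect_success and expect_failure are names invented here, and it assumes that in --mode e the exit status you see is that of the confined program itself.

```shell
# expect_success DESCRIPTION CMD [ARGS...]: passes when CMD exits with status 0.
expect_success() {
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "ok: $desc"
    else
        echo "FAIL: $desc"
        return 1
    fi
}

# expect_failure DESCRIPTION CMD [ARGS...]: passes when CMD exits non-zero.
expect_failure() {
    desc=$1; shift
    if "$@" >/dev/null 2>&1; then
        echo "FAIL: $desc"
        return 1
    else
        echo "ok: $desc"
    fi
}
```

In a test suite, expect_success would wrap the full nsjail invocation ending in /usr/bin/convert /tmp/png.png /tmp/jpg.jpg, and expect_failure would wrap the same invocation ending in /usr/bin/curl http://1.1.1.1/. A failing curl can have causes other than the filter, so it helps to confirm that the same curl command succeeds outside the jail in the same environment.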
Here's the command I use to see what syscalls might be needed by a subprocess, in this case curl:
strace -f ./nsjail --seccomp_string 'ALLOW { exit, socketpair, pipe2, futex, statfs, getdents64, getrandom, prlimit64, readlinkat, execve, write, execve, brk, mmap, openat, newfstat, newfstatat, faccessat, close, read, mprotect, munmap, getuid, getgid, getpid, rt_sigaction, geteuid, getppid, getcwd, getegid, ioctl, fcntl, clone, wait4, rt_sigreturn, exit_group, set_tid_address, set_robust_list, getgroups, lseek, getpgid, pselect6, fcntl } DEFAULT ERRNO(11)' --user 99 --gid 99 --disable_proc --no_pivotroot --mode e --disable_clone_newuser --disable_clone_newuser --disable_clone_newpid --disable_clone_newnet --disable_clone_newns --disable_clone_newipc --disable_clone_newcgroup --disable_clone_newuts -- /usr/bin/curl http://checkip.dyndns.org/ 2>&1 | grep -i again
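The grep -i again keeps only lines containing EAGAIN, which is errno 11, the value that DEFAULT ERRNO(11) in the filter returns. This allows us to see just the syscalls that were rejected due to the syscall filter.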
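To turn that output into a list of candidate syscall names, a small filter can replace the final grep. This is a sketch that assumes strace's usual line format, name(args) = -1 EAGAIN (...), optionally prefixed with [pid N] when -f is used; rejected_syscalls is a name invented here.

```shell
# rejected_syscalls: read strace output on stdin and print the unique names
# of syscalls that returned EAGAIN. Lines that do not start with a syscall
# name (such as "<... resumed>" continuations) are dropped.
rejected_syscalls() {
    grep 'EAGAIN' \
        | sed -E 's/^\[pid +[0-9]+\] +//' \
        | sed -nE 's/^([a-z_][a-z0-9_]*)\(.*/\1/p' \
        | sort -u
}
```

It would be used by piping the strace command's combined output (2>&1) into rejected_syscalls. Treat the result as candidates rather than a final answer: EAGAIN is also a normal return value for non-blocking I/O, so a syscall that is already on the allow list can still show up here.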