fork() is evil; vfork() is goodness; afork() would be better; clone() is stupid

I recently happened upon an implementation of popen() (different API, same idea) using clone(2), and so I opened an issue requesting use of vfork(2) or posix_spawn() for portability. It turns out that on Linux there's an important advantage to using clone(2). I think I should capture the things I wrote there in a better place. A gist, a blog, whatever.

So here goes.

Long ago, I, like many Unix fans, thought that fork(2) and the fork-exec process spawning model were the greatest thing, and that Windows sucked for only having exec*() and _spawn*(), the last being a Windows-ism.

After many years of experience, I learned that fork(2) is in fact evil. And vfork(2), long said to be evil, is in fact goodness. A slight variant of vfork(2) that avoids the need to block the parent would be even better (see below).

Extraordinary statements require explanation, so allow me to explain.

I won't bother explaining what fork(2) is -- if you're reading this, I assume you know. But I'll explain vfork(2) and why it was said to be harmful. vfork(2) is very similar to fork(2), but the new process it creates runs in the same address space as the parent as if it were a thread, even sharing the same stack as the thread that called vfork(2)! Two threads can't share a stack, so the parent is stopped while the child does its thing: either exec*(2) or _exit(2).
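For concreteness, here is a minimal sketch of the vfork(2) pattern just described. The helper name run_with_vfork() is made up for illustration, and the child does nothing but exec or _exit(), which is all it may safely do while borrowing the parent's stack.

```c
#define _DEFAULT_SOURCE
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] via vfork() + execv().
 * Returns the child's exit status, or -1 on error. */
int run_with_vfork(char *const argv[])
{
    int status;
    pid_t pid = vfork();

    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* Child: runs in the parent's address space, on the parent's
         * stack.  Only exec or _exit() here; never return. */
        execv(argv[0], argv);
        _exit(127);                 /* exec failed */
    }
    /* Parent: resumes only once the child has exec'ed or exited. */
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with { "/bin/sh", "-c", "exit 7", NULL } should return 7; a path that cannot be exec'ed yields 127, following the shell convention.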

Now, 3BSD added vfork(2), and a few years later 4.4BSD removed it as it was by then considered harmful. Most subsequent man pages say as much. But the derivatives of 4.4BSD restored it and do not call it harmful. There's a reason for this: vfork(2) is much cheaper than fork(2) -- much, much cheaper. That's because fork(2) has to either copy the parent's address space, or arrange for copy-on-write (which is supposed to be an optimization to avoid unnecessary copies). But even COW is very expensive because it requires modifying memory mappings, taking expensive page faults, and so on. Modern kernels tend to seed the child with a copy of the parent's resident set, but if the parent has a large memory footprint (e.g., is a JVM), then the RSS will be huge. So fork(2) is inescapably expensive except for small programs with small footprints (e.g., a shell).

So you begin to see why fork(2) is evil. And I haven't yet gotten to fork-safety perils! Fork-safety considerations are a lot like thread-safety, but it is harder to make libraries fork-safe than thread-safe. I'm not going to go into fork-safety here: it's not necessary.

(Before I go on I should admit to hypocrisy: I do write code that uses fork(2), often for multi-processing daemons -- as opposed to multi-threading, though I often do the latter as well. But the forks there happen very early on when nothing fork-unsafe has happened yet and the address space is small, thus avoiding most evils of fork(2). vfork(2) cannot be used for this purpose. On Windows one would have to CreateProcess() or _spawn() to implement multi-processed daemons, which is a huge pain in the neck.)

Why did I ever think fork(2) was elegant then? It was the same reason that everyone else did and does: CreateProcess*(), _spawn() and posix_spawn() and such functions are extremely complex, and they have to be because there is an enormous number of things one might do between fork() and exec() in, say, a shell. But with fork() and exec() one does not need a language or API that can express all those things: the host language will do! fork(2) gave Unix's creators the ability to move all that complexity out of kernel-land into user-land, where it's much easier to develop software -- it made them more productive, perhaps much more so. The price Unix's creators paid for that elegance was the need to copy address spaces. Since programs and processes were small back then, that inelegance was easy to overlook. But now processes tend to be huge, and that makes copying even just a parent's resident set, and page table fiddling for the rest, extremely expensive.
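To make the contrast concrete, here is a small sketch of one such "thing between fork() and exec()": redirecting the child's stdout. The function names spawn_fork() and spawn_posix() are invented for this illustration. With fork() the redirection is plain C in the child; with posix_spawn() the same step has to be described through the file-actions mini-API.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* fork()/exec(): the redirection is ordinary code run in the child. */
pid_t spawn_fork(char *const argv[], int out_fd)
{
    pid_t pid = fork();

    if (pid == 0) {
        if (dup2(out_fd, STDOUT_FILENO) != -1)
            execv(argv[0], argv);
        _exit(127);                 /* dup2() or exec failed */
    }
    return pid;                     /* -1 if fork() failed */
}

/* posix_spawn(): the same redirection, expressed as a file action. */
pid_t spawn_posix(char *const argv[], int out_fd)
{
    posix_spawn_file_actions_t fa;
    pid_t pid = -1;

    if (posix_spawn_file_actions_init(&fa) != 0)
        return -1;
    if (posix_spawn_file_actions_adddup2(&fa, out_fd, STDOUT_FILENO) != 0 ||
        posix_spawn(&pid, argv[0], &fa, NULL, argv, environ) != 0)
        pid = -1;
    posix_spawn_file_actions_destroy(&fa);
    return pid;
}
```

Every additional kind of child-side setup (signal masks, process groups, working directory, and so on) needs its own attribute or file action in the second form, whereas the first form just grows another line of C.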

But vfork() has all that elegance, and none of the downsides of fork()!

vfork() does have one downside: that the parent (specifically: the thread in the parent that calls vfork()) and child share a stack, necessitating that the parent (thread) be stopped until the child exec()s or _exit()s. (This can be forgiven due to vfork(2)'s long preceding threads -- when threads came along the need for a separate stack for each new thread became utterly clear and unavoidable. The fix for threading was to use a new stack for the new thread and use a callback function and argument as the main()-alike for that new stack.) But blocking is bad because synchronous behavior is bad, especially when it's the only option yet it could have been better. An asynchronous version of vfork() would have to run the child in a new/alternate stack. Let's call it afork(), or avfork(). Now, afork() would have to look a lot like pthread_create(): it has to take a function to call on a new stack, as well as an argument to pass to that function.

I should mention that all the vfork() man pages I've seen say that the parent process is stopped until the child exits/execs, but this predates threads. Linux, for example, only stops the one thread in the parent that called vfork(), not all threads. I believe that is the correct thing to do, but IIRC other OSes stop all threads in the parent process (which is a mistake, IMO).

An afork() would allow a popen()-like API to return very quickly with appropriate pipes for I/O with the child(ren). If anything goes wrong on the child side then the child(ren) will exit and their output pipe (if any) will evince EOF, and/or writes to the child's input will get EPIPE and/or will raise SIGPIPE, at which point the caller of popen() will be able to check for errors.
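The error-reporting half of that idea can be sketched with today's primitives. The sketch below uses vfork() as a stand-in for the hypothetical afork() (so the caller still blocks briefly), and the names spawn_reader() and drain_and_reap() are invented here. The point is only that a failed child shows up to the caller as immediate EOF on the output pipe, after which the exit status explains what happened.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start argv[0] with its stdout connected to a pipe.
 * Returns the read end (and the PID via *pidp), or -1 on error. */
int spawn_reader(char *const argv[], pid_t *pidp)
{
    int p[2];
    pid_t pid;

    if (pipe2(p, O_CLOEXEC) == -1)
        return -1;
    pid = vfork();
    if (pid == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        /* dup2() leaves close-on-exec clear on the new descriptor. */
        if (dup2(p[1], STDOUT_FILENO) != -1)
            execv(argv[0], argv);
        _exit(127);                 /* the caller will just see EOF */
    }
    close(p[1]);
    *pidp = pid;
    return p[0];
}

/* Read until EOF, counting bytes, then reap the child.
 * Returns the exit status, or -1 on error. */
int drain_and_reap(int fd, pid_t pid, size_t *nbytes)
{
    char buf[256];
    ssize_t n;
    int status;

    *nbytes = 0;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        *nbytes += (size_t)n;
    close(fd);
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A child that fails to exec produces zero bytes followed by EOF and an exit status of 127; a working child produces its output and then EOF.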

One might as well borrow the Illumos forkx()/vforkx() flags, and make afork() look like this:

pid_t afork(int (*start_routine)(void *), void *arg);
pid_t aforkx(int flags /* FORK_NOSIGCHLD and/or FORK_WAITPID */, int (*fn)(void *), void *arg);

It turns out that afork() is easy to implement on Linux: it's just a clone(<function>, <stack>, CLONE_VM | CLONE_SETTLS, <argument>) call. (One might want to request that SIGCHLD be sent to the parent when the child dies, but this is decidedly not desirable in a popen() implementation, as otherwise the program might reap it before pclose() can reap it. For more on this go look at Illumos.)
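A rough, Linux-only sketch of that clone(2) call follows. The names afork_sketch(), afork_reap() and exec_child(), and the 64 KiB stack size, are assumptions made for illustration. The sketch also omits CLONE_SETTLS (which needs a TLS descriptor argument), so the child shares the calling thread's TLS, errno included, and must restrict itself to exec and _exit().

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define AFORK_STACK_SIZE (64 * 1024)    /* assumption: plenty for exec */

/* Run fn(arg) in a child that shares our address space but has its own
 * stack, without blocking the caller.  *stackp must stay allocated
 * until the child has exec'ed or exited. */
pid_t afork_sketch(int (*fn)(void *), void *arg, void **stackp)
{
    char *stack = malloc(AFORK_STACK_SIZE);
    pid_t pid;

    if (stack == NULL)
        return -1;
    /* Stacks grow down on the usual Linux targets: pass the top. */
    pid = clone(fn, stack + AFORK_STACK_SIZE, CLONE_VM, arg);
    if (pid == -1) {
        free(stack);
        stack = NULL;
    }
    *stackp = stack;
    return pid;
}

/* Reap the child and release its stack; returns the exit status.
 * The child was created without an exit signal, so use the
 * Linux-specific __WALL to wait for it whatever kind of child it is. */
int afork_reap(pid_t pid, void *stack)
{
    int status;

    if (waitpid(pid, &status, __WALL) == -1)
        return -1;
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example start routine: arg is a NULL-terminated argv. */
int exec_child(void *arg)
{
    char *const *argv = arg;

    execv(argv[0], argv);
    _exit(127);
}
```

A caller would do pid = afork_sketch(exec_child, argv, &stack), carry on with its own work, and later call afork_reap(pid, stack). Anything the child reads (here, argv) has to stay valid until the child has exec'ed.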

One can also implement something like afork() (minus the Illumos forkx() flags) on POSIX systems by using pthread_create() to start a thread that will block in vfork() while the afork() caller continues its business. Add a taskq to pre-create as many such worker threads as needed, and you'll have a fast afork(). However, an afork() implemented this way won't be able to return a PID unless the threads in the taskq pre-vfork (good idea!); instead it would need a completion callback, something like this:

int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /* may be NULL */);
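A minimal sketch of that callback-based variant, without the taskq: each call simply starts a detached helper thread that blocks in vfork(). The struct afork_req record and the example pieces at the bottom (afork_done_pipe, afork_report_pid(), afork_exit_3()) are invented for illustration, and the sketch assumes vfork() stops only the calling thread, as on Linux.

```c
#define _DEFAULT_SOURCE
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>               /* for callers that reap the child */
#include <unistd.h>

struct afork_req {                  /* hypothetical request record */
    int (*start_routine)(void *);
    void *arg;
    void (*cb)(pid_t);              /* may be NULL */
};

static void *afork_thread(void *p)
{
    struct afork_req *req = p;
    pid_t pid = vfork();            /* blocks only this helper thread */

    if (pid == 0)
        _exit(req->start_routine(req->arg));    /* child side */
    if (req->cb != NULL)
        req->cb(pid);               /* pid is -1 if vfork() failed */
    free(req);
    return NULL;
}

int emulated_afork(int (*start_routine)(void *), void *arg,
                   void (*cb)(pid_t))
{
    struct afork_req *req = malloc(sizeof(*req));
    pthread_t t;

    if (req == NULL)
        return -1;
    req->start_routine = start_routine;
    req->arg = arg;
    req->cb = cb;
    if (pthread_create(&t, NULL, afork_thread, req) != 0) {
        free(req);
        return -1;
    }
    pthread_detach(t);
    return 0;                       /* the PID arrives later, via cb */
}

/* Example pieces for a caller: the callback hands the PID over a pipe. */
int afork_done_pipe[2];

void afork_report_pid(pid_t pid)
{
    if (write(afork_done_pipe[1], &pid, sizeof(pid)) != (ssize_t)sizeof(pid))
        abort();
}

int afork_exit_3(void *arg)
{
    (void)arg;
    return 3;                       /* becomes the child's exit status */
}
```

A caller would create afork_done_pipe with pipe(), call emulated_afork(afork_exit_3, NULL, afork_report_pid), and later read the PID from the pipe and waitpid() on it. Creating a thread per call is the slow part; that is what the pre-created taskq workers are meant to avoid.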

If the threads pre-vfork, then a PID-returning afork() can be implemented, though communicating a task to a pre-vforked thread might be tricky: pthread_cond_wait() might not work in the child, so one would have to use a pipe into which to write a pointer to the dispatched request. (Pipes are safe to use on the child side of vfork(). That is, read() and write() on pipes are safe in the child of vfork().) Here's how that would work:

// This only works if vfork() only stops the one thread in the
// parent that called vfork(), not all threads.  E.g., as on Linux.
// Otherwise this fails outright and there is no way to implement
// avfork().  Of course, on Linux one can just use clone(2).

static struct avfork_taskq_s { /* elided */ ... } *avfork_taskq;

static void
avfork_taskq_init(void)
{
    // Elided, left as exercise for the reader
}

// Other taskq functions called below also elided here

// taskq thread create start routine
static void *
worker_start_routine(void *arg)
{
    struct worker_s *me = arg;
    struct job_s *job;
    pid_t pid;

    // Register the worker and pthread_cond_signal() up to one thread
    // that might be waiting for a worker.
    avfork_taskq_add_worker(avfork_taskq, me);
    do {
        // The pipes live in the worker, since that is where both the
        // child and avfork() look for them.
        if ((job = calloc(1, sizeof(*job))) == NULL ||
            pipe2(me->dispatch_pipe, O_CLOEXEC) == -1 ||
            pipe2(me->ready_pipe, O_CLOEXEC) == -1 ||
            (pid = vfork()) == -1) {
            avfork_taskq_remove(avfork_taskq, me, errno); // We're out!
            break;
        }

        if (pid != 0) {
            // The child exited or exec'ed
            if (job->err) {
                // The child failed to get a job; reap it right away
                (void) waitpid(pid, NULL, 0);
            } else {
                // The child took a job; record it so we can reap it
                // later.
                // This also marks this worker as available and signals
                // up to one thread that might be waiting for a worker.
                avfork_taskq_record_child(avfork_taskq, me, job, pid);
            }
            if (avfork_taskq_too_big_p(avfork_taskq))
                break; // Dynamically shrink the taskq
            continue;
        }

        // This is the child
        // Notice that only read(2), write(2), _exit(2), and the start_routine
        // from the avfork() call are called here.  The avfork() start_routine()
        // should only call async-signal-safe functions and should not call
        // anything that's not safe on the child-side of vfork().  Depending
        // on the OS or C library it may not be possible to use some or any
        // kind of locks, condition variables, allocators, RTLD, etc...  At least
        // dup2(2), close(2), sigaction(2), signal masking functions, exec(2),
        // and _exit(2) are safe to call in start_routine(), and that's enough
        // to implement posix_spawn(), a better popen(), better system(),
        // and so on.
        // Note too that the child does not refer to the taskq at all.
        // (The error field is named err, not errno: errno is a macro.)

        // Get a job
        if (net_read(me->dispatch_pipe[0], &job->descr, sizeof(job->descr)) != sizeof(job->descr)) {
            job->err = errno ? errno : EINVAL;
            _exit(1);
        }
        job->descr->pid = getpid(); // Save the pid where a thread in the parent can see it
        if (net_write(me->ready_pipe[1], "", sizeof("")) != sizeof("")) {
            job->err = errno;
            _exit(1);
        }

        // Do the job
        _exit(job->descr->start_routine(job->descr->arg));
    } while (!avfork_taskq->terminated); // Perhaps this gets set via atexit()

    return NULL;
}

pid_t
avfork(int (*start_routine)(void *), void *arg)
{
    static pthread_once_t once = PTHREAD_ONCE_INIT;
    struct worker_s *worker;
    struct job_descr_s job;
    struct job_descr_s *jobp = &job;
    char c;

    // avfork_taskq_init() is elided here, but one can imagine what it
    // looks like.  It might grow up to N worker threads, and thereafter
    // if there are no available workers then taskq.get_worker() blocks
    // in a pthread_cond_wait() until a worker is ready.
    pthread_once(&once, avfork_taskq_init);

    // Describe the job
    memset(&job, 0, sizeof(job));
    job.start_routine = start_routine;
    job.arg = arg;

    worker = avfork_taskq_get_worker(avfork_taskq); // Lockless when possible; starts a worker if needed

    // Send the worker our job.  If we're lucky, we only wait for an already
    // pre-vfork()ed child to read our job and indicate readiness.  If we're
    // unlucky then the worker we got is busy going through vfork().  Worker
    // threads really don't do much though, so we should usually get lucky.
    // The taskq should be sized so that there isn't too much contention for
    // workers, and to grow dynamically so that at first there are no workers.
    // Perhaps it could grow without bounds when demand is great, then shrink
    // when demand is low (see worker_start_routine()).
    if (net_write(worker->dispatch_pipe[1], &jobp, sizeof(jobp)) != sizeof(jobp) ||
        net_read(worker->ready_pipe[0], &c, sizeof(c)) != sizeof(c))
        job.err = errno ? errno : EINVAL; // err, not errno: errno is a macro

    // Cleanup
    (void) close(worker->dispatch_pipe[0]);
    (void) close(worker->dispatch_pipe[1]);
    (void) close(worker->ready_pipe[0]);
    (void) close(worker->ready_pipe[1]);
    if (job.err)
        return -1;
    return job.pid; // The child stored its PID here before signalling readiness
}

The title also says that clone(2) is stupid. Allow me to address that. clone(2) was originally added as an alternative to proper POSIX threads that could be used to implement POSIX threads. The idea was that lots of variations on fork() would be nice, and as we see here, that's actually true as to avfork()! avfork() was not the motivation, however. A lot of mistakes were made on the way to the NPTL.

Linux should have had a thread creation system call -- it would have then saved itself the pain of the first pthread implementation for Linux. Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via libsocket on top of STREAMS proved to be a very long and costly mistake. Emulating one API from another API with impedance mismatches is difficult at best.

Since then clone() has become a Swiss Army knife -- it has evolved to have zone/jail entering features, but only sort of: Linux doesn't have proper zones/jails, instead adding new "namespaces" and new clone(2) flags to go with them. And as new container-related clone(2) flags are added that old code might wish it had used, one will have to modify and rebuild the clone(2)-calling world, and that is decidedly not elegant.

Linux should have had first-class fork(), vfork(), avfork(), thread_create(), and container_create() system calls. The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). Conflating all of those onto one system call seems a bit much, though even that would be OK if there was just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the clone(2) design, or its maintainers, encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.

Now, my friends tell me, and I read around too, that "nah, containers aren't zones/jails, they're not meant to be used like that", and I don't care about that line of argument. The world needs zones/jails and Linux containers really want to be zones/jails. They do. And zones/jails need to start life maximally isolated, and sharing needs to be added explicitly from the host. Doing it the other way around is badly broken, because every time isolation is increased one has to go patch clone(2) calls. That's not a good approach to security for an OS that is not integrated top-to-bottom (on Linux everything has different maintainers and communities: the kernel, the C libraries, every important system library, the shells, the init system, all the user-land programs one expects -- everything). In a world like that containers need to start maximally isolated.

I could go on. I could talk about fork-safety. I could discuss all of the functions that are generally, or in specific cases, safe to call in a child of fork(), versus the child of vfork(), versus the child of afork() (if we had one), or a child of a clone() call (but I'd have to consider quite a few flag combinations!). I could go into why 4.4BSD removed vfork() (I'd have to do a bit more digging though). I think this post's length is probably just right, so I'll leave it here.



@dsd dsd commented Mar 7, 2018

@nicowilliams thanks for this very interesting writeup. If you have time/interest I would love to read followups on the items you mention, particularly what can and can't be done in the child of afork(). I'm scoping out how to make GNOME glib's process spawning functions do something better than the fork() + misc stuff + exec() which is running into an issue where the fork() fails since Linux does not want to duplicate all the memory space from the parent process.

I also found which is an interesting read highlighting other problems with vfork: signals and a potential privilege dropping race.

From there, I also saw that glibc's posix_spawn was rewritten in 2016 to use clone() with CLONE_VM, in line with one of your suggestions above, while also solving some of the issues described on However it does still block the one parent thread, and the privilege dropping race is left to the programmer to avoid.



@NobodyXu NobodyXu commented Sep 26, 2020

Hi @nicowilliams

Your post on avfork is certainly very interesting to me, so I took my time and implemented aspawn, which does exactly what avfork does, but without setting up TLS; instead it lets the user make syscalls directly via pure_syscall, implemented in my library, which does not use any global or thread-local variables at all.

My aspawn has signature:

struct Stack_t {
    void *addr;
    size_t size;
};

typedef int (*aspawn_fn)(void *arg, int write_end_fd, void *old_sigset, void *user_data, size_t user_data_len);

/**
 * @return fd of read end of CLOEXEC pipe on success, otherwise (-errno).
 *
 * aspawn would disable thread cancellation, then it would revert it before return.
 * aspawn would also mask all signals in parent and reset the signal handler in the child process.
 * Before aspawn returns in parent, it would revert the signal mask.
 * In the function fn, you can only use syscall declared in syscall/syscall.h
 * Use of any glibc function or any function that modifies global/thread-local variable is undefined behavior.
 */
int aspawn(pid_t *pid, struct Stack_t *cached_stack, size_t reserved_stack_sz,
           aspawn_fn fn, void *arg, void *user_data, size_t user_data_len);

By returning the read end of the CLOEXEC pipefd, aspawn lets the user of this library receive error messages and check whether the child process has finished using cached_stack, so that aspawn can reuse cached_stack.

It also allows the user to pass arbitrary data on the stack via user_data and user_data_len, which gets copied onto the top of the stack, so the user does not have to allocate it separately on the heap or risk mistakenly overwriting an object used in the child process.

To use a syscall, you need to include syscall/syscall.h, which defines the syscall routines used by the child process, including find_exe, psys_execve and psys_execveat.

Compared to posix_spawn, aspawn has 3 advantages:

  • aspawn allows the user to do anything in the child process before exec;
  • aspawn can reuse the stack, posix_spawn can't;
  • aspawn doesn't block the parent thread.


@NobodyXu NobodyXu commented Sep 26, 2020

Responsiveness benchmark

Responsiveness comparison between posix_spawn and aspawn, source code (benchmarking is done via google/benchmark):

$ ll -h bench_aspawn_responsiveness.out
-rwxrwxr-x 1 nobodyxu nobodyxu 254K Oct  2 15:02 bench_aspawn_responsiveness.out*

$ uname -a
Linux pop-os 5.4.0-7642-generic #46~1598628707~20.04~040157c-Ubuntu SMP Fri Aug 28 18:02:16 UTC  x86_64 x86_64 x86_64 GNU/Linux

$ ./a.out
Running ./bench_aspawn_responsiveness.out
Run on (12 X 4100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
Load Average: 0.31, 0.36, 0.32
Benchmark                           Time             CPU   Iterations
BM_aspawn_no_reuse              18009 ns        17942 ns        38943
BM_aspawn/threads:1             14500 ns        14446 ns        48339
BM_vfork_with_shared_stack      46545 ns        16554 ns        44027
BM_fork                         54583 ns        54527 ns        12810
BM_posix_spawn                 125061 ns        29091 ns        24483

The column "Time" is measured in terms of system clock, while "CPU" is measured in terms of per-process CPU time.

Throughput benchmark

Since aspawn allows the user to do anything in the vforked child via aspawn_fn, it makes no sense to benchmark how many processes aspawn can create, as that depends on the user-provided argument fn.
