fork() is evil; vfork() is goodness; afork() would be better; clone() is stupid

I recently happened upon a very interesting implementation of popen() (different API, same idea) called popen-noshell using clone(2), and so I opened an issue requesting use of vfork(2) or posix_spawn() for portability. It turns out that on Linux there's an important advantage to using clone(2). I think I should capture the things I wrote there in a better place. A gist, a blog, whatever.

This is not a paper. I assume reader familiarity with fork() in particular and Unix in general, though, of course, I link to relevant wiki pages, so if the unfamiliar reader is willing to go down the rabbit hole, they should be able to come out far more knowledgeable on these topics.

This gist got posted on Hacker News and was on the front page for a few hours, and there is a lot of interesting commentary there. And, yes, the topic of vfork(2) is always rather controversial -- readers should know that there are those who strongly disagree with the take I put forth in this gist.

Microsoft published a very relevant paper on this topic, A Fork in the Road a couple of years after I wrote this gist. I recommend it. It too was discussed on HN.

Some additional links I've found that might be of interest to readers:

So here goes.

Long ago, I, like many Unix fans, thought that fork(2) and the fork-exec process spawning model were the greatest thing, and that Windows sucked for only having exec*() and _spawn*(), the latter being a Windows-ism.

After many years of experience, I learned that fork(2) is in fact evil. And vfork(2), long said to be evil, is in fact goodness. A slight variant of vfork(2) that avoids the need to block the parent would be even better (see below).

Extraordinary statements require explanation, so allow me to explain.

I won't bother explaining what fork(2) is -- if you're reading this, I assume you know, but if not, see the linked Wikipedia page. But I will explain vfork(2) and why it was said to be harmful. vfork(2) is very similar to fork(2), but the new process it creates runs in the same address space as the parent, as if it were a thread, even sharing the same stack as the thread that called vfork(2)! Since two threads can't really share a stack, the parent is stopped while the child does its thing: either exec*(2) or _exit(2). And because the stack really is shared, the child has to be careful not to corrupt it for the parent.
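
A minimal sketch of the classic vfork()-then-exec pattern (path and arguments illustrative): the child borrows the parent's stack, so it must do nothing but exec or _exit.

#include <unistd.h>

extern char **environ;

pid_t
spawn_ls(void)
{
    char *const argv[] = { "ls", "-l", NULL };
    pid_t pid = vfork();

    if (pid == 0) {
        /* Child: only exec or _exit; anything else risks corrupting
         * the parent's stack and state. */
        (void) execve("/bin/ls", argv, environ);
        _exit(127); /* exec failed; do NOT call exit(3) here */
    }
    return pid; /* the parent resumes only once the child execs or exits */
}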

Now, 3BSD added vfork(2), and a few years later 4.4BSD removed it, as it was by then considered harmful. (I cannot find the link to the article or paper from the 80s that declared vfork(2) harmful, but I swear I remember seeing it. I would appreciate a link.) Most subsequent man pages say as much. But the derivatives of 4.4BSD restored it and do not call it harmful. There's a reason for this: vfork(2) is much cheaper than fork(2) -- much, much, much cheaper. That's because fork(2) has to either copy the parent's address space or arrange for copy-on-write (CoW), and CoW is supposed to be an optimization to avoid unnecessary copies. But even CoW is very expensive because it requires modifying memory mappings, doing TLB shootdowns if the parent is multi-threaded, taking expensive page faults, and so on. Modern kernels tend to seed the child with a copy of the parent's resident set, but if the parent has a large memory footprint (e.g., is a JVM), then the RSS will be huge. So fork(2) is inescapably expensive, except for small programs with small footprints (e.g., a shell).

So you begin to see why fork(2) is evil. And I haven't yet gotten to fork-safety perils! Fork-safety considerations are a lot like thread-safety, but it is harder to make libraries fork-safe than thread-safe. I'm not going to go into fork-safety here: it's not necessary.

Before I go on I should admit to hypocrisy: I do write code that uses fork(2), often for multi-processing daemons -- as opposed to multi-threaded ones, though I often write those as well. But the forks there happen very early on, when nothing fork-unsafe has happened yet and the address space is small, thus avoiding most of the evils of fork(2). vfork(2) cannot be used for this purpose. On Windows one would have to use CreateProcess() or _spawn() to implement multi-processed daemons, which is a huge pain in the neck.

Why did I ever think fork(2) was elegant then? For the same reason that everyone else did and does: CreateProcess*(), _spawn(), posix_spawn(), and related APIs are extremely complex. They have to be, because there is an enormous number of things one might do between fork() and exec() in, say, a shell. That complexity makes fork()+exec() look good. With fork() and exec() one does not need a language or API that can express all those things: the host language will do! fork(2) gave Unix's creators the ability to move all that complexity out of kernel-land into user-land, where it's much easier to develop software -- it made them more productive, perhaps much more so. The price Unix's creators paid for that elegance was the need to copy address spaces. Since back then programs and processes were small, that inelegance was easy to overlook or ignore. But now processes tend to be huge and multi-threaded, and that makes copying even just a parent's resident set, and the page table fiddling for the rest, extremely expensive.

But vfork() has all that elegance, and none of the downsides of fork()!

vfork() does have one downside: the parent (specifically: the thread in the parent that calls vfork()) and the child share a stack, necessitating that the parent (thread) be stopped until the child exec()s or _exit()s. (This can be forgiven because vfork(2) long precedes threads -- when threads came along, the need for a separate stack for each new thread became utterly clear and unavoidable. The fix for threading was to give the new thread a new stack and use a callback function and argument as the main()-alike for that new stack.) But blocking is bad because synchronous behavior is bad, especially when vfork(2) (or clone(2) used like vfork(2)) is the only performant alternative to fork(2). It could have been better.

An asynchronous version of vfork(2) would have to run the child in a new/alternate stack, much like a thread. Let's call it afork(), or maybe avfork(). Now, afork() would have to look a lot like pthread_create(): it has to take a function to call on a new stack, as well as an argument to pass to that function.

I should mention that all the vfork(2) man pages I've seen say that the parent process is stopped until the child exits/execs, but this text predates threads. Linux, for example, stops only the one thread in the parent that called vfork(), not all threads. I believe that is the correct thing to do, but IIRC other OSes stop all threads in the parent process (which is a mistake, IMO).

Some years ago I successfully talked NetBSD developers out of making vfork(2) stop all threads in the parent.

An afork() would allow a popen()-like API to return very quickly with appropriate pipes for I/O with the child(ren). If anything goes wrong on the child side then the child(ren) will exit and their output pipe (if any) will evince EOF, and/or writes to the child's input will get EPIPE and/or raise SIGPIPE, at which point the caller of popen() will be able to check for errors.

One might as well borrow the Illumos forkx()/vforkx() flags, and make afork() look like this:

pid_t afork(int (*start_routine)(void *), void *arg);
pid_t aforkx(int flags /* FORK_NOSIGCHLD and/or FORK_WAITPID */, int (*fn)(void *), void *arg);

It turns out that afork() is easy to implement on Linux: it's just a clone(<function>, <stack>, CLONE_VM | CLONE_SETTLS, <argument>) call. (One might want to request that SIGCHLD be sent to the parent when the child dies, but this is decidedly not desirable in a popen() implementation: the program might reap the child before pclose() can, and then pclose() could not return the correct result. For more on this go look at Illumos.)
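
Here's a minimal sketch of such a clone(2)-based afork() (TLS setup via CLONE_SETTLS is elided for brevity, and the child's stack is leaked, since freeing it safely requires knowing that the child has exec'd or exited):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

#define AFORK_STACK_SIZE (256 * 1024)

pid_t
afork(int (*start_routine)(void *), void *arg)
{
    /* Give the child its own stack, as one would a thread. */
    char *stack = mmap(NULL, AFORK_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

    if (stack == MAP_FAILED)
        return -1;
    /* CLONE_VM shares the address space as vfork() would, but without
     * CLONE_VFORK the parent keeps running.  SIGCHLD makes the child
     * reapable with waitpid(); drop it for a popen()-style API (see
     * above). */
    return clone(start_routine, stack + AFORK_STACK_SIZE,
                 CLONE_VM | SIGCHLD, arg);
}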

See the comments on this gist. In particular, see @NobodyXu's comment about his aspawn!

One can also implement something like afork() (minus the Illumos forkx() flags) on POSIX systems by using pthread_create() to start a thread that will block in vfork() while the afork() caller continues its business. Add a taskq to pre-create as many such worker threads as needed, and you'll have a very fast afork(). However, an afork() implemented this way won't be able to return a PID unless the threads in the taskq pre-vfork (a good idea!); instead it would need a completion callback, something like this:

int emulated_afork(int (*start_routine)(void *), void *arg, void (*cb)(pid_t) /* may be NULL */);

If the threads pre-vfork, then a PID-returning afork() can be implemented, though communicating a task to a pre-vforked thread might be tricky: pthread_cond_wait() might not work in the child, so one would have to use a pipe into which to write a pointer to the dispatched request. (Pipes are safe to use on the child side of vfork(); that is, read() and write() on pipes are safe in the child of vfork().) Here's how that would work:

// This only works if vfork() only stops the one thread in the
// parent that called vfork(), not all threads.  E.g., as on Linux.
// Otherwise this fails outright and there is no way to implement
// avfork().  Of course, on Linux one can just use clone(2).

static struct avfork_taskq_s { /* elided */ ... } *avfork_taskq;

static void
avfork_taskq_init(void)
{
    // Elided, left as exercise for the reader
    ...
}
// Other taskq functions called below also elided here

// taskq thread create start routine
static void *
worker_start_routine(void *arg)
{
    struct worker_s *me = arg;
    struct job_s *job;
    pid_t pid;

    // Register the worker and pthread_cond_signal() up to one thread
    // that might be waiting for a worker.
    avfork_taskq_add_worker(avfork_taskq, me);
    do {
        if ((job = calloc(1, sizeof(*job))) == NULL ||
            pipe2(me->dispatch_pipe, O_CLOEXEC) == -1 ||
            pipe2(me->ready_pipe, O_CLOEXEC) == -1 ||
            (pid = vfork()) == -1) {
            avfork_taskq_remove(avfork_taskq, me, errno); // We're out!
            break;
        }
        if (pid != 0) {
            // The child exited or exec'ed
            if (job->err)
                // The child failed to get a job
                reap_child(pid);
            else
                // The child took a job; record it so we can reap it
                // later.
                // This also marks this worker as available and signals
                // up to one thread that might be waiting for a worker.
                avfork_taskq_record_child(avfork_taskq, me, job, pid);
                
            if (avfork_taskq_too_big_p(avfork_taskq))
                break; // Dynamically shrink the taskq

            continue;
        }
        
        // This is the child
        
        // Notice that only read(2), write(2), _exit(2), and the start_routine
        // from the avfork() call are called here.  The avfork() start_routine()
        // should only call async-signal-safe functions and should not call
        // anything that's not safe on the child-side of vfork().  Depending
        // on the OS or C library it may not be possible to use some or any
        // kind of locks, condition variables, allocators, RTLD, etc...  At least
        // dup2(2), close(2), sigaction(2), signal masking functions, exec(2),
        // and _exit(2) are safe to call in start_routine(), and that's enough
        // to implement posix_spawn(), a better popen(), better system(),
        // and so on.
        
        // Note too that the child does not refer to the taskq at all.
        
        // Get a job
        if (net_read(me->dispatch_pipe[0], &job->descr, sizeof(job->descr)) != sizeof(job->descr)) {
            job->err = errno ? errno : EINVAL; // "err", as errno is a macro and can't be a field name
            _exit(1);
        }
        job->descr->pid = getpid(); // Save the pid where a thread in the parent can see it
        if (net_write(me->ready_pipe[1], "", sizeof("")) != sizeof("")) {
            job->err = errno;
            _exit(1);
        }
        
        // Do the job
        _exit(job->descr->start_routine(job->descr->arg));
    } while(!avfork_taskq->terminated); // Perhaps this gets set via atexit()

    return NULL;
}

pid_t
avfork(int (*start_routine)(void *), void *arg)
{
    static pthread_once_t once = PTHREAD_ONCE_INIT;
    struct worker_s *worker;
    struct job_descr_s job;
    struct job_descr_s *jobp = &job;
    char c;
    
    // avfork_taskq_init() is elided here, but one can imagine what it
    // looks like.  It might grow up to N worker threads, and thereafter
    // if there are no available workers then taskq.get_worker() blocks
    // in a pthread_cond_wait() until a worker is ready.
    pthread_once(&once, avfork_taskq_init);
    
    // Describe the job
    memset(&job, 0, sizeof(job));
    job.start_routine = start_routine;
    job.arg = arg;

    worker = avfork_taskq_get_worker(avfork_taskq); // Lockless when possible; starts a worker if needed

    // Send the worker our job.  If we're lucky, we only wait for an already
    // pre-vfork()ed child to read our job and indicate readiness.  If we're
    // unlucky then the worker we got is busy going through vfork().  Worker
    // threads really don't do much though, so we should usually get lucky.
    //
    // The taskq should be sized so that there isn't too much contention for
    // workers, and to grow dynamically so that at first there are no workers.
    // Perhaps it could grow without bounds when demand is great, then shrink
    // when demand is low (see worker_start_routine()).
    if (net_write(worker->dispatch_pipe[1], &jobp, sizeof(jobp)) != sizeof(jobp) ||
        net_read(worker->ready_pipe[0], &c, sizeof(c)) != sizeof(c))
        job.err = errno ? errno : EINVAL;

    // Cleanup
    (void) close(worker->dispatch_pipe[0]);
    (void) close(worker->dispatch_pipe[1]);
    (void) close(worker->ready_pipe[0]);
    (void) close(worker->ready_pipe[1]);
    
    if (job.err)
        return -1;
    return job.pid; // when the ready read returns, the child's PID is in job.pid
}
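
Here's a hypothetical use of the above: a popen()-style spawn of "ls -l" with stdout redirected (names and paths illustrative). The start routine runs in the vfork()ed child, so it calls only functions that are safe there (dup2(2), exec*(2), _exit(2)):

static int pipefd[2];

static int
ls_start_routine(void *arg)
{
    char *const argv[] = { "ls", "-l", (char *)arg, NULL };

    if (dup2(pipefd[1], STDOUT_FILENO) == -1)
        _exit(127);
    (void) execvp("ls", argv);
    _exit(127); /* exec failed */
}

// Caller:
//     if (pipe2(pipefd, O_CLOEXEC) == -1) err(1, "pipe2");
//     pid_t pid = avfork(ls_start_routine, "/tmp");
//     (void) close(pipefd[1]);
//     // ... read the child's output from pipefd[0], then reap pid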

The title also says that clone(2) is stupid. Allow me to address that.

Now, I don't mean to offend. It's not really stupid. Calling it stupid in the title is just a rhetorical device ("made you look!"). A cheap rhetorical device, yes. Not exactly a professional one either. But this is a gist, not a paper, and I never expected it to go viral. In retrospect I should have used a softer word. Then again, if I were to contribute an afork(2) to Linux, I expect much stronger language would be used in discussing my patches -- it's standard operating procedure on the Linux kernel lists, so I expect no Linux kernel developer is offended here!

clone(2) was originally added as a lower-level alternative to proper POSIX threads, one that could be used to implement POSIX threads. It seems to have been inspired by the Plan 9 rfork(2). The idea was that lots of variations on fork() would be nice, and as we see here, there are lots of variations on it (forkx(2), vfork(2), vforkx(2), rfork(2))!

Perhaps Linux should have had a thread creation system call -- Linux might have then saved itself the pain of the first pthread implementation for Linux. (A lot of mistakes were made on the way to the NPTL.) Linux should have learned from Solaris/SVR4, where emulation of BSD sockets via libsocket on top of STREAMS proved to be a mistake that took a long time and much expense to fix. Emulating one API from another API with impedance mismatches is usually difficult at best.

Since then clone(2) has become a swiss army knife -- it has evolved to have zone/jail-entering features, but only sort of: Linux doesn't have proper zones/jails; instead Linux added new clone(2) flags to indicate namespaces that should not be shared with the parent. And as new container-related clone(2) flags are added that old code might wish it had used, one will have to modify and rebuild the clone(2)-calling world, and that is decidedly not elegant.

Linux should have had first-class fork(), vfork(), avfork(), thread_create(), and container_create()-type system calls. The fork family could have been one system call with options, but threads are not processes, and neither are containers (though containers may have processes, and may have a minder/init process). Conflating all of those onto one system call seems a bit much, though even that would be OK if there were just one flag for container entry/start/fork/whatever-metaphor-applies-to-containers. But the clone(2) design encourages a proliferation of flags, which means one must constantly pay attention to the possible need to add new flags at existing call sites.

Now, my friends tell me, and I read around too, that "nah, containers aren't zones/jails, they're not meant to be used like that", and I don't care for that line of argument. The world needs zones/jails and Linux containers really want to be zones/jails. And zones/jails need to start life maximally isolated, and sharing needs to be added explicitly from the host. Doing it the other way around is broken because every time isolation is increased one has to go patch clone(2) calls.

Counterpoint: doing it the Solaris way also requires patching the call sites when new types of namespaces are added, so maybe my argument falls flat. Perhaps zone creation should have a profile-name parameter so that such patching can be applied to configuration files rather than code.

Defaulting to sharing is not a good approach to security for an OS that is not integrated top-to-bottom (on Linux everything has different maintainers and communities: the kernel, the C libraries, every important system library, the shells, the init system, all the user-land programs one expects -- everything). In a world like that, containers need to start maximally isolated -- in my opinion anyways.

I could go on. I could talk about fork-safety. I could discuss all of the functions that are generally, or in specific cases, safe to call in a child of fork(), versus the child of vfork(), versus the child of afork() (if we had one), or a child of a clone() call (but I'd have to consider quite a few flag combinations!). I could go into why 4.4BSD removed vfork() (I'd have to do a bit more digging though). I think this post's length is probably just right, so I'll leave it here.

@dsd commented Mar 7, 2018

@nicowilliams thanks for this very interesting writeup. If you have time/interest I would love to read followups on the items you mention, particularly what can and can't be done in the child of afork(). I'm scoping out how to make GNOME glib's process spawning functions do something better than fork() + misc stuff + exec(), which is running into an issue where the fork() fails because Linux does not want to duplicate all the memory space of the parent process.

I also found https://ewontfix.com/7/ which is an interesting read highlighting other problems with vfork: signals and a potential privilege dropping race.

From there, I also saw that glibc's posix_spawn was rewritten in 2016 to use clone() with CLONE_VM, in line with one of your suggestions above, while also solving some of the issues described on ewontfix.com. However it does still block the one parent thread, and the privilege dropping race is left to the programmer to avoid.

@NobodyXu commented Sep 26, 2020

Hi @nicowilliams

Your post on avfork is certainly very interesting to me, so I took my time and implemented aspawn, which does exactly what avfork does, but without setting up TLS; instead it lets the user make syscalls directly via pure_syscall, implemented in my library, which does not use any global/thread-local variables at all.

My aspawn has this signature:

struct Stack_t {
    void *addr;
    size_t size;
};

typedef int (*aspawn_fn)(void *arg, int write_end_fd, void *old_sigset, void *user_data, size_t user_data_len);

/**
 * @return fd of read end of CLOEXEC pipe if success, otherwise (-errno).
 *
 * aspawn would disable thread cancellation, then it would revert it before return.
 *
 * aspawn would also mask all signals in parent and reset the signal handler in the child process.
 * Before aspawn returns in parent, it would revert the signal mask.
 *
 * In the function fn, you can only use syscall declared in syscall/syscall.h
 * Use of any glibc function or any function that modifies global/thread-local variable is undefined behavior.
 */
int aspawn(pid_t *pid, struct Stack_t *cached_stack, size_t reserved_stack_sz,
           aspawn_fn fn, void *arg, void *user_data, size_t user_data_len);

By returning the read end of the CLOEXEC pipe, the user of this library is able to receive error messages and to check whether the child process is done using cached_stack, so that aspawn can reuse cached_stack.

It also allows the user to pass arbitrary data on the stack via user_data and user_data_len, which gets copied onto the top of the stack, so the user does not have to allocate it separately on the heap or risk mistakenly overwriting an object used in the child process.

To make syscalls, you need to include syscall/syscall.h, which defines the syscall routines usable in the child process, including find_exe, psys_execve and psys_execveat.

Compared to posix_spawn, aspawn has 3 advantages:

  • aspawn allows the user to do anything in the child process before exec;
  • aspawn can reuse the stack, while posix_spawn can't;
  • aspawn doesn't block the parent thread.

@NobodyXu commented Sep 26, 2020

Responsiveness benchmark

Responsiveness comparison between posix_spawn and aspawn, source code (benchmarking is done via google/benchmark):

$ ll -h bench_aspawn_responsiveness.out
-rwxrwxr-x 1 nobodyxu nobodyxu 254K Oct  2 15:02 bench_aspawn_responsiveness.out*

$ uname -a
Linux pop-os 5.4.0-7642-generic #46~1598628707~20.04~040157c-Ubuntu SMP Fri Aug 28 18:02:16 UTC  x86_64 x86_64 x86_64 GNU/Linux

$ ./a.out
2020-10-02T15:02:45+10:00
Running ./bench_aspawn_responsiveness.out
Run on (12 X 4100 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 256 KiB (x6)
  L3 Unified 9216 KiB (x1)
Load Average: 0.31, 0.36, 0.32
---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_aspawn_no_reuse              18009 ns        17942 ns        38943
BM_aspawn/threads:1             14500 ns        14446 ns        48339
BM_vfork_with_shared_stack      46545 ns        16554 ns        44027
BM_fork                         54583 ns        54527 ns        12810
BM_posix_spawn                 125061 ns        29091 ns        24483

The column "Time" is measured in terms of system clock, while "CPU" is measured in terms of per-process CPU time.

Throughput benchmark

Since aspawn allows the user to do anything in the vforked child via aspawn_fn, it makes no sense to benchmark how many processes aspawn can create, as that depends on the user-provided argument fn.

@nicowilliams (Author) commented Dec 3, 2021

@dsd sorry I'm responding so late!

I also found https://ewontfix.com/7/ which is an interesting read highlighting other problems with vfork: signals and a potential privilege dropping race.

https://ewontfix.com/7/ is utter nonsense:

setuid and vfork

Now we get to the worst of it. Threads and vfork allow you to get in a situation where two processes are both sharing memory space and running at the same time. Now, what happens if another thread in the parent calls setuid (or any other privilege-affecting function)? You end up with two processes with different privilege levels running in a shared address space. And this is A Bad Thing.

Yeah, because setuid() is not for threaded programs! If you want to use setuid() and friends in a threaded program then you must fork()/vfork() first. This is well understood. Perhaps there should be some protection against racing of these things (seems difficult), but if you just make sure there are no threads when you use setuid() and friends then there cannot be races.

If you must race these things, you can protect against racing by doing this: first check credentials/privileges, then vfork(), then check that the child has the same credentials/privileges as the parent -- if not, abort.
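
A sketch of that guard (path, argv, and envp are the caller's; illustrative):

pid_t
spawn_checked(const char *path, char *const argv[], char *const envp[])
{
    uid_t expected_euid = geteuid(); /* check credentials first */
    pid_t pid = vfork();

    if (pid == 0) {
        if (geteuid() != expected_euid)
            _exit(127); /* raced with a privilege change; abort */
        (void) execve(path, argv, envp);
        _exit(127);
    }
    return pid;
}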

Signal handlers and vfork

OK, so this is a problem, but fork() also has signal issues. It's not uncommon to have an exec helper that unblocks signals and then execs its arguments, so that signals can be left blocked all the way until exec(). That works for fork() and vfork().

For example, suppose you block signals, fork(), unblock signals (parent and child), and exec on the child side, but a signal arrives before the exec -- now what? There's no guarantee that whatever signal handlers are installed will do anything useful or meaningful, or not dangerous, on the child side of the fork! Well, you could install signal handlers that understand these situations (both! fork() and vfork()), but not if you're forking in a library. Now, you shouldn't be forking in libraries anyways (unless, and maybe even if, the caller understands you're doing that)... And if it's not library code, then you really can make your signal handlers smart enough to handle vfork().

But in any case, the safest thing to do, whether you're fork()ing-to-exec or vfork()ing-to-exec, is to block signals and then exec a helper program that unblocks signals and then execs the real target.
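
A sketch of that pattern; the helper's path is hypothetical:

#include <signal.h>
#include <unistd.h>

pid_t
spawn_via_helper(char *const argv[])
{
    sigset_t all, old;
    pid_t pid;

    /* Block everything so no handler can run in the child while it
     * borrows the parent's stack; the helper unblocks signals and
     * execs the real target. */
    sigfillset(&all);
    (void) pthread_sigmask(SIG_BLOCK, &all, &old);
    if ((pid = vfork()) == 0) {
        (void) execv("/usr/libexec/unblock-and-exec", argv); /* hypothetical */
        _exit(127);
    }
    (void) pthread_sigmask(SIG_SETMASK, &old, NULL);
    return pid;
}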

It'd be really nice to have an exec system call that can be given a signal mask of signals to unblock atomically...

@nicowilliams (Author) commented:

@NobodyXu NICE!!!!

@FrankHB commented Jan 13, 2022

Seems most of them are evil enough.

(BTW, NtCreateProcess is more relevant than the shitty CreateProcess here.)

@nicowilliams (Author) commented Jan 13, 2022

@FrankHB thanks for that link!

It may not surprise you that I disagree with this part:

However, because of the shared address space, vfork() is difficult to use safely [34].

It's not really any harder than threading, and, in fact, it's easier, since there is no concurrency to worry about.

I also don't agree with this:

Although vfork() avoids the cost of cloning the address space, and may help to replace fork where refactoring to use spawn is impractical, in most cases it is better avoided.

The thing is that spawn APIs are not easy to use -because, after all, there are so many variations of custom behaviors to get from a spawn API!- so vfork() has something of an advantage. For example, one may want to start a new tty session (i.e., have the child call setsid(2)), but one may also want the final child to not be a session leader (i.e., [v]fork again and have the intermediate process, the session leader, exit), and so on and on. There's just so much that one might want to do between where vfork(2) returns to the child and where the final exec-or-_exit is done, that capturing all of it in a spawn API is non-trivial, and even capturing a non-trivial subset of it leads to a difficult-to-use API. They even say so themselves:

It is infeasible for a single OS API to give complete control over the initial state of a new process.

Though at least, IMO the POSIX spawn API is nicer than the WIN32 spawn APIs.
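
To make the setsid(2) example above concrete, here's a sketch of the start-a-session-but-don't-lead-it dance (shown with fork() for simplicity; path/argv/envp illustrative; a vfork() version must mind what is safe in the child):

pid_t pid = fork();

if (pid == 0) {
    if (setsid() == (pid_t)-1)  /* child: become session leader */
        _exit(127);
    if (fork() != 0)            /* the intermediate (session leader)... */
        _exit(0);               /* ...exits, so the grandchild is not one */
    (void) execve(path, argv, envp);
    _exit(127);
}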

Then they say:

but clean-slate designs [e.g., 40, 43] have demonstrated an alternative model where system calls that modify per-process state are not constrained to merely the current process, but rather can manipulate any process to which the caller has access. This yields the flexibility and orthogonality of the fork/exec model, without most of its drawbacks:

OK, that works: create a stopped process with no state and then build it up from a debugger-like API. You get the full power of Unix's vfork(2), but in a more elegant (design-wise) way. I suspect the resulting code would be very complex without first building some simple abstractions, but otherwise it's fine. Such abstractions can probably be designed so they can be implemented with vfork(2) anyways, much like posix_spawn*() can be implemented as a library (using vfork(2), clone(2), or even fork(2)) or as system calls -- both have been done, which IMO shows that vfork(2) is still relevant, powerful, and elegant, and that there's room for an "async" variant like what I propose here.

Overall I agree with the paper's thrust.

@jbash commented Feb 28, 2022

The UNIX process creation model is wrong from the get-go. fork-and-exec seemed cute at the time, but it turns out that child processes should not just inherit everything from their parents. That decision has left a 50-year trail of horrible security bugs. Especially, if you want to pass an open FD to a child, you should have to explicitly say you want to do that... but most other process state should also only be inherited by positive choice. Signal handlers??? And the total-default-inheritance problem turned out to be a bad combination with the introduction of setuid programs.

For that matter the UID-based security model itself has also turned out to be a mistake, although a less obvious one and one shared with basically every other system of the day.

And using the word "capability" to describe Linux "capabilities" is outright blasphemy.

@jfmatth commented Feb 28, 2022

I wish more people would do writeups like this, thanks for the gist :)

@ThosM commented Feb 28, 2022

Nascent Linux was heavily influenced by SunOS 3.x. Linus Torvalds modeled the kernel system calls on what the SunOS man pages described. The weirdness started when Sun's fork(2) man page described a planned future expansion. Linus based process spawning on Sun's present and future plans. Thus clone(2) was also born.

So early Linux, circa 0.11 (late 1991), had system calls and manifest constants based on SunOS 3.x. Even the early Linux man pages were directly lifted from SunOS, complete with copyright notices. This was all before Sun Solaris's time. Sadly, SunOS 3.x did not live long enough to see that future expansion.

As for where the Linux C library came from: I suspect it was influenced by glibc, but it was not glibc. It was the purview of another Linux developer, H.J. Lu (who is now an engineer with Intel). The Linux C library was designed to be heavily Posix. For a time it was even written in C++. But the C++ compilers of the day were too immature, so H.J. reverted to using C for the Linux C library.

Early Linux could compile SunOS code without so much as a hiccup. Even 386BSD couldn't match that. The irony is that the fanboys didn't know that. Linux was strongly influenced by BSD via SunOS. However, the fanboys made a false analogy that Linux was the "SysV" enemy. Now, insomuch as the SysV library was rudimentary Posix, maybe the fanboys assumed that Linux was SysV because of Linux's Posix C library (and thus the "enemy").

But at the system call level (i.e., the actual kernel), Linux was a clone of the SunOS version of BSD. It should be noted that the most widely used Unix-like systems today, Linux and *BSD, both have ancestral roots in the Berkeley Software Distribution, albeit through different paths. And both flavors can give a nod to SysV for its influence on Posix and the C libraries.

@da4089 commented Mar 1, 2022

What about Plan9's rfork()? https://man.cat-v.org/plan_9/2/fork

@nicowilliams (Author) commented:

What about Plan9's rfork()? https://man.cat-v.org/plan_9/2/fork

What about it?

@nicowilliams (Author) commented:

I've updated this gist to link to the recent HN thread about it, the recent Microsoft paper about fork(2), the HN thread for that, and some other things, including @NobodyXu's comment above and aspawn.

@yamirui commented Mar 2, 2022

Not sure how I got here, but this is pure nonsense. vfork has no advantage in any scenario over fork, and I mean realistic scenarios, not a scenario from 3200 BC when nobody had yet come up with the idea to employ copy-on-write where it is obviously, objectively useful and isn't slow enough to warrant playing the 4chan meme of being a "real programmer" who is so real that he can't stop implementing bugs in every single line of C code he writes.

@NobodyXu commented Mar 3, 2022

@yamirui Consider the following scenario:

You have a process that allocates a lot of virtual memory (which may or may not be backed by physical memory) and you need to fork repeatedly to create processes.

In this case, fork will become the bottleneck.

There are a few programs that are VM hungry AFAIK:

  • google-chrome
  • JVM

@yamirui commented Mar 5, 2022

@NobodyXu uh oh, not my forks on processes that use 3TB of RAM with 0 relevant tabs open...

Have you considered that the parent process should be the driver and not the one doing the work? In other words, the answer is to simply not fork when you have allocated billions of bytes of memory. You really think I can take it seriously when that's your only argument? That bad programmers, who can't think outside of the box, struggle with the problems they themselves created?

Just so you know, each browser tab is an isolated process for a good reason, and vfork isn't going to cut it; vfork is an irrelevant relic of the past that acts as nothing more than a crutch. If you're creating a browser, at least figure out how a computer works before you decide to connect it to the internet, otherwise you will be making nonsense "bug reports" like this one, which nobody actually cares about and hasn't done anything about for a long time, and making suggestions like

well this idea is so fucking shit so um, how about we create an intermediary executable and probably incur extra overhead I didn't think about just to avoid using fork + exec properly

I never noticed any inefficiency while opening tabs in chrome, and I'm posting this from a machine with a fucking Celeron, find something else to do, like a hobby, or maybe even a job.

@NobodyXu commented Mar 5, 2022

@NobodyXu uh oh, not my forks on processes that use 3TB of RAM with 0 relevant tabs open...

Have you considered that the parent process should be the driver and not the one doing the work? In other words, the answer is to simply not fork when you have allocated billions of bytes of memory. You really think I can take it seriously when that's your only argument? That bad programmers, who can't think outside of the box, struggle with the problems they themselves created?

Just so you know, each browser tab is an isolated process for a good reason, and vfork isn't going to cut it; vfork is an irrelevant relic of the past that acts as nothing more than a crutch. If you're creating a browser, at least figure out how a computer works before you decide to connect it to the internet, otherwise you will be making nonsense "bug reports" like this one, which nobody actually cares about and hasn't done anything about for a long time, and making suggestions like

well this idea is so fucking shit so um, how about we create an intermediary executable and probably incur extra overhead I didn't think about just to avoid using fork + exec properly

I never noticed any inefficiency while opening tabs in chrome, and I'm posting this from a machine with a fucking Celeron, find something else to do, like a hobby, or maybe even a job.

I admit I didn't consider using a driver, which does not allocate much memory, to do the forking. So this indeed isn't a problem for chrome.

But what about the JVM then?

If you write something that fork()s in Java, then you sure would have to pay the cost of duplicating page tables.

A simple Java program that runs an infinite loop:

public class run {
    public static void main(String[] args) {
        while (true) {
        }
    }
}

takes about 6.7GB of virtual memory on my x86-64 Linux machine; real-world programs written in Java would only take more virtual memory and more resident memory.

In order to fix this problem, you either:

  • Create a program for spawning, written in C/C++/Rust
  • Use an external bash program for spawning
  • Use vfork

Of all the options, using vfork is the easiest, which is why openjdk uses posix_spawn to create processes on Linux.

posix_spawn handles all the pain that comes with vfork by providing options to change fds and to execute the specified file in the child. It automatically uses vfork if possible.
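
For example (a hedged sketch; names illustrative), a posix_spawn() spawn with the child's stdout redirected into a pipe looks like this -- the file actions do the dup2() in the child, so no user code runs between fork and exec:

#include <spawn.h>
#include <unistd.h>

extern char **environ;

int
spawn_ls(pid_t *pid, int pipefd[2])
{
    posix_spawn_file_actions_t fa;
    char *const argv[] = { "ls", "-l", NULL };
    int rc;

    posix_spawn_file_actions_init(&fa);
    posix_spawn_file_actions_adddup2(&fa, pipefd[1], STDOUT_FILENO);
    rc = posix_spawnp(pid, "ls", &fa, NULL, argv, environ);
    posix_spawn_file_actions_destroy(&fa);
    return rc; /* 0 on success, else an errno value */
}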

@nicowilliams (Author) commented Mar 8, 2022

uh oh, not my forks on processes that use 3TB of RAM with 0 relevant tabs open...

Have you considered that the parent process should be the driver and not the one doing the work? In other words, the answer is to simply not fork when you have allocated billions of bytes of memory. You really think I can take it seriously when that's your only argument? That bad programmers, who can't think outside of the box, struggle with the problems they themselves created?

People don't start from scratch and design a system correctly. Instead systems grow over many years, and they accumulate warts. And then, when a problem like this gets noticed, it turns out that re-designing the system the way you would do it is just too expensive, but switching to vfork() or to a posix_spawn() that uses vfork() is not expensive at all because it's not a redesign.

well this idea is so fucking shit so um, how about we create an intermediary executable and probably incur extra overhead I didn't think about just to avoid using fork + exec properly

Using vfork() + exec(intermediary executable) is still O(1) compared to fork()+exec()'s O(RSS).

I never noticed any inefficiency while opening tabs in chrome, and I'm posting this from a machine with a fucking Celeron, find something else to do, like a hobby, or maybe even a job.

Let's keep it civil please.

@nicowilliams (Author) commented:

@jbash

The UNIX process creation model is wrong from the get-go. fork-and-exec seemed cute at the time, but it turns out that child processes should not just inherit everything from their parents. That decision has left a 50-year trail of horrible security bugs. Especially, if you want to pass an open FD to a child, you should have to explicitly say you want to do that... but most other process state should also only be inherited by positive choice. Signal handlers??? And the total-default-inheritance problem turned out to be a bad combination with the introduction of setuid programs.

It was a great idea in the mid-70s. It stopped being a great idea. That happens! :)

For that matter the UID-based security model itself has also turned out to be a mistake, although a less obvious one and one shared with basically every other system of the day.

Yes. The Windows NT SID /access token / security descriptor model is much better.

And using the word "capability" to describe Linux "capabilities" is outright blasphemy.

Indeed! Linux "capabilities" are nothing at all like what was called 'capabilities' in the literature up to that point. I wish they'd called them "privileges".

@casper-dik commented:

The reason why we added a specific system call in Solaris 11.4 to implement posix_spawn() is the single-threaded original implementation, which used vfork() and execve(). Even inside libc, vfork() is a difficult customer, as you want to avoid anything that may hurt the parent.

While the performance for a single-threaded application is the same when compared with vfork()/execve(), posix_spawn() as a system call no longer needs to stop any threads, and multiple concurrent calls to spawn are allowed; the only global resource we need to copy is the file descriptors, but because of the combination of fork()/exec() we do not even copy descriptors marked with FD_CLOEXEC, as long as they are not used in the posix_spawn file actions.

This also added a burden to some of the rest of the system: e.g., when a file is opened with O_CLOEXEC you must assign the file descriptor AND set the file descriptor's flags at the same time. In the vfork()/execve() implementation, all threads had been stopped outside of the kernel or at the beginning or end of a system call.

@NobodyXu commented:

@casper-dik This sounds like a very efficient implementation of posix_spawn!
I wish Linux also had this.

@nicowilliams (Author) commented Jul 12, 2022

Hi @casper-dik. Stopping threads does not solve O_CLOEXEC races! To solve that problem you need new system calls like Linux's pipe2(2) and accept4(2), among others, and glibc's stdio fopen() e flag. After all, you don't know whether you're stopping a thread between a call to pipe(2) and the fcntl(2) that follows it.
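
A sketch of the race in question:

int fds[2];

/* Racy: another thread can fork() and exec() between pipe() and the
 * fcntl() calls, leaking both descriptors into that child. */
if (pipe(fds) == 0) {
    (void) fcntl(fds[0], F_SETFD, FD_CLOEXEC);
    (void) fcntl(fds[1], F_SETFD, FD_CLOEXEC);
}

/* Race-free: pipe2() sets O_CLOEXEC atomically at FD birth. */
if (pipe2(fds, O_CLOEXEC) == -1)
    ; /* handle error */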

Also, I successfully stopped NetBSD from making its vfork(2) stop all parent threads. See this e-mail thread: http://gnats.netbsd.org/49017

IMO the reason to want a posix_spawn() system call is to make it faster.

IMO, making posix_spawn() a system call while having vfork(2) stop all parent threads comes across as a not-very-nice way to force developers to use posix_spawn().

In the NetBSD case, the reason given for wanting to make vfork(2) stop all of the parent process's threads was that the man page says it stops "the parent". But that "stops the parent" text predates threads, so there's no reason to assume that all of the parent process's threads must be stopped just because the one thread calling vfork(2) must be.

I'm told that the posix_spawn() system call in NetBSD did not perform as well as the user-land version that used vfork(2), but I don't really know if that's true.

Lastly, the Linux vfork() (really, clone(2) with the CLONE_VFORK flag) does not stop all the parent process's threads, only the one calling it (of course, as it must). In several decades that has not caused anyone any trouble.

@nicowilliams (Author) commented:

I posted the above reply w/o the @casper-dik, and edited to add it. IDK if GH will notify the user in that case, so just in case: @casper-dik.

@nicowilliams (Author) commented:

I suppose one thing that the kernel could do to ameliorate O_CLOEXEC races in code that doesn't use the newer system calls is this: treat each FD as having O_CLOEXEC in the time between a) the system call that created it, and the first of b) the next system call in the same thread, or c) the first system call in any thread referring to the new FD. This should be pretty easy to implement.

I.e., heuristically detect situations in which a thread calling fork(2) or vfork(2) cannot know about a new not-yet-O_CLOEXEC FD created by another.

Assuming no user-land-only synchronization, this should be safe because there's no guarantee that one thread or another would win a race between creating a new FD and forking/vforking.

That said, user-land-only synchronization is absolutely possible, but I seriously doubt there is code where a thread uses user-land-only synchronization primitives to wait for another to produce a new FD. I believe that would be the only case where this scheme fails to be safe, and it's exceedingly unlikely.

@casper-dik commented:

I think you misinterpret my statement about the close-on-exec race condition; I was talking about the kernel. In vfork()/fork(), each thread is sleeping outside a system call; in the in-kernel posix_spawn(2) implementation that was not the case! I noticed that when I recoded popen(3c) to use pipe2(O_CLOEXEC|O_CLOFORK), but that failed because the fd was assigned and then we set the flags. We needed to change that and assign the fd and set the flags at the same time. Before the in-kernel spawn it worked fine, but some assumptions no longer hold.

@nicowilliams (Author) commented:

@casper-dik Ah, OK, that makes sense; thanks for clarifying. The fix is, as you say, an atomic FD-birth-with-O_CLOEXEC operation in kernel-land. vfork(2) should absolutely not stop the parent process's other threads, only the one that called it. On the BSDs and Linux, vfork(2) does not stop all the parent's threads, just the one that called it.
