
Concurrency

Threads allow the programmer to express that some evaluations may take place in any order, or simultaneously, without changing the semantics of the program. For example:

(defun example-single ()
  (let ((a (compute1))
        (b (compute2)))
    (values a b)))
(defun example-multi ()
  (let (a b)
    (let ((thread1 (make-thread (lambda () (setf a (compute1)))))
          (thread2 (make-thread (lambda () (setf b (compute2))))))
      (join-thread thread1)
      (join-thread thread2)
      (values a b))))

The example-single and example-multi functions return the same two values - whatever the compute1 and compute2 operators return. However, they have slightly different semantics: in example-single, the evaluation of (compute1) happens before the evaluation of (compute2), while in example-multi the order in which these evaluations occur is left up to the implementation; (compute1) could be evaluated before, after, or even during the evaluation of (compute2). This is the essential nature of concurrency.

Let's examine the idea of things "happening before" other things more closely. In example-single, the definition of the Lisp evaluator (e.g. CLHS 3.1) implies that (compute1) happens before (compute2). In example-multi things are more complex. We can see that the first make-thread happens before the second make-thread, which happens before the first join-thread, which happens before the second join-thread, which happens before the values are returned, according again to the usual evaluation rules. But this doesn't cover (compute1) or (compute2). By the definition of make-thread, we know that the first make-thread call happens before the first thread starts or finishes; as such, the first make-thread happens before (compute1), and similarly with the second make-thread and (compute2). By the definition of join-thread, we know that join-thread will not return until its thread has completed, which means that (compute1) happens before the first join-thread returns, and (compute2) happens before the second join-thread returns.

That is all of the constraints. It is not defined whether (compute1) happens before (compute2), as we'd expect. An implementation without multiprocessing could implement make-thread here simply as funcall, in which case (compute1) would happen before (compute2); or an implementation could reverse the calls so that (compute2) happens before (compute1); or an implementation with multiprocessing could execute them simultaneously, in which case it doesn't make sense to say that one happens before the other at all. The programmer has left this choice up to the implementation. The programmer can still know certain "happens before" relations, however: It is defined that both computations complete before example-multi returns.

This control of ordering is what makes concurrency useful. Concurrent programming in large part consists of deciding what constraints on the "happens before" ordering can and can't be removed.

We have two basic kinds of constraints on the order of what evaluations happen before what other operations: The constraints imposed by the normal evaluation rules within a thread, and the constraints imposed by make-thread and join-thread across threads. We call the former a "sequencing" order, because it's based on the normal sequence of evaluation, and the latter a "synchronization" order, because it imposes some synchronization between threads that does not otherwise exist. Both kinds of constraints are asymmetric (if A is ordered before B, then B is not ordered before A) and transitive. In a program with concurrency, the constraints do not impose a necessary total order: Many distinct "happens before" orderings can be consistent with the same program.

The above leads to the following more formal definitions:

  • An evaluation A is sequenced-before another evaluation B in the same thread iff A is defined to occur before B by the usual Lisp evaluation rules.
  • Some evaluations synchronize-with other evaluations, as defined by the particular semantics of those evaluations; for example, in the above, each make-thread call synchronizes with that thread's computation because that is how make-thread is defined.
  • An evaluation A happens-before another evaluation B if any of the following is true:
    • A is sequenced-before B
    • A synchronizes-with B
    • There is some other evaluation X such that A happens-before X, and X happens-before B (i.e. happens-before is transitive).

Conflicts

So far so good, but this is not enough for much concurrent programming. If compute1 has some side effect that could be visible to compute2 - say, compute1 modifies the value of a global dynamic variable *resource*, which compute2 then uses - we have what is called a data race. The semantics of the program may become unclear.

Let's refine the above example as follows, giving compute1 and compute2 actual (and identical) definitions, and adding one binding and result:

(defun example-single ()
  (let* ((counter 0)
         (a (incf counter))
         (b (incf counter)))
    (values a b counter)))
(defun example-multi ()
  (let* ((counter 0)
         a b
         (thread1 (make-thread (lambda () (setf a (incf counter)))))
         (thread2 (make-thread (lambda () (setf b (incf counter))))))
    (join-thread thread1)
    (join-thread thread2)
    (values a b counter)))

Given the basic evaluation rules, (example-single) returns 1 2 2. example-multi is again more complex. (incf counter) could macroexpand to (setf counter (1+ counter)). We can break that down into three evaluations: First counter, a read of the shared variable; then (1+ ...), computing the sum of the value read and 1; and finally (setf counter ...) which writes the shared variable.
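
Spelled out as code, each thread's (incf counter) amounts to something like the following sketch; the temporary variable is only illustrative:

(let ((temp counter))        ; 1. read the shared variable
  (setf counter (1+ temp)))  ; 2. compute the sum, then 3. write it back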

We know that the read of counter happens before the 1+ within each particular thread, but there is no constraint on the happens-before order between the two threads. For example, we might have the same order as in example-single:

thread 1: read counter
thread 1: 1+
thread 1: setf counter
thread 2: read counter
thread 2: 1+
thread 2: setf counter

in which case the results would be 1 2 2 as in example-single. Or, all of the thread 2 operations could take place before all of the thread 1 operations, in which case the results would be 2 1 2. Less intuitively, we could have the following order:

thread 1: read counter
thread 2: read counter
thread 1: 1+
thread 2: 1+
thread 1: setf counter
thread 2: setf counter

giving us the results 1 1 1! We didn't even manage to increment the counter twice. The programmer might have expected the counter to end up at 2 even if they didn't care about which order the threads' evaluations happened in.

In this scenario, we say that the two writes conflicted. Furthermore, because there is no constraint on the happens-before ordering of the writes, we say that the operations may form a data race - the two threads are "racing" to complete.

More formally:

  • Two accesses to a place conflict if at least one is a write. (As such, a conflict by itself is not a problem.)
  • If two accesses to a place conflict, and neither access happens before the other, there is a data race.

Synchronization

Mutual exclusion/locks

One solution to the above problem is mutual exclusion. If two evaluations are "mutually exclusive", that means they are not allowed to take place simultaneously, even if they occur in different threads. Another phrasing of this is that one evaluation must happen before the other (although which takes place first is unspecified). Mutual exclusion is mediated by the use of locks (sometimes also called mutexes, from "MUTual EXclusion").

Using locks, we could rewrite example-multi from the last section as

(defun example-multi ()
  (let* ((counter 0)
         (lock (make-lock))
         a b
         (thread1 (make-thread (lambda () (setf a (with-lock-held (lock) (incf counter))))))
         (thread2 (make-thread (lambda () (setf b (with-lock-held (lock) (incf counter)))))))
    (join-thread thread1)
    (join-thread thread2)
    (values a b counter)))

Because both with-lock-held operations involve the same lock, one must happen before the other; that is the definition of with-lock-held. That means that the reads and writes can't be interleaved as in the last happens-before ordering in the previous section. Only two orderings are possible: either thread1 increments counter from 0 to 1 and then thread2 increments it from 1 to 2, or thread2 increments it from 0 to 1 and then thread1 increments it from 1 to 2. The possible return values of example-multi are either 1 2 2 or 2 1 2; 1 1 1 is no longer possible.

Acquisition and release

How can we formalize the operation of with-lock-held? We can define it in terms of synchronization operations, just as make-thread and join-thread are defined.

with-lock-held is a macro built from simpler operations. One possible definition of with-lock-held would be (eliding hygiene concerns for simplicity):

(defmacro with-lock-held ((lock) &body body)
  `(progn
     (acquire-lock ,lock)
     (unwind-protect
          (progn ,@body)
       (release-lock ,lock))))

First the lock is acquired, then the body is evaluated, and finally the lock is released. If two threads reach an acquire-lock call at the same time, one succeeds while the other must wait until the acquiring thread releases it with release-lock. This ensures the mutual exclusion we want.

In order for this to work, (successful) acquisition and release operations on any given lock must take place in a definite order, called the synchronization order, which is identical across all threads. The order cannot include an acquisition immediately followed by another acquisition - there must be a release between any two acquisitions (a with-lock-held body evaluation cannot begin while another is still in progress). And clearly the order must be consistent with how the program actually executes, so e.g. a thread can't release a lock until after it acquires it.

To fit this into the understanding of "happens-before" ordering defined above, we can then say that a release synchronizes with the next acquisition in the synchronization order, and that synchronization orders must be consistent with the happens before ordering.

The synchronization order for a lock, that is the order in which the acquisitions and releases of that lock occur, is not completely defined by the program. In our example, the programmer does not care which thread completes its work first, as long as one is finished before the other starts. Multiple synchronization orders for one lock may be consistent with the same program.

Formalizing this:

  • Releases and (successful) acquisitions of any given lock take place in an order consistent across all threads.
  • For any release in the order, the immediately prior acquisition in the order is sequenced before that release, i.e. it took place in the same thread.
  • A release synchronizes with the next acquisition in the lock's synchronization order, if there is one.
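
For the locked version of example-multi above, one possible synchronization order for its lock is:

thread 1: acquire lock
thread 1: release lock
thread 2: acquire lock
thread 2: release lock

Here thread 1's release synchronizes with thread 2's acquisition, so thread 1's increment happens before thread 2's. The other possible order swaps the two threads.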

Atomicity

Threads and locks are sufficient to develop and describe programs in which there are no data races as defined above. That essentially means that if any thread reads a place that is written by another thread, that write must unambiguously take place before or after the read, as enforced by make-thread/join-thread semantics or by mutual exclusion. That's very often adequate, but sometimes finer-grained control would be more useful.

For this to work, we have to have a better idea of what happens when there is a data race, which we haven't explicitly defined. For the example-multi function above, we found that 1 2 2, 2 1 2, or 1 1 1 could be returned, based on possible happens-before orders. To do this we implicitly relied on the atomicity of the read, 1+, and write operations - we treated these operations as if they could not be divided into smaller operations, which could be further rearranged. If this assumption is correct, we can define the possible behaviors of this program completely, despite the data race.

At the same time it's easy to imagine non-atomic accesses. A write to (subseq vector 0 2) can be understood as a write to (aref vector 0) and a write to (aref vector 1). Even if these aref accesses are atomic, the following function could return any of #(1 1), #(2 2), #(1 2), or #(2 1). The latter two values are notably not anything written to the place by any actual setf in the program: they're a joint effort. This is called tearing.

(defun example-tear ()
  (let* ((v (make-array 2))
         (thread1 (make-thread (lambda () (setf (subseq v 0 2) #(1 1)))))
         (thread2 (make-thread (lambda () (setf (subseq v 0 2) #(2 2))))))
    (join-thread thread1)
    (join-thread thread2)
    v))

We can refine our definition of a data race to allow atomic accesses as follows:

  • If two accesses to a place conflict, and neither access happens before the other, and at least one access is not atomic, there is a data race. The effect of evaluating an access that causes a data race is undefined.

Atomic read-modify-write operations

We haven't defined what places or accesses might be atomic. This is essentially defined by the implementation's capabilities. Some machines may be able to perform atomic accesses to various places and others may not be.

But on most modern machines, not only are many writes and reads atomic, some read-modify-write operations can be as well. For example, an implementation may have an atomic-incf available; then we can write the earlier example-multi as

(defun example-multi ()
  (let* ((counter 0)
         a b
         (thread1 (make-thread (lambda () (setf a (atomic-incf counter)))))
         (thread2 (make-thread (lambda () (setf b (atomic-incf counter))))))
    (join-thread thread1)
    (join-thread thread2)
    (values a b counter)))

Because atomic-incf is atomic, we know that the interleaving behavior we found earlier is impossible - one atomic-incf does not start until the other completes. This is the same guarantee we previously obtained with locks. Now (example-multi) can return 1 2 2 or 2 1 2 but not 1 1 1, and we have an atomic counter.

A common general atomic read-modify-write operation is called compare-and-swap. A compare-and-swap reads a place, checks if the value is equivalent to an input "old" value; if it is, an input "new" value is written into the place and the old value is returned; if it's not, no write takes place and the read value is returned. In Lisp (skipping multiple evaluation concerns and hygiene for simplicity):

(defmacro compare-and-swap (place old new)
  `(let ((val ,place))                        ; read the place
     (when (eql ,old val) (setf ,place ,new)) ; write only if the value matched OLD
     val))                                    ; return the value read

An atomic compare-and-swap does all of this atomically. It can be used to implement many other atomic operations, including atomic-incf, an atomic push, etc.
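
For example, here is a sketch of atomic-incf built from a hypothetical atomic-compare-and-swap operator - one that, unlike the illustrative macro above, really performs its read, comparison, and write atomically - again eliding multiple evaluation and hygiene concerns. If another thread writes the place between our read and our compare-and-swap, the compare-and-swap returns something other than the value we read, and we retry:

(defmacro atomic-incf (place)
  `(loop (let* ((old ,place)
                (new (1+ old)))
           ;; Try to install NEW; this succeeds only if PLACE still holds OLD.
           (when (eql old (atomic-compare-and-swap ,place old new))
             (return new)))))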

Sequential inconsistency

This is where it gets complicated.

The above analysis of atomic operations has been making another assumption: That evaluations, or at least atomic accesses, take place in one order that is consistent across all threads. This property, called sequential consistency (because there is a consistent sequence), seems intuitive and basic, but is actually a very strong condition.

For example, consider a somewhat more complicated example program. do-work produces an input and output filename based on its integer input. It then passes these filenames to transform, which reads from the input file and writes to the output file. Once it's complete, it marks its completion in an external vector by storing T into that vector at its index.

(defun do-work (n vector)
  (let ((ifile (format nil "input~d.db" n))
        (ofile (format nil "output~d.db" n)))
    (transform ifile ofile))
  (setf (svref vector n) t)
  (values))

Now, an implementation may decide to optimize this a bit. Say that it somehow knows that transform does not have access to the vector from elsewhere. Since n is just a nonnegative integer and vector is just a vector, and neither of these variables can be mutated by anything transform does, an implementation could essentially rewrite do-work as

(defun do-work (n vector)
  (setf (svref vector n) t)
  (let ... (transform ifile ofile))
  (values))

which could free up registers and stack space, since vector can now be discarded as soon as the element is set, and n can be discarded once the filenames are computed. Assuming transform indeed doesn't access the vector, it is not possible for the programmer to determine that this rewrite occurred - it has the same semantics, behaving as though everything occurs in the obvious sequenced-before order - so it's essentially legitimate.

But now say the programmer wants to parallelize the work. They might write something like

(defun parallel-work (max)
  (let* ((vector (make-array max :initial-element nil))
         (threads (loop for i below max
                        collect (let ((i i))
                                  (make-thread
                                   (lambda ()
                                      (let ((ifile (format nil "input~d.db" i))
                                            (ofile (format nil "output~d.db" i)))
                                       (transform ifile ofile)
                                       (setf (svref vector i) t))))))))
    (loop while (some #'null vector)
          do (sleep 1)
             (display-progress vector))))

This makes one worker thread for each file being transformed. Each worker thread computes the filenames, passes them to transform, and then once the transform is complete it marks its completion in the vector and the thread exits. While the threads are working, the main thread displays the progress so far to the user.

This seems simple enough. However, the implementation may apply the same transformation as in the single-threaded case, so that each worker thread's function becomes (lambda () (setf (svref vector i) t) (let ...)). Now every thread marks completion immediately to save some registers, and the main thread thinks all the worker threads are complete before they actually are. This could cause some pretty serious problems.

Similar optimizations can apply even in cases where a thread does not discard values as here. Consider a similar system where the vector is instead initialized to zero. Each worker thread computes some function of its index, and stores the result in the vector at the index; then, once it and the threads with neighboring indices have completed that, it sums its own result with its neighbors' results and puts the sum in a separate vector. So each worker thread function consists of (neglecting the end indices for simplicity):

(lambda ()
  (let ((cn (compute n)))
    (setf (svref vector1 n) cn)
    (loop while (or (zerop (svref vector1 (1- n)))
                    (zerop (svref vector1 (1+ n)))))
    (setf (svref vector2 n)
          (+ (svref vector1 (1- n))
             (svref vector1 n)
             (svref vector1 (1+ n))))))

The implementation could optimize away the second read of the vector at n, and delay the write until the end:

(lambda ()
  (let ((cn (compute n)))
    (loop while (or (zerop (svref vector1 (1- n)))
                    (zerop (svref vector1 (1+ n)))))
    (setf (svref vector2 n)
          (+ (svref vector1 (1- n))
             cn
             (svref vector1 (1+ n))))
    (setf (svref vector1 n) cn)))

Again, from a single-threaded perspective, this is perfectly legitimate and has the same semantics. With parallelism, however, it's disastrous: each thread waits patiently for its neighbors to write their results before writing its own, and so they all wait forever.

These sorts of optimizations are extremely common on modern computers, and contribute a great deal to their efficiency. Turning them off globally would greatly slow down most programs. Almost no programs actually need sequential consistency for all of their places: only concurrent programs need to have this concern at all, and in those, only a few places (e.g. the svref places above) need to be accessed sequential-consistently. As such, the burden is placed on the programmer to describe to the implementation where sequential consistency is required.
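
For instance, the worker threads in parallel-work could be repaired if the implementation provided a way to mark the svref accesses as sequentially consistent. With a hypothetical atomic-svref accessor - the name is assumed for illustration; real implementations each expose their own operators - each worker function might read:

(lambda ()
  (let ((ifile (format nil "input~d.db" i))
        (ofile (format nil "output~d.db" i)))
    (transform ifile ofile)
    ;; A sequentially consistent write; the implementation may not move
    ;; it above the call to TRANSFORM.
    (setf (atomic-svref vector i) t)))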

Atomic acquisition and release

Above, we dealt with operations acquiring and releasing locks. With an understanding of atomic operations, we can see how locks can be implemented on a more fundamental level. A form of consistency that is weaker than sequential consistency can be used.

Acquisition of a lock can only be done after a release of that lock, unless it's the first time the lock is acquired. Because the release synchronizes with the acquisition, evaluations that happen before the release happen before the acquisition as well, even if they occur in other threads. This means that optimizations of the kind described above must be partly suppressed. Specifically, any writes that happen before (e.g. are sequenced before) the release must be complete when the release is complete, so that an acquiring thread can observe their effects; and any read that the acquisition happens before must not start until the acquisition is complete, so that those reads can observe any effects from the releasing thread. In short, writes cannot be moved later than a release, and reads cannot be moved earlier than an acquisition.

We can imagine implementing locks with a simple binary flag. If the flag is on, the lock is being held; if the flag is off, it's up for grabs. Releasing a lock then consists of marking the flag as off. To acquire a lock, the flag is read; if the flag is on, the acquiring thread waits; otherwise it marks the flag as on and continues. This lock acquisition must be an atomic read-modify-write, to avoid the possibility of another thread marking the flag on in the time between the read and the write.

Releasing the lock - the write to mark the flag as off - must be a "release" operation in the above sense, so that optimizations are appropriately inhibited. When acquiring a lock, the read must be an "acquire" operation in the above sense so as to inhibit optimization, and the write immediately thereafter must be a "release" operation, to prevent another acquisition from being moved across it.
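
Sketched in code, reusing the hypothetical atomic-compare-and-swap from before, and assuming the implementation gives its read acquire semantics and its write release semantics:

(defstruct spinlock (flag nil))

(defun acquire-spinlock (lock)
  ;; Spin until we flip the flag from NIL (free) to T (held). The atomic
  ;; read-modify-write ensures two threads cannot both observe NIL and
  ;; both believe they hold the lock.
  (loop until (null (atomic-compare-and-swap (spinlock-flag lock) nil t))))

(defun release-spinlock (lock)
  ;; This write must be a release operation, so that writes sequenced
  ;; before it become visible to whichever thread acquires the lock next.
  (setf (spinlock-flag lock) nil))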

Now to fit this into the semantics, we just need to describe some synchronizes-with edges. The initial write of a value synchronizes-with (or just happens to be sequenced-before) some acquire operation that reads that value. For any other acquire operation, there is a release operation that writes the value the acquire reads, and that release synchronizes-with that acquire. And that's it. Although acquisitions and releases were originally contemplated for locks, really any atomic read operation can be an acquisition, and any atomic write can be a release. Atomic read-modify-write operations may be both acquisitions (the read) and releases (the write), as with the lock acquisition above.

Sequentially consistent reads and writes are acquire and release operations, respectively. Sequential consistency is a stronger condition.

Fencing

Machines generally implement acquisition and release semantics - suppression of the reordering the CPU usually does - with "fence" instructions. Conceptually, a fence blocks writes or reads from being moved across it, like a fence keeping livestock in the field.

Fences can be fitted into our new semantics as well. A fence can be a release, or an acquisition, or both. A release fence synchronizes with an acquire operation in another thread if that acquire reads the value written by an atomic (not necessarily release!) write that is sequenced after the fence. A release operation synchronizes with an acquire fence if an atomic (not necessarily acquire!) read sequenced before the fence reads the value written by the release. And a release fence synchronizes with an acquire fence if an atomic read sequenced before the acquire fence reads the value written by an atomic write sequenced after the release fence.
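
For example, with hypothetical release-fence and acquire-fence operators, and the hypothetical atomic-svref from before (all names assumed for illustration), the completion flag from parallel-work could be published and observed like this:

;; Worker thread:
(transform ifile ofile)              ; ordinary reads and writes
(release-fence)                      ; writes above may not move below
(setf (atomic-svref vector i) t)     ; atomic, but not itself a release

;; Main thread:
(loop until (atomic-svref vector i)) ; atomic, but not itself an acquire
(acquire-fence)                      ; reads below may not move above
;; Everything the worker did before its release-fence is now visible,
;; because the release fence synchronizes with the acquire fence.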

Relaxed ordering

Sometimes it is useful to have atomic operations that are not synchronization operations. For example, take an atomic counter, as in the example-multi function above. In this case, not only do we not care which increment happens first, we don't care about synchronizing the threads at all; all we want is for the counter to be properly incremented once by each thread. Moving other accesses around the increments is perfectly fine. All that is needed is a modification order for this one particular place, the counter. One atomic increment reads the initial value, and writes one plus that value. The other thread reads that written value, and writes one plus what it reads.

For "relaxed" or "monotonic" operations on a place, there is some consistent total order of reads and writes on that place, called the modification order. Any read sees the value written by whichever write is in the modification order before the read but after all other such writes. Because there is no synchronization, the fact that a relaxed write occurs before a relaxed read in the modification order does not imply that the write actually happens before the read. In fact in general, there is no reason for modification orders on multiple places to be put together into a consistent ordering at all.
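
For the atomic counter in example-multi, one possible modification order for counter is:

initial binding: write 0
thread 1: read 0, write 1
thread 2: read 1, write 2

The other possible order swaps the two threads. Either way, each increment reads the value left by the previous write in the modification order, so the counter reliably ends up at 2 even though no other synchronization takes place.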

Acquisitions, releases, and sequentially consistent reads and writes of a place occur in the modification order in some order consistent with the happens before ordering; these are all stronger conditions than relaxedness.

Unordering

Sometimes it is useful to say that an access is atomic, and nothing else. In this case we say the access is unordered. Within a thread, unordered accesses are like any other accesses - a read sees the value written by whichever write was most recent. But if there's a conflict, a read could see the value written by any write consistent with the happens-before ordering. That is, a read won't see a value written by a write that the read happens before, nor a value from a write that happens before another write that itself happens before the read (such a value has been overwritten), but other than that it could see anything.

Unordered accesses are not really useful for coordinating threads. For example, without further synchronization, it would be permissible for an unordered read to never see a value written by another thread, no matter how much time has passed. This could happen if compiler rewrites, caching, etc. conspire to give a place in a thread its own location never accessed by another thread.

Since they are nonetheless atomic, unordered accesses are a useful safety guarantee. Their behavior in the face of conflicts is weakly defined but defined nonetheless. The programmer can be sure that an unordered read will get some value actually written elsewhere, and not a mutant combination of multiple writes, or uninitialized garbage.
