CSC 2/458, 4-11 Feb. 2008

Moir & Shavit, Herlihy & Shavit


  • combining trees: scale, but poor latency under low load; blocking
  • counting networks: get rid of the blocking problem
  • diffracting trees: basically counting networks that use randomization to reduce depth, for better performance


Correctness (safety) properties

QUIESCENT CONSISTENCY

  • all operations appear to occur in some sequential order
  • nonoverlapping operations appear to occur in real-time order

  Assumptions:
    • each operation accesses a single object

  NB: Operations not separated by quiescence may not occur in program order. E.g., I enqueue x and then y; your dequeue operation overlaps both enqueues, and you come out with y.
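A timeline for that example (my own sketch, not from the readings): the dequeue overlaps both enqueues, so no quiescent point separates them, and quiescent consistency permits the enqueues to take effect out of program order:

    Thread P:  |--enq(x)--|  |--enq(y)--|
    Thread Q:        |---------deq() -> y---------|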

SEQUENTIAL CONSISTENCY

  • all operations appear to occur in some sequential order
  • the order is consistent with each thread's program order

  Assumptions:
    • each operation accesses a single object

  NB: Sequential consistency is considered by both theoreticians and architects. Theoreticians think of operations and objects as potentially complex. Architects think of operations and objects as simple.

LINEARIZABILITY

There exists a sequential history H_s such that

  • H_s is equivalent to (has the same results as) the concurrent history H_c
  • <_c is consistent with (a subset of) <_s

Equivalently:
  • the system is sequentially consistent
  • the sequential order is consistent with real time; i.e., all operations appear to happen between their invocation and response

  Assumptions:
    • each operation accesses a single object

SERIALIZABILITY

Like sequential consistency, except that operations may access an arbitrary number of objects.

NB: The ability to touch multiple objects in arbitrary order means serializability is inherently blocking (unless we introduce the ability to back out and retry).

STRICT SERIALIZABILITY

Like ordinary serializability, except that the sequential order must be consistent with real time. Alternatively, like linearizability, but operations may access an arbitrary number of objects.

NB: Like ordinary serializability, SS is blocking.

                                      QC   SC   L    S    SS
    equiv. to a seq. order             +    +    +    +    +
    respects program order             -    +    +    +    +
    consistent w/ real time            q    -    +    -    +
    op can touch multiple objects      -    -    -    +    +

    (q: only for operations separated by quiescence)

So SS dominates the other 4 in the sense that any history that is SS is also QC, SC, L, and S. Similarly, L dominates QC and SC, and S > SC. But the stronger properties can get in the way of composition and progress:

                                      QC   SC   L    S    SS
    equiv. to a seq. order             +    +    +    +    +
    respects program order             -    +    +    +    +
    consistent w/ real time            q    -    +    -    +
    op can touch multiple objects      -    -    -    +    +
    composes?                          +    -    +    -    -
    nonblocking?                       +    +    +    -    -

[NB: in class on Feb. 4 I misspoke in saying that L and SC are incomparable. They're not. The circular picture I tried to draw is not possible with linearizability. Also, the issue of whether we're talking about a single object or not is a red herring. The differentiating factor is whether a single operation can touch multiple objects or not.]

Study the examples on p. 77 of Herlihy and Shavit. Note, however, that when we use sequential consistency at the memory model level, it does, effectively, respect real time, because it's being applied to the whole system -- there's nothing "external" we can use to "see" reorderings. But if you're a computer architect building memory, making that memory SC system-wide is a challenge precisely because of non-composability!

To see that S and SS are not compositional, consider Fig. 3.9 of Moir & Shavit as a history of two transactions (multi-object operations). The two transactions can't be serialized (or strictly serialized), yet the subhistories for both p and q can. I.e., if the underlying system guarantees that operations on p are serializable, and separately that operations on q are serializable, operations on p & q together may not be. Linearizability composes because (a) it is consistent with real time, and (b) it doesn't let operations touch multiple objects. Both these properties/restrictions are needed.
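If you don't have the figure handy, here is a history with the same flavor (my reconstruction, not the figure itself), using two queues p and q:

    Transaction A:  p.enq(x);  q.enq(x);  p.deq() -> y
    Transaction B:  q.enq(y);  p.enq(y);  q.deq() -> x

Restricted to p alone, the order B, A serializes (p holds y, then x; A's dequeue returns y). Restricted to q alone, the order A, B serializes. But no single order works for both: A before B makes A's dequeue return x, and B before A makes B's dequeue return y.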

==> strict serializability and linearizability are equivalent if we consider all the system's data to be a single object.

Note that anything protected by a single lock is strictly serializable.
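For instance, a minimal pthreads sketch (my example, not from the notes): an operation that touches two objects under one global lock. Critical sections on a single lock are totally ordered in real time, and each operation takes effect entirely within its own critical section, so the history is strictly serializable (though blocking).

    #include <pthread.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
    static int account_a, account_b;        /* two "objects" */

    void transfer(int amount) {
        pthread_mutex_lock(&big_lock);
        account_a -= amount;                /* touches multiple objects, */
        account_b += amount;                /* yet appears atomic to all threads */
        pthread_mutex_unlock(&big_lock);
    }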


Lock-free queues as an example

To be nonblocking, an operation has to have a single instruction at
which it is said to "happen".  We have to ensure that (a) prior to
that point other threads behave as if the operation had not started,
and (b) after that point other threads behave as if the operation
had completed.

CAS version:

    void enqueue(item T, queue *B) {
        qnode *q = new qnode;
        q->t = T;
        q->next = 0;
        qnode *my_tail, *my_next;   // declared outside the loop so the
                                    // cleanup CAS below can use them
        while (1) {
            my_tail = B->tail;
            my_next = my_tail->next;
            if (my_tail == B->tail) {
                // got consistent snapshot
                if (my_next == 0) {
    /*LP*/          if (CAS(&my_tail->next, my_next, q)) {
                        break;
                    }
                } else {
                    // tail is lagging; swing it forward on the
                    // other thread's behalf
                    (void) CAS(&B->tail, my_tail, my_next);
                    /* helping */
                }
            }
        }
        (void) CAS(&B->tail, my_tail, q);
        /* cleanup: swing tail to the node we just linked in */
    }

    bool dequeue(item *T, queue *B) {
        qnode *my_head, *my_next;   // declared outside the loop so the
                                    // delete below can use my_head
        while (1) {
            my_head = B->head;
            qnode *my_tail = B->tail;
            my_next = my_head->next;
            if (my_head == B->head) {
                // got consistent snapshot
                if (my_head != my_tail) {
                    *T = my_next->t;
     /*LP*/         if (CAS(&B->head, my_head, my_next)) {
                        break;
                    }
                } else {
                    if (my_next == 0) {
                        return FALSE;   // queue is empty
                    }
                    // tail is lagging; help swing it forward
                    (void) CAS(&B->tail, my_tail, my_next);
                }
            }
        }
        delete my_head;     // NB: naive reclamation; real code needs an
                            // ABA-safe scheme (e.g., counted pointers)
        return TRUE;
    }
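The code above assumes a CAS(addr, expected, new) primitive that atomically installs new iff *addr still equals expected, returning success. A minimal sketch of such a primitive (mine, not from the notes), using GCC's __sync builtin and the qnode/queue typedefs given under "Preventing caching in registers" below:

    #include <stdbool.h>

    /* Atomically: if (*addr == expected) { *addr = new_val; return true; }
       else return false. */
    static inline bool CAS(qnode *volatile *addr, qnode *expected,
                           qnode *new_val) {
        return __sync_bool_compare_and_swap(addr, expected, new_val);
    }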

In this code, as in most practical examples, the linearization points can be statically identified. That isn't always true. Consider the single-producer, single-consumer queue of Fig. 3.3 in Herlihy & Shavit:

int head = 0, tail = 0
T items[length]
exception full, empty

void enqueue(T x)
    if (tail - head == length) throw full
    items[tail % length] = x
    tail++

T dequeue()
    if (tail - head == 0) throw empty
    T x = items[head % length]
    head++
    return x

Only the producer modifies tail; only the consumer modifies head. Both update items before doing so. Wrap-around of integer precision is safe; subtractions happen on the ring of integers mod 2^wordsize.

Invariants:
  • 0 <= tail - head <= length
  • data (if any) that have been produced but not consumed occupy items[head % length] .. items[(tail-1) % length]

To prove this algorithm is correct and non-blocking, we have to verify the invariants after each individual memory write.

Where does this code linearize? At the increments when the operation succeeds, OR at the read of the second of tail and head when it fails (throws full or empty) -- so the linearization point depends on the outcome, and can't be pinned to one statically chosen instruction.
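A concrete C rendering of the same algorithm (my sketch, not from the readings; assumes a sequentially consistent memory model, one producer, one consumer; it returns false rather than throwing, since C has no exceptions):

    #include <stdbool.h>

    #define LENGTH 16
    typedef int T;                          /* item type, for the sketch */

    static volatile unsigned head = 0, tail = 0;
    static T items[LENGTH];

    bool enqueue(T x) {                     /* producer only */
        if (tail - head == LENGTH) return false;    /* full */
        items[tail % LENGTH] = x;           /* fill the slot... */
        tail++;                             /* ...then publish it */
        return true;
    }

    bool dequeue(T *x) {                    /* consumer only */
        if (tail - head == 0) return false;         /* empty */
        *x = items[head % LENGTH];          /* read the slot... */
        head++;                             /* ...then release it */
        return true;
    }

(Unsigned arithmetic makes the mod-2^wordsize wrap-around well defined in C.)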


Progress conditions

Multiple levels of nonblocking guarantees:

  • wait-free: very strong; generally too expensive -- requires helping
  • lock-free: can be very fast in ad hoc cases
  • obstruction-free: moves progress out-of-band; can be quite simple

All three levels are deadlock-free. Lock-free algorithms are livelock-free. Wait-free algorithms are starvation-free.

Leader election (consensus) with CAS is wait-free. M&S queue is lock-free but not wait-free. Second queue above is wait-free, but ONLY FOR A SINGLE PRODUCER & CONSUMER. There's a natural obstruction-free deque (that isn't lock-free). DSTM is also obstruction-free but not lock-free.
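To see why CAS-based leader election is wait-free, consider this sketch (mine, not from the readings): every thread finishes in a bounded number of its own steps -- a single CAS -- no matter what the other threads do.

    #include <stdbool.h>

    static volatile int leader = -1;    /* -1: no leader chosen yet */

    /* One-shot consensus: the thread whose CAS succeeds is the leader. */
    bool try_elect(int my_id) {
        return __sync_bool_compare_and_swap(&leader, -1, my_id);
    }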

Anything can, in principle, be made wait-free, but the construction is messy (impractical). Intuition:

  • shared /announce/ array of high-level op descriptors, indexed by thread
  • per-object /responses/ array of result info, also indexed by thread
  • before I perform an op on obj. X I scan the two arrays and help any op that hasn't completed yet
  • performing an op involves
    • indirection to root of every object
    • copying the whole thing
    • checking to make sure the copy is consistent
    • creating a new version
    • installing it with CAS
  • lots of messy race conditions. Also ABA.

Coping with relaxed consistency

Whenever memory operations must occur in the order specified in the source, and that order is (a) not forced by sequential data dependences and (b) not required by the HW memory model, we must insert explicit "fence" instructions to force the order we want.

On the SPARC under TSO (total store order) the only dangerous case is a write followed by a read of a different location. We force order here with a "membar #StoreLoad" instruction. On the (much more relaxed) Power4 the hardware is not required to respect any ordering not forced by sequential data dependences, and we may need a lot more fences. There are several different kinds, with incomparable properties (more on this later in the semester).

The code above works if the memory system is sequentially consistent. On other machines, it's non-trivial to figure out where to put the barriers (and what kinds).

For now: as long as you're on the SPARC or x86, the only thing you have to worry about is any case in which you write A and then read B, and the operations HAVE to occur in that order. If they do, put a "membar #StoreLoad" in between. Example:

        A = B = 0
thread 1        thread 2
A = 1           B = 1
C = B           D = A
print C         print D

Don't want to get 0, 0. Such cases are uncommon, but not unheard-of.
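Concretely, the fix is to put the fence between each thread's store and its load (a sketch, mine; the membar mnemonic is SPARC's -- on x86 the analogous instruction is mfence):

    #define STORE_LOAD_FENCE() \
        __asm__ __volatile__("membar #StoreLoad" ::: "memory")

    volatile int A = 0, B = 0;
    int C, D;

    void thread1(void) {
        A = 1;
        STORE_LOAD_FENCE();     /* order the store to A before the load of B */
        C = B;
    }

    void thread2(void) {
        B = 1;
        STORE_LOAD_FENCE();     /* order the store to B before the load of A */
        D = A;
    }

With both fences in place, whichever load happens second in real time must see the other thread's store, so (C, D) = (0, 0) becomes impossible.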


Preventing caching in registers

Any variable whose value may change due to action outside the current thread must be declared 'volatile'. This prevents the compiler from caching it in a register. Basically it's accessed in memory every time it's mentioned in the program. More precisely, in ISO C it's accessed in memory at every "sequence point" in the code, which is roughly every top-level expression.

typedef struct qnode {
    item t;
    struct qnode *volatile next;
} qnode;

typedef struct {
    qnode *volatile head;
    qnode *volatile tail;
} queue;

Note the difference between volatile foo *p; and foo *volatile p;
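That is (my elaboration of the note, with a hypothetical type foo):

    typedef struct { int x; } foo;

    volatile foo *p1;   /* pointer TO volatile data: *p1 is re-fetched from
                           memory at each use, but p1 may be register-cached */
    foo *volatile p2;   /* volatile POINTER: p2 is re-fetched at each use,
                           but the data *p2 may be register-cached */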

NB: ISO/ANSI C 'volatile' in and of itself actually does not suffice to guarantee thread safety. ISO/ANSI C doesn't address threads at all! 'Volatile' is meant for I/O registers: the compiler can't keep the variable in registers across sequence points, but it can reorder accesses, subject to sequential data dependences. Many compilers (including gcc) decline to reorder, in which case you're ok.

ALSO: POSIX-compliant C compilers do understand threads, but only POSIX threads. If you use (only) pthread mutexes and barriers, you don't need 'volatile' at all.

Unfortunately, if you want to use non-POSIX synchronization, which isn't recognized by the compiler, you need 'volatile' (with extended semantics). This inhibits important compiler optimizations. The usual programming idiom is to access shared data sparingly: read and write to/from local variables once per critical section or barrier; do all actual computation in local variables whenever possible. NOT an ideal solution.
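The idiom, sketched (my example; it uses a CAS-based spin lock, since this is precisely the non-POSIX case where 'volatile' is needed, and it relies on the compiler declining to reorder volatile accesses, as noted above):

    static volatile int lock = 0;
    static volatile int shared_count = 0;

    void add_work(int n) {
        while (!__sync_bool_compare_and_swap(&lock, 0, 1))
            ;                           /* spin: acquire the lock */
        int local = shared_count;       /* read shared data once */
        local += n;                     /* compute on the local copy */
        shared_count = local;           /* write back once */
        lock = 0;                       /* release */
    }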
