CSC 2/458, 4-11 Feb. 2008

Moir & Shavit, Herlihy & Shavit


  • combining trees: scale, but poor latency under low load; blocking
  • counting networks: get rid of the blocking problem
  • diffracting trees: basically counting networks that use randomization to reduce depth, for better performance


Correctness (safety) properties

QUIESCENT CONSISTENCY

  • all operations appear to occur in some sequential order
  • nonoverlapping operations appear to occur in real-time order

  Assumptions:
    • each operation accesses a single object

  NB: Operations not separated by quiescence may not occur in program order. E.g., I enqueue x and then y; your dequeue operation overlaps both enqueues, and you come out with y.
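A timeline for that example (my own sketch, not from the readings): the dequeue overlaps both enqueues, so no quiescent point separates them, and quiescent consistency permits the enqueues to take effect out of program order:

    Thread P:  |--enq(x)--|  |--enq(y)--|
    Thread Q:        |---------deq() -> y---------|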

SEQUENTIAL CONSISTENCY

  • all operations appear to occur in some sequential order
  • the order is consistent with each thread's program order

  Assumptions:
    • each operation accesses a single object

  NB: Sequential consistency is considered by both theoreticians and architects. Theoreticians think of operations and objects as potentially complex. Architects think of operations and objects as simple.

LINEARIZABILITY

There exists a sequential history H_s such that

  • H_s is equivalent to (has the same results as) the concurrent history H_c
  • <_c is consistent with (a subset of) <_s

Equivalently:
  • the system is sequentially consistent
  • the sequential order is consistent with real time; i.e., all operations appear to happen between their invocation and response

  Assumptions:
    • each operation accesses a single object

SERIALIZABILITY

Like sequential consistency, except that operations may access an arbitrary number of objects.

NB: The ability to touch multiple objects in arbitrary order means serializability is inherently blocking (unless we introduce the ability to back out and retry).

STRICT SERIALIZABILITY

Like ordinary serializability, except that the sequential order must be consistent with real time. Alternatively, like linearizability, but operations may access an arbitrary number of objects.

NB: Like ordinary serializability, SS is blocking.

                                      QC   SC   L    S    SS
    equiv. to a seq. order             +    +    +    +    +
    respects program order             -    +    +    +    +
    consistent w/ real time            q    -    +    -    +
    op can touch multiple objects      -    -    -    +    +

    (q: only for operations separated by quiescence)

So SS dominates the other 4 in the sense that any history that is SS is also QC, SC, L, and S. Similarly, L dominates QC and SC, and S > SC. But the stronger properties can get in the way of composition and progress:

                                      QC   SC   L    S    SS
    equiv. to a seq. order             +    +    +    +    +
    respects program order             -    +    +    +    +
    consistent w/ real time            q    -    +    -    +
    op can touch multiple objects      -    -    -    +    +
    composes?                          +    -    +    -    -
    nonblocking?                       +    +    +    -    -

[NB: in class on Feb. 4 I misspoke in saying that L and SC are incomparable. They're not. The circular picture I tried to draw is not possible with linearizability. Also, the issue of whether we're talking about a single object or not is a red herring. The differentiating factor is whether a single operation can touch multiple objects or not.]

Study the examples on p. 77 of Herlihy and Shavit. Note, however, that when we use sequential consistency at the memory model level, it does, effectively, respect real time, because it's being applied to the whole system -- there's nothing "external" we can use to "see" reorderings. But if you're a computer architect building memory, making that memory SC system-wide is a challenge precisely because of non-composability!

To see that S and SS are not compositional, consider Fig. 3.9 of Moir & Shavit as a history of two transactions (multi-object operations). The two transactions can't be serialized (or strictly serialized), yet the subhistories for both p and q can. I.e., if the underlying system guarantees that operations on p are serializable, and separately that operations on q are serializable, operations on p & q together may not be. Linearizability composes because (a) it is consistent with real time, and (b) it doesn't let operations touch multiple objects. Both these properties/restrictions are needed.
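If you don't have the figure handy, here is a history with the same flavor (my reconstruction, not the figure itself), using two queues p and q:

    Transaction A:  p.enq(x);  q.enq(x);  p.deq() -> y
    Transaction B:  q.enq(y);  p.enq(y);  q.deq() -> x

Restricted to p alone, the order B, A serializes (p holds y, then x; A's dequeue returns y). Restricted to q alone, the order A, B serializes. But no single order works for both: A before B makes A's dequeue return x, and B before A makes B's dequeue return y.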

==> strict serializability and linearizability are equivalent if we consider all the system's data to be a single object.

Note that anything protected by a single lock is strictly serializable.
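For instance, a minimal pthreads sketch (my example, not from the notes): an operation that touches two objects under one global lock. Critical sections on a single lock are totally ordered in real time, and each operation takes effect entirely within its own critical section, so the history is strictly serializable (though blocking).

    #include <pthread.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
    static int account_a, account_b;        /* two "objects" */

    void transfer(int amount) {
        pthread_mutex_lock(&big_lock);
        account_a -= amount;                /* touches multiple objects, */
        account_b += amount;                /* yet appears atomic to all threads */
        pthread_mutex_unlock(&big_lock);
    }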


Lock-free queues as an example

To be nonblocking, an operation has to have a single instruction at
which it is said to "happen".  We have to ensure that (a) prior to
that point other threads behave as if the operation had not started,
and (b) after that point other threads behave as if the operation
had completed.

CAS version:

    void enqueue(item T, queue *B) {
        qnode *q = new qnode;
        q->t = T;
        q->next = 0;
        qnode *my_tail, *my_next;   // declared outside the loop so the
                                    // cleanup CAS below can use them
        while (1) {
            my_tail = B->tail;
            my_next = my_tail->next;
            if (my_tail == B->tail) {
                // got consistent snapshot
                if (my_next == 0) {
    /*LP*/          if (CAS(&my_tail->next, my_next, q)) {
                        break;
                    }
                } else {
                    // tail is lagging; swing it forward on the
                    // other thread's behalf
                    (void) CAS(&B->tail, my_tail, my_next);
                    /* helping */
                }
            }
        }
        (void) CAS(&B->tail, my_tail, q);
        /* cleanup: swing tail to the node we just linked in */
    }

    bool dequeue(item *T, queue *B) {
        qnode *my_head, *my_next;   // declared outside the loop so the
                                    // delete below can use my_head
        while (1) {
            my_head = B->head;
            qnode *my_tail = B->tail;
            my_next = my_head->next;
            if (my_head == B->head) {
                // got consistent snapshot
                if (my_head != my_tail) {
                    *T = my_next->t;
     /*LP*/         if (CAS(&B->head, my_head, my_next)) {
                        break;
                    }
                } else {
                    if (my_next == 0) {
                        return FALSE;   // queue is empty
                    }
                    // tail is lagging; help swing it forward
                    (void) CAS(&B->tail, my_tail, my_next);
                }
            }
        }
        delete my_head;     // NB: naive reclamation; real code needs an
                            // ABA-safe scheme (e.g., counted pointers)
        return TRUE;
    }
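The code above assumes a CAS(addr, expected, new) primitive that atomically installs new iff *addr still equals expected, returning success. A minimal sketch of such a primitive (mine, not from the notes), using GCC's __sync builtin and the qnode/queue typedefs given under "Preventing caching in registers" below:

    #include <stdbool.h>

    /* Atomically: if (*addr == expected) { *addr = new_val; return true; }
       else return false. */
    static inline bool CAS(qnode *volatile *addr, qnode *expected,
                           qnode *new_val) {
        return __sync_bool_compare_and_swap(addr, expected, new_val);
    }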

In this code, as in most practical examples, the linearization points can be statically identified. That isn't always true. Consider the single-producer, single-consumer queue of Fig. 3.3 in Herlihy & Shavit:

int head = 0, tail = 0
T items[length]
exception full, empty

void enqueue(T x)
    if (tail - head == length) throw full
    items[tail % length] = x
    tail++

T dequeue()
    if (tail - head == 0) throw empty
    T x = items[head % length]
    head++
    return x

Only the producer modifies tail; only the consumer modifies head. Both update items before doing so. Wrap-around of integer precision is safe; subtractions happen on the ring of integers mod 2^wordsize.

Invariants:
  • 0 <= tail - head <= length
  • data (if any) that have been produced but not consumed occupy items[head % length] .. items[(tail-1) % length]

To prove this algorithm is correct and non-blocking, we have to verify the invariants after each individual memory write.

Where does this code linearize? At the increments when the operation succeeds, OR at the read of the second of tail and head when it fails (throws full or empty) -- so the linearization point depends on the outcome, and can't be pinned to one statically chosen instruction.
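A concrete C rendering of the same algorithm (my sketch, not from the readings; assumes a sequentially consistent memory model, one producer, one consumer; it returns false rather than throwing, since C has no exceptions):

    #include <stdbool.h>

    #define LENGTH 16
    typedef int T;                          /* item type, for the sketch */

    static volatile unsigned head = 0, tail = 0;
    static T items[LENGTH];

    bool enqueue(T x) {                     /* producer only */
        if (tail - head == LENGTH) return false;    /* full */
        items[tail % LENGTH] = x;           /* fill the slot... */
        tail++;                             /* ...then publish it */
        return true;
    }

    bool dequeue(T *x) {                    /* consumer only */
        if (tail - head == 0) return false;         /* empty */
        *x = items[head % LENGTH];          /* read the slot... */
        head++;                             /* ...then release it */
        return true;
    }

(Unsigned arithmetic makes the mod-2^wordsize wrap-around well defined in C.)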


Progress conditions

Multiple levels of nonblocking guarantees:

  • wait-free: very strong; generally too expensive -- requires helping
  • lock-free: can be very fast in ad hoc cases
  • obstruction-free: moves progress out-of-band; can be quite simple

All three levels are deadlock-free. Lock-free algorithms are livelock-free. Wait-free algorithms are starvation-free.

Leader election (consensus) with CAS is wait-free. M&S queue is lock-free but not wait-free. Second queue above is wait-free, but ONLY FOR A SINGLE PRODUCER & CONSUMER. There's a natural obstruction-free deque (that isn't lock-free). DSTM is also obstruction-free but not lock-free.
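To see why CAS-based leader election is wait-free, consider this sketch (mine, not from the readings): every thread finishes in a bounded number of its own steps -- a single CAS -- no matter what the other threads do.

    #include <stdbool.h>

    static volatile int leader = -1;    /* -1: no leader chosen yet */

    /* One-shot consensus: the thread whose CAS succeeds is the leader. */
    bool try_elect(int my_id) {
        return __sync_bool_compare_and_swap(&leader, -1, my_id);
    }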

Anything can, in principle, be made wait-free, but the construction is messy (impractical). Intuition:

  • shared /announce/ array of high-level op descriptors, indexed by thread
  • per-object /responses/ array of result info, also indexed by thread
  • before I perform an op on obj. X I scan the two arrays and help any op that hasn't completed yet
  • performing an op involves
    • indirection to root of every object
    • copying the whole thing
    • checking to make sure the copy is consistent
    • creating a new version
    • installing it with CAS
  • lots of messy race conditions. Also ABA.

Coping with relaxed consistency

Whenever memory operations must occur in the order specified in the source, and that order is (a) not forced by sequential data dependences and (b) not required by the HW memory model, we must insert explicit "fence" instructions to force the order we want.

On the SPARC under TSO (total store order) the only dangerous case is a write followed by a read of a different location. We force order here with a "membar #StoreLoad" instruction. On the (much more relaxed) Power4 the hardware is not required to respect any ordering not forced by sequential data dependences, and we may need a lot more fences. There are several different kinds, with incomparable properties (more on this later in the semester).

The code above works if the memory system is sequentially consistent. On other machines, it's non-trivial to figure out where to put the barriers (and what kinds).

For now: as long as you're on the SPARC or x86, the only thing you have to worry about is any case in which you write A and then read B, and the operations HAVE to occur in that order. If they do, put a "membar #StoreLoad" in between. Example:

        A = B = 0
thread 1        thread 2
A = 1           B = 1
C = B           D = A
print C         print D

Don't want to get 0, 0. Such cases are uncommon, but not unheard-of.
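Concretely, the fix is to put the fence between each thread's store and its load (a sketch, mine; the membar mnemonic is SPARC's -- on x86 the analogous instruction is mfence):

    #define STORE_LOAD_FENCE() \
        __asm__ __volatile__("membar #StoreLoad" ::: "memory")

    volatile int A = 0, B = 0;
    int C, D;

    void thread1(void) {
        A = 1;
        STORE_LOAD_FENCE();     /* order the store to A before the load of B */
        C = B;
    }

    void thread2(void) {
        B = 1;
        STORE_LOAD_FENCE();     /* order the store to B before the load of A */
        D = A;
    }

With both fences in place, whichever load happens second in real time must see the other thread's store, so (C, D) = (0, 0) becomes impossible.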


Preventing caching in registers

Any variable whose value may change due to action outside the current thread must be declared 'volatile'. This prevents the compiler from caching it in a register. Basically it's accessed in memory every time it's mentioned in the program. More precisely, in ISO C it's accessed in memory at every "sequence point" in the code, which is roughly every top-level expression.

typedef struct qnode {
    item t;
    struct qnode *volatile next;
} qnode;

typedef struct {
    qnode *volatile head;
    qnode *volatile tail;
} queue;

Note the difference between volatile foo *p; and foo *volatile p;
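That is (my elaboration of the note, with a hypothetical type foo):

    typedef struct { int x; } foo;

    volatile foo *p1;   /* pointer TO volatile data: *p1 is re-fetched from
                           memory at each use, but p1 may be register-cached */
    foo *volatile p2;   /* volatile POINTER: p2 is re-fetched at each use,
                           but the data *p2 may be register-cached */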

NB: ISO/ANSI C 'volatile' in and of itself actually does not suffice to guarantee thread safety. ISO/ANSI C doesn't address threads at all! 'Volatile' is meant for I/O registers: the compiler can't keep the variable in registers across sequence points, but it can reorder accesses, subject to sequential data dependences. Many compilers (including gcc) decline to reorder, in which case you're ok.

ALSO: POSIX-compliant C compilers do understand threads, but only POSIX threads. If you use (only) pthread mutexes and barriers, you don't need 'volatile' at all.

Unfortunately, if you want to use non-POSIX synchronization, which isn't recognized by the compiler, you need 'volatile' (with extended semantics). This inhibits important compiler optimizations. The usual programming idiom is to access shared data sparingly: read and write to/from local variables once per critical section or barrier; do all actual computation in local variables whenever possible. NOT an ideal solution.
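The idiom, sketched (my example; it uses a CAS-based spin lock, since this is precisely the non-POSIX case where 'volatile' is needed, and it relies on the compiler declining to reorder volatile accesses, as noted above):

    static volatile int lock = 0;
    static volatile int shared_count = 0;

    void add_work(int n) {
        while (!__sync_bool_compare_and_swap(&lock, 0, 1))
            ;                           /* spin: acquire the lock */
        int local = shared_count;       /* read shared data once */
        local += n;                     /* compute on the local copy */
        shared_count = local;           /* write back once */
        lock = 0;                       /* release */
    }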
