
k-CAS for sweat-free concurrent programming

Vesa Karvonen

Tarides

(Press 's' for speaker notes, which are for self-study.)

Note: Welcome! This talk is based on my work polishing the OCaml multicore kcas library. It started when Bartosz asked me to review a couple of his PRs on the library, and I got really interested in it, as I have previously implemented prototypes of similar libraries. So, I basically took the liberty to work on kcas. I don't recall having had this much fun programming in a long time, so I hope you'll forgive me for pushing some of my other tasks aside for a moment!


Contents

  1. Lock-free programming with CAS and friends 😰
  2. Sweat-free programming with k-CAS Txs 😌
  3. The lock-free k-CAS algorithm 🤓
  4. New obstruction-free k-CAS-n-CMP algorithm 🤔
  5. Blocking transactions ✋

Note: Here is a quick overview of what I plan to talk about. First I want to give a super quick introduction to lock-free programming with CAS or compare-and-set as a motivation for k-CAS. Then I'll talk about the new transactional API for programming with k-CAS. After that I'll describe the k-CAS algorithm that the library is based on. Next I'll motivate and describe a brand-new algorithm for k-CAS-n-CMP based on the earlier algorithm. For the last part we'll transport to the future. I'll sketch out an idea of how kcas could be extended to support blocking transactions and implement proper STM or Composable Memory Transactions.


1. Lock-free programming with CAS and friends

🤠 or 😰 ?

Note: First I want to talk a little bit about lock-free programming with CAS. Is it the cowboy coder's dream and the professional programmer's nightmare?


Lock-free?

An algorithm is lock-free if, when the program threads are run for a sufficiently long time, at least one of the threads makes progress.

A panacea?

On the Nature of Progress

Note: But first, what does lock-free actually mean? Is it a cure for all diseases? Well, no. Starvation can be a real issue, for example. Also, lock-free algorithms do not necessarily perform better than algorithms using locks. They can cause just as much or even more contention, as they tend to make more, not fewer, accesses to shared memory. I should also mention that so-called wait-free algorithms additionally guarantee freedom from starvation, but tend to come with a heavier price.


Treiber stack (1/2)

type 'a t
val create : unit -> 'a t
val push : 'a t -> 'a -> unit
val try_pop : 'a t -> 'a option
type 'a t = 'a list Atomic.t
let create () = Atomic.make []

Note: Enough with the theory. Let's implement a lock-free stack called the Treiber stack. Here is a completely standard signature for such a stack. You could implement that signature in many ways. We'll implement it using a single atomic location and CAS.


Treiber stack (2/2)

let rec push backoff stk x =
  let xs = Atomic.get stk in
  if not (Atomic.compare_and_set stk xs (x::xs)) then
    push (Backoff.once backoff) stk x

let push stk x = push Backoff.default stk x

let rec try_pop backoff stk =
  match Atomic.get stk with
  | [] -> None
  | x::xs as xxs ->
    if Atomic.compare_and_set stk xxs xs then Some x else
      try_pop (Backoff.once backoff) stk

let try_pop stk = try_pop Backoff.default stk

Why is this lock-free?   Is this faster than using a lock?

Note: push and try_pop demonstrate typical programming patterns using CAS. * First, an Atomic.get is performed to read the current value of the atomic location. * Then an Atomic.compare_and_set is performed to update the location. * If that fails, a backoff is applied to avoid contention and the operation is retried. * Can you tell what makes this lock-free? Well, assuming the CPU guarantees that at least one CAS succeeds when multiple such operations are attempted simultaneously, at least one domain will make progress. As simple as the Treiber stack is, every retry causes costly shared memory traffic. A scalable spin-lock, such as the CLH lock (Craig, Landin, and Hagersten), may actually generate less shared memory traffic and perform better; a minimal sketch follows.
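Since the note contrasts CAS retry loops with a CLH-style spin-lock, here is a minimal sketch of such a lock in OCaml to make the comparison concrete. This is not part of kcas; it assumes OCaml 5's Atomic and Domain.cpu_relax.

(* Each acquirer spins on its predecessor's node only, so the spin
   traffic stays on one cache line per waiter instead of hammering
   the single shared location like a CAS retry loop does. *)
type t = bool Atomic.t Atomic.t  (* tail of the queue of nodes *)

let create () : t = Atomic.make (Atomic.make false)

let acquire t =
  let my = Atomic.make true in  (* true = held or waiting *)
  let pred = Atomic.exchange t my in
  while Atomic.get pred do Domain.cpu_relax () done;
  my  (* the caller passes this node to release *)

let release my = Atomic.set my false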


CAS causes sweating

  • Lack of composability

    • Challenge: Move an element between stacks atomically
  • Makes easy things possible and hard things research problems

  • ABA problem

Note: The Treiber stack is perhaps not the best example to demonstrate how subtle it is to program using CAS. As a challenge, you could try to implement an operation to move an element from one stack to another atomically. Good luck! The truth is that lock-free algorithms using CAS are extremely hard to reason about, which is part of why we also have tools to help with that. I should also mention that there are some known gotchas with CAS, such as the so-called ABA problem, that can bite you e.g. when reusing objects. For lack of time I'm going to skip talking about that.


If CAS doesn't work, you aren't using enough

  • DCAS, RDCSS, DCSS, 1-CAS-n-CMP, ...
    • Special cases e.g. linked list operations
  • k-CAS: [CAS(x,0,1); CAS(y,1,0)]
  • k-CAS-n-CMP
    • Can be simulated with k-CAS as CAS(x,a,a) is equivalent to CMP(x,a) aside from memory traffic.

All of these can be implemented using just plain CAS!

Note: There have been various proposals for extended primitives. For example, 1-CAS-n-CMP can be used to implement operations on linked data structures such as lists and trees. Perhaps the ultimate primitive is k-CAS-n-CMP, which allows any finite number of locations to be updated atomically, while also requiring any finite number of other locations to remain unchanged during the operation.


Move between stacks using 2-CAS

let rec try_move backoff source target =
  match Loc.get source with
  | [] -> false
  | (elem::rest) as old_source ->
    let old_target = Loc.get target in
    let ops = [
      Op.make_cas source old_source rest;
      Op.make_cas target old_target (elem::old_target)
    ] in
    Op.atomically ops
    || try_move (Backoff.once backoff) source target

let try_move source target =
  try_move Backoff.default source target

Note: Recall the challenge about moving an element from one stack to another atomically. Well, with 2-CAS, or double-compare-and-swap, that problem becomes easier. * First we read the values of the two stack locations. * Then we construct a list of the CAS operations. * And attempt to perform them atomically. * Finally we retry with backoff as usual.


But seriously speaking...

The standard Computer Science solution for this kind of problem is to create an abstraction mechanism.

😳

Note: The quote is from the paper Parallel Concurrent ML by Reppy, Russo, and Xiao. Of course, you still need the low-level ability to update locations atomically, but, indeed, is CAS the best programming interface for concurrent programming and, if not, what would be better?


2. Sweat-free programming with k-CAS Txs

😌 ?

Note: I will next introduce one potential abstraction on top of k-CAS. The idea is that instead of writing loops with backoffs that create and attempt lists of CAS operations, we use a transactional API that allows us to read and write shared memory locations and that constructs the lists of operations and retries with backoffs for us automatically.


What is a Tx?

(** Shared memory locations. *)
module Loc : sig
  type 'a t
  val make : 'a -> 'a t
  val get : 'a t -> 'a
  (* ... *)
end
(** Transactions on shared memory locations. *)
module Tx : sig
  type 'a t
  val get : 'a Loc.t -> 'a t
  val set : 'a Loc.t -> 'a -> unit t
  val return : 'a -> 'a t
  val ( let* ) : 'a t -> ('a -> 'b t) -> 'b t
  val attempt : 'a t -> 'a
  val commit : 'a t -> 'a
  (* ... *)
end

Note: But what is a Tx? Let's first take a peek at the signatures of the core operations on transactions. The snippets on this slide show only the key introduction, elimination, and composition forms for shared memory locations and transactions. **


The answer

let a_loc = Loc.make 10
and b_loc = Loc.make 52
and x_loc = Loc.make 0

Tx.(
  commit (
    let* a_val = get a_loc in
    let* b_val = get b_loc in
    set x_loc (b_val - a_val)
  )
)

assert ( Loc.get x_loc == 42 )

Note: Logically speaking, a transaction 'a Tx.t is a composable operation that may read and write shared memory locations and that can be performed atomically by calling commit. Another way to think about transactions is that they are specifications for generating lists of CAS operations against shared memory locations. **
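Concretely, the commit above would end up attempting a list of operations along the lines of the following, with reads becoming value-preserving CASes (part 4 revisits this same pattern):

[ CAS (a_loc, 10, 10); CAS (b_loc, 52, 52); CAS (x_loc, 0, 42) ]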


More Tx combinators

val get : 'a Loc.t -> 'a t
val get_as : ('a -> 'b) -> 'a Loc.t -> 'b t
val set : 'a Loc.t -> 'a -> unit t
val update : 'a Loc.t -> ('a -> 'a) -> 'a t
val update_as : ('a -> 'b) -> 'a Loc.t -> ('a -> 'a) -> 'b t
val modify : 'a Loc.t -> ('a -> 'a) -> unit t
val exchange : 'a Loc.t -> 'a -> 'a t
val exchange_as : ('a -> 'b) -> 'a Loc.t -> 'a -> 'b t
val return : 'a -> 'a t
val ( let* ) : 'a t -> ('a -> 'b t) -> 'b t
val ( and* ) : 'a t -> 'b t -> ('a * 'b) t
val ( let+ ) : 'a t -> ('a -> 'b) -> 'b t
val ( and+ ) : 'a t -> 'b t -> ('a * 'b) t
val ( >> ) : 'ignore t -> 'a t -> 'a t
val ( >>. ) : 'ignore t -> 'a -> 'a t
val ( >>= ) : 'a t -> ('a -> 'b t) -> 'b t
val map : ('a -> 'b) -> 'a t -> 'b t
val delay : (unit -> 'a t) -> 'a t

Note: To make programming more convenient and to also allow some optimizations, the library actually provides more than just the minimum number of primitives and combinators. Most of these should be easy to understand in terms of the types and functional programming tradition. * In update the result value is the previous value from the location. * The _as variants take an additional function that is given the previous value from the location to compute the result. * All of the sequencing combinators can be expressed in terms of return and let*, as sketched below. * The delay operation is important when side-effects, such as allocation of fresh objects, are needed at the start of a transaction. I wish OCaml had special syntax for that.
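A sketch of that claim: the sequencing combinators reduce to return and let* (the real library may well implement them more directly for performance):

let map f tx = let* x = tx in return (f x)
let ( let+ ) tx f = map f tx
let ( and* ) tx1 tx2 = let* x = tx1 in let* y = tx2 in return (x, y)
let ( and+ ) = ( and* )
let ( >>= ) = ( let* )
let ( >> ) tx1 tx2 = let* _ = tx1 in tx2
let ( >>. ) tx1 x = tx1 >> return x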


A stack

module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit
  val try_pop : 'a t -> 'a option
end
module Stack : Stack = struct
  type 'a t = 'a list ref
  let make () = ref []
  let push stack elem = stack := elem :: !stack
  let try_pop stack =
      let elems = !stack in
      match elems with
      | [] -> None
      | elem :: rest -> stack := rest; Some elem
end

Note: Let's build something real. Here is an entirely traditional, sequential stack data structure. * It uses a ref to hold a list. * push conses an element to the list. * And try_pop removes an element, taking care to handle the case of an empty stack.


A transactional stack (1/2)

module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit Tx.t
  val try_pop : 'a t -> 'a option Tx.t
end
module Stack : Stack = struct
  type 'a t = 'a list Loc.t
  let make () = Loc.make []
  let push stack elem = Tx.modify stack (List.cons elem)
  let try_pop stack = Tx.(
      let* elems = get stack in
      match elems with
      | [] -> return None
      | elem :: rest -> set stack rest >>. Some elem )
end

Note: And here is a transactional stack. First take note of the signature. The * push and * try_pop operations return transactions. The implementation uses * shared memory locations. * push is pretty trivial using modify. * try_pop is more complex as it needs to handle the case of an empty stack. Note that there are no retry loops or backoffs. The push and try_pop transactions are not significantly more complex than similar code using refs.


A transactional stack (2/2)

module type Stack = sig
  type 'a t
  val make : unit -> 'a t
  val push : 'a t -> 'a -> unit Tx.t
  val try_pop : 'a t -> 'a option Tx.t
end
module Stack : Stack = struct
  type 'a t = 'a list Loc.t
  let make () = Loc.make []
  let push stack elem = Tx.modify stack (List.cons elem)
  let try_pop stack = Tx.update_as hd_opt stack tl_safe
end
let hd_opt = function [] -> None | elem :: _ -> Some elem
let tl_safe = function [] -> [] | _ :: rest -> rest

Note: Here is a more concise implementation of try_pop using * update_as. It makes use of two useful list manipulation functions * hd_opt and * tl_safe. Aside from being more concise, this is also likely a bit faster as this only makes a single access to the internal transaction log.


Using a transactional stack

let a_stack = Stack.make ()

Tx.commit ( Stack.push a_stack 101 )

assert ( Tx.commit ( Stack.try_pop a_stack ) = Some 101 )

assert ( Tx.commit ( Stack.try_pop a_stack ) = None )

Why not bake in commit to push and try_pop?

Note: To use the transactional stack, one just calls the stack operations to prepare transactions and then commits them. ** The commit function takes care of retrying transactions with randomized exponential backoff to avoid contention. You might ask why we didn't just call commit inside push and try_pop. We'll come back to that soon.


A transactional queue

module type Queue = sig
  type 'a t
  val make : unit -> 'a t
  val enqueue : 'a t -> 'a -> unit Tx.t
  val try_dequeue : 'a t -> 'a option Tx.t
end
module Queue : Queue = struct
  type 'a t = { front: 'a list Loc.t; back: 'a list Loc.t }
  let make () = { front = Loc.make []; back = Loc.make [] }
  let enqueue q x = Tx.modify q.back (List.cons x)
  let try_dequeue q = Tx.(
    update q.front tl_safe >>= function
    | x :: _ -> return (Some x)
    | [] -> exchange_as List.rev q.back [] >>= function
              | [] -> return None
              | x :: xs -> set q.front xs >>. Some x
  )
end

Note: So, here we have a lock-free queue that fits on a single slide and is pretty much guaranteed to be free of concurrency issues. * It uses two shared memory locations for the front and back of the queue. * The enqueue operation is just a push to the back. * The try_dequeue operation is more complicated as it needs to handle several cases. ** But is this queue any good? It can actually perform quite well, as we'll see shortly, but it does have one major weakness. Can you see it? * The problem is that reversal of the back might take a long time and that could cause consumers to starve. As an exercise you could implement a transactional queue where both operations are O(1) and not just in the amortized sense.


The mother of learning

let a_queue = Queue.make ()

Tx.commit ( Queue.enqueue a_queue 101 )

assert (Tx.commit ( Queue.try_dequeue a_queue ) = Some 101)

assert (Tx.commit ( Queue.try_dequeue a_queue ) = None)

Again, why not bake in commit?

Note: There is nothing new to learn on this slide. It just shows again how to use transactions.


Because Txs compose (1/2)

Tx.(
  commit (
    Stack.push a_stack 3 >>
    Stack.push a_stack 1 >>
    Stack.push a_stack 4
  )
)

Note: The reason we didn't bake in commit is that composition of transactions is not limited to primitive transactions. Here is an example of pushing three elements ** to a transactional stack atomically.


Because Txs compose (2/2)

Tx.(
  commit (
    Stack.try_pop a_stack >>= function
    | Some element ->
      Queue.enqueue a_queue element
    | None ->
      return ()
  )
)

Note: Composition also isn't limited to module abstraction boundaries. Here is an example of moving an element * from a transactional stack to a * transactional queue * atomically. The transaction mechanism makes it trivial to compose independently developed transactions and allows transactional data structures and algorithms to be used for a wider variety of purposes.


The quick brown fox jumps over the lazy dog

Is it a 🦊 or a 🐶 ?

Note: The Tx abstraction that kcas offers seems really nice, but surely it is going to perform poorly? Indeed, what about performance? Is the transaction mechanism a lazy dog and CAS the quick brown fox? Well, the mechanism definitely adds overhead, but perhaps not as much as one might expect.


Maybe it is worth giving a try

algorithm           mean       sd
Reagent MS queue    0.045950   0.000023
1-lock queue        0.027613   0.000086
Tx node queue       0.022827   0.000014
Lockfree MS queue   0.011395   0.000023
2-lock queue        0.010299   0.000005
Tx 2-stack queue    0.009895   0.000000

Always take benchmarks with plenty of salt!

Note: This table is based on running a queue benchmark that is part of the Reagents library. The timings vary between runs, but what is interesting is that the k-CAS Tx based queues perform quite well. It is actually pretty interesting that the simple 2-stack Tx queue is competitive with the hand-written lock-free Michael-Scott queue. One factor in this is likely that the 2-stack queue probably needs fewer GC write-barriers. Also, a lot of time has been spent tweaking the k-CAS implementation to minimize overheads. Finally, I should note that the transactional node based queue is just a very straightforward implementation. I would not be surprised if someone came up with a faster Tx queue with constant time operations.


3. The lock-free GKMZ k-CAS algorithm

🤓 ❤️

Note: Next I'll describe the algorithms used by the kcas library internally. While I do not expect you to understand every detail of the algorithms after this talk, I hope to convey a few key ideas and details.


GKMZ?

Efficient Multi-word Compare and Swap

Guerraoui, Kogan, Marathe, Zablotchi

k+1 CASes per k-CAS!

Note: kcas is based on the GKMZ algorithm, published in 2020, which achieves a near-optimal k+1 single-word CASes per k-CAS. How is that possible? It is a cool trick that is worth understanding. Let me explain.


The tables — I mean the types

 type 'a loc = 'a state Atomic.t
 and 'a state = { before : 'a; after : 'a; casn : casn }
 and casn = status Atomic.t
 and status = Undetermined of cass list | After | Before
 and cass = CASS : 'a loc * 'a state -> cass
let make after = Atomic.make {
    before = after;
    after;
    casn = Atomic.make After
  }

internal CASS-descriptor vs logical CAS-operation

Why does this only need k+1 CASes?

Note: The saying "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowcharts; they'll be obvious." has been attributed to Fred Brooks. So, if you take a moment to think about the data structures, or tables, used in GKMZ, it should be obvious why the algorithm achieves k+1 CASes. ** I actually stopped reading the GKMZ paper as soon as I got the trick and just implemented it following my intuition.


Hint

Consider the traditional approach:

type 'a loc = [ `Value of 'a | `Descriptor of _ ] Atomic.t

How many CASes are needed at minimum?

(* 2nd hint: *) `Value x => `Descriptor _ => `Value y

Take home: Can you use the GKMZ trick elsewhere?

Note: The traditional approach does more work. Can you tell how many CASes are needed at minimum? The answer is 2*k + 1. You will need at least one CAS to switch each location to hold a Descriptor and another CAS to switch it back to holding a Value. And then one more CAS is needed to complete the k-CAS operation. GKMZ avoids the second CAS per location simply by not requiring locations to hold just plain values.


Example

Given

let a = make 10 and b = make 52 and x = make 0

after

atomically [CAS(a, 10, 10); CAS(b, 52, 52); CAS(x, 0, 42)]

objects could look like this

let casn = Atomic.make After
let x = Atomic.make {before = 0; after = 42; casn}
let a = Atomic.make {before = 10; after = 10; casn}
let b = Atomic.make {before = 52; after = 52; casn}

Note: With GKMZ the locations (may) always hold state descriptors. When reading a location one then needs to check which of the values to read. Note that one of the before or after values is essentially garbage. A real implementation will also need to perform a release operation to keep the state descriptors from retaining otherwise inaccessible objects. It is a simple detail to take care of after the operation has completed; the sketch below gives the rough idea.
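A rough sketch of such a release step, assuming the before and after fields of 'a state (shown as immutable on the types slide) were declared mutable; the real kcas differs in detail:

(* Once the outcome is decided, overwrite the value that lost so the
   descriptor no longer keeps it reachable and the GC can reclaim it. *)
let release casn cass =
  let success = Atomic.get casn == After in
  cass
  |> List.iter (function CASS (_loc, state) ->
       if success then state.before <- state.after
       else state.after <- state.before)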


The GKMZ algorithm (1/3)

 let get loc =
   let state = Atomic.get loc in
   if is_after state.casn then
     state.after
   else
     state.before
 let atomically logical_cas_list =
   let casn = Atomic.make After in
   let rec run cass = function
     | [] ->
       Atomic.set casn (Undetermined cass);
       gkmz casn cass
     | CAS (loc, before, after) :: rest ->
       run (CASS (loc, {before; after; casn})::cass) rest
   in
   run [] logical_cas_list

Note: As mentioned, to read a location, as in get, the operation needs to determine * which one of the values, before or after, to use. * The is_after function, which we'll soon see, does that. The atomically function, * converts logical CAS operations into internal CASS descriptors. An important detail is that it * creates a cyclic data structure. The cycle is what makes it possible for other domains to get hold of the entire data structure and help to complete an operation initiated by some domain. This is a common pattern in non-blocking algorithms — it might make sense to support let rec with atomics.


The GKMZ algorithm (2/3)

let rec gkmz casn = function
  | [] -> finish casn After
  | CASS (loc, now) :: next as this ->
    let old = Atomic.get loc in
    if now == old then gkmz casn next else
    let old_value =
      if is_after old.casn then old.after else old.before in
    if old_value != now.before then finish casn Before else
    match Atomic.get casn with
    | Undetermined _ ->
      gkmz casn (if Atomic.compare_and_set loc old now
                 then next else this)
    | determined -> determined == After
 and is_after casn =
   match Atomic.get casn with
   | Undetermined cass -> gkmz casn cass
   | determined -> determined == After

Note: This is the core of the GKMZ algorithm. It basically just goes through the list of internal descriptors, * compares the current value in each location, * checks that the operation is still undetermined, * and then CASes the new state to the location. After all the locations have been handled or an inconsistency is detected, * it calls finish to attempt to complete the operation. We'll look at finish shortly. Note that is_after and gkmz are mutually recursive. When is_after is called, it may encounter an undetermined operation. In such a case it calls gkmz in order to complete the operation. This is basically what allows the algorithm to be lock-free. No domain is forced to indefinitely wait for another. BTW, in the beginning I mentioned that writing lock-free algorithms is subtle business. * The three Atomic operations in the core algorithm are done in a specific order. What could happen if you changed that order? I'll leave that as an exercise.


The GKMZ algorithm (3/3)

 let finish casn desired =
   match Atomic.get casn with
   | Undetermined _ as current ->
     Atomic.compare_and_set casn current desired |> ignore;
     Atomic.get casn == After
   | determined -> determined == After

Note: The last part of the algorithm is finish. * It attempts the final CAS to complete an operation * and then it does one more check to see whether the operation completed successfully or failed. This is also what makes the algorithm atomic. Before the final CAS is performed, no domain may actually read any of the CASed locations.


The real implementation

  • Uses a splay tree to order CASS descriptors
    • GKMZ consumes internal descriptors anyway
    • O(k) construction when CASes are ordered
    • Also used as the transaction log
      • Optimizes for recently accessed
  • Performs a release step after final CAS
    • Remove reference to unused before or after
    • Also frees stable casn descriptors
  • Has many small tweaks to improve performance

Note: The implementation in these slides is, of course, a simplified version. The real implementation...


4. New obstruction-free k-CAS-n-CMP algorithm

Could we do even better? 🤔

Note: The GKMZ algorithm is nearly optimal for k-CAS. What more could one want? Well, one thing that k-CAS lacks is efficient read-only CMP operations. Next I'll describe a brand-new obstruction-free k-CAS-n-CMP algorithm that extends GKMZ.


Motivation

let a = Loc.make 10 and b = Loc.make 52
and x = Loc.make 0 and y = Loc.make 0
let ping () =
  commit (let* a' = get a
          and* b' = get b in
          set x (b' - a'))
let pong () =
  commit (let* a' = get a
          and* b' = get b in
          set y (a' + b'))
ping => [ CAS (a, 10, 10); CAS (b, 52, 52); CAS (x, 0, 42) ]
pong => [ CAS (a, 10, 10); CAS (b, 52, 52); CAS (y, 0, 62) ]

ping ) a b ( pong

Note: But before we go into the details, here is a motivating example using the Tx API. We have four disjoint locations a, b, x, and y. We have two transactions that both read the two locations a and b. One of the transactions then writes to location x and the other writes to location y. What happens when the two transactions are attempted in parallel? Well, they aren't run in parallel: GKMZ will serialize the operations. To add insult to injury, a CAS always writes to memory. So, even though the a and b locations are not logically modified, they are written to, which causes contention. The way shared memory coherence protocols work, after a write to memory only the writer has a copy of that memory location. Assume the above two transactions are done sequentially. The first transaction writes to a and b, which means that a and b are evicted from all the caches except the writer's cache. The second transaction then has to fetch a and b more expensively, writes to them, and again evicts them from all the other caches. This is called cache-line ping-pong.


🤔 an Idea!

Verify CMPs after CASes and before final CAS.

But that is only obstruction-free...

Note: How could we implement read-only compares? A simple idea would be to just verify that the compares still hold just before performing the final CAS. This sort of post hoc validation is actually a pretty common pattern in non-blocking algorithms. But it only gives us obstruction-freedom — not lock-freedom.


Obstruction-free

An algorithm is obstruction-free if at any point, a single thread executed in isolation for a bounded number of steps will complete its operation.

Why is a verification step not lock-free?

[CMP (x, 0); CAS (y, 0, 1)] and [CAS (x, 0, 1); CMP (y, 0)]

Note: Recall that lock-freedom requires that at least one domain makes progress. Consider a domain performing a CMP on x and CAS on y and another domain doing a CAS on x and CMP on y. Both may be able to enter the verification step (can you see how?) and both would fail during the verification step. This could repeat indefinitely and neither domain could make progress.


The k-CAS-n-CMP algorithm (1/3)

 let atomically logical_cas_list =
   let casn = Atomic.make After in
   let rec run cass = function
     | [] ->
       Atomic.set casn (Undetermined cass);
       gkmz casn cass
     | CAS (loc, before, after) :: rest ->
       run (CASS (loc, {before; after; casn}) :: cass) rest
+    | CMP (loc, exp) :: rest ->
+      let old = Atomic.get loc in
+      get loc == exp && Atomic.get loc == old &&
+      run (CASS (loc, old) :: cass) rest
   in
   run [] logical_cas_list

Note: Let's just see how to change GKMZ to give k-CAS-n-CMP. The changes are shown as diffs against the reference GKMZ implementation. We modify atomically such that it also accepts CMP operations. It then translates those to CASS descriptors without creating a new state.


The k-CAS-n-CMP algorithm (2/3)

+let is_cmp casn state = state.casn != casn
 let rec gkmz casn = function
   | [] -> finish casn After
   | CASS (loc, now) :: next as this ->
     let old = Atomic.get loc in
     if now == old then gkmz casn next else
+    if is_cmp casn now then finish casn Before else
     let old_value =
       if is_after old.casn then old.after else old.before in
     if old_value != now.before then finish casn Before else
     match Atomic.get casn with
     | Undetermined _ ->
       gkmz casn (if Atomic.compare_and_set loc old now
                  then next else this)
     | determined -> determined == After

Note: So, how can one tell whether a CASS is a comparison? We just check whether it refers to the casn of our operation or not. The casn is always freshly allocated by atomically. If the new and old state differ, we check whether we were attempting a read-only compare. If so, then we conclude that the operation failed.


The k-CAS-n-CMP algorithm (3/3)

 let finish casn desired =
   match Atomic.get casn with
-  | Undetermined _ as current ->
+  | Undetermined cass as current ->
+    let desired =
+      if desired == After
+         && cass |> List.exists (function CASS (loc, state) ->
+              is_cmp casn state && Atomic.get loc != state)
+      then Before
+      else desired
+    in
     Atomic.compare_and_set casn current desired |> ignore;
     Atomic.get casn == After
   | determined -> determined == After

Note: The biggest change to the GKMZ algorithm is the addition of the post hoc verification step. Just before attempting to complete the transaction we go through all the CASS descriptors and check once more that the read-only comparisons are still valid.


Freedom!

let rec commit backoff mode xt =
  match attempt mode xt with
  | result -> result
  | exception Interference ->
      let backoff = Backoff.once backoff in
      commit backoff Mode.lock_free xt
  | exception Exit ->
      let backoff = Backoff.once backoff in
      commit backoff mode xt

Note: The actual implementation distinguishes between failure before and during verification. If a transaction attempt fails during verification * then the Interference exception is raised. This allows the commit loop to * switch to lock-free mode, which only uses CAS operations and requires no verification step. This guarantees that with commit transactions are lock-free. So, we don't actually lose anything!


Read-only updates

module Queue : Queue = struct
  type 'a t = { front: 'a list Loc.t; back: 'a list Loc.t }
  let make () = { front = Loc.make []; back = Loc.make [] }
  let enqueue q x = Tx.modify q.back (List.cons x)
  let try_dequeue q = Tx.(
    update q.front tl_safe >>= function
    | x :: _ -> return (Some x)
    | [] -> exchange_as List.rev q.back [] >>= function
              | [] -> return None
              | x :: xs -> set q.front xs >>. Some x
  )
end

Does try_dequeue perform read-only CMPs?

Note: Recall the queue implementation. If the queue is empty, the try_dequeue transaction may only perform read-only comparisons. Also, if the front is empty and the back contains only a single element, the front is not necessarily written. That is because * in those cases the try_dequeue transaction never writes a new value to those locations. The transaction mechanism is smart enough to optimize that to a read-only compare, as illustrated below. This means that multiple domains can try to dequeue an element from an empty queue in parallel without interference. The qualifiers "may" and "not necessarily" are there to account for the possibility of interference from writers.
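For instance, in obstruction-free mode a try_dequeue of an empty queue would plausibly boil down to a list of read-only operations along these lines (using the earlier notation; this is an illustration, not the library's actual internals):

[ CMP (q.front, []); CMP (q.back, []) ]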


5. Blocking transactions

Note: What we have so far is probably about as far as kcas can be stretched without adding significant dependencies or overheads, but perhaps we could have a back door for an additional feature: blocking transactions. What is that?


Motivation

stm @@ let* x = Queue.dequeue qa in
       Queue.enqueue qb x

What if qa is empty?

Note: Consider a transaction that dequeues a value x from qa and enqueues it to qb. Seems straightforward. But what if qa is empty? Should we just keep on retrying in the hope that some other domain pushes an element into qa? That sort of busy-wait is usually a poor idea.


Try, try, try again!

let dequeue q =
  Queue.try_dequeue q >>= function
  | None ->
    retry ()
  | Some x ->
    return x

a new, modular form of blocking

Note: How might one even implement a dequeue operation that always returns a (non-optional) value? The idea here is that * retry suspends the transaction, waits until some location accessed during the transaction is mutated, and then restarts the transaction. This kind of retry is a key idea put forth in the Composable Memory Transactions paper.


Idea: Tagged locations

module Loc : sig
  type ('tag, 'a) tagged

  val make_tagged: 'tag -> 'a -> ('tag, 'a) tagged
  val get_tag : ('tag, 'a) tagged -> 'tag

  type 'a t = (unit, 'a) tagged

  (* ... *)
end

To associate waiters with a location.

Note: To support blocking, one essentially needs a way to signal waiters. After mutating some locations the mutator signals waiters. For a scalable mechanism that signal needs to be selective and only wake up those waiters that are interested in the mutated locations. To associate waiters with locations in a truly low-overhead fashion, one possibility would be to allow locations to be "tagged". In a blocking transaction mechanism that 'tag could be a bag of the waiters of changes to the location.
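To illustrate the intent, here is a hypothetical use of the proposed API, modelling the bag of waiters as a plain list ref (a real implementation would use a proper concurrent bag):

type waiter = unit -> unit  (* hypothetical: resumes a suspended fiber *)

let counter : (waiter list ref, int) Loc.tagged =
  Loc.make_tagged (ref []) 0

(* After mutating the location, a writer would signal the waiters: *)
let signal_all loc = List.iter (fun wake -> wake ()) !(Loc.get_tag loc)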


Idea: Limited access to log

module Tx : sig
  (* ... *)
  module Log : sig
    type t
    type 'r reduce = {one: 't 'a. ('t, 'a) Loc.tagged -> 'r;
                      zero: 'r; plus: 'r -> 'r -> 'r}
    val reduce : 'r reduce -> t -> 'r
  end

  exception Retry of unit t
  val reset_and_retry : (Log.t -> unit t) -> 'a t

  val with_written: 'a t -> (Log.t -> unit t) -> 'a t
end

Note: To make it easy for a blocking transaction mechanism to figure out which locations need to be waited for, limited access to the transaction log could be granted. * reset_and_retry returns a transaction that resets the current transaction such that it only reads from the accessed locations. The given function is then called with * the internal transaction log to * construct a transaction that is then composed after the current transaction. The composed transaction is then * raised as a Retry exception. * with_written returns a transaction that executes as the given transaction and then calls the given function with a view of the internal transaction log restricted to the written locations. The returned transaction is then composed after the whole transaction.


STM Sketch

type 'a t = 'a Tx.t

let retry () = Tx.reset_and_retry @@ fun log ->
  (* add waiters to all (tagged) locations in log *)
let rec stm waiters tx =
  match Tx.attempt @@ Tx.with_written tx @@ fun log ->
  (* move waiters from (tagged) locations to waiters list *)
  with
  | result -> (* signal all waiters and then *) result
  | exception Exit -> stm waiters tx
  | exception Tx.Retry add_waiters_tx -> (
    match Tx.attempt add_waiters_tx with
    | () -> (* do blocking wait and retry *) stm waiters tx
    | exception Exit ->
      (* locations were mutated before waiters were added *)
      stm waiters tx)

let stm tx = stm (Loc.make []) tx

Note: Here is a quick sketch, with some blanks left as an exercise, of how one could implement STM with blocking transactions with the help of tagged locations, reset_and_retry, and with_written. The key point here is the type of blocking STM transactions on the first line: the blocking STM would not need to add a layer on top of the lower level k-CAS transactions. This should basically mean that blocking transactions could be implemented with roughly half the performance of plain non-blocking transactions, as every access potentially requires also touching the waiters list. I'm curious to find out how well this would perform against existing STM implementations!


Summary

  • Lock-free programming with CASes is hard 💦

  • Programming with Tx is easy by comparison (and composition!)

  • Thanks to

    • careful low-level optimizations, splay tree,
    • new k-CAS-n-CMP algorithm, and
    • interference detection

    Tx can give ok performance and lock-freedom

Tx is sweat-free concurrency!


Questions?

Note: Thank you for listening! I'm out of slides.
