Skip to content

Instantly share code, notes, and snippets.

@RaasAhsan
Last active June 16, 2023 06:37
Show Gist options
  • Star 10 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save RaasAhsan/8e3554a41e07068536425ca0de46c9e8 to your computer and use it in GitHub Desktop.
Save RaasAhsan/8e3554a41e07068536425ca0de46c9e8 to your computer and use it in GitHub Desktop.
minimized ARM memory barrier violation
import java.util.concurrent.atomic.*;
import java.util.concurrent.*;
public class Main {
private static ExecutorService executor = Executors.newFixedThreadPool(2);
private static int iterations = 10000000;
public static class Runner {
// writes to canceled happen before a CAS on suspended
// reads on canceled happen after a CAS on suspended
private boolean canceled = false;
// an optimistic lock. false == locked, true == unlocked
private AtomicBoolean suspended = new AtomicBoolean(false);
private volatile boolean result = false;
private CountDownLatch latch = new CountDownLatch(1);
public void start() {
// start two tasks that synchronize on suspended and canceled
// this is a minimized version of a synchronization mechanism in cats effect
Future<?> f1 = executor.submit(() -> {
try {
latch.await();
} catch (Exception ex) {
ex.printStackTrace();
}
// assumption: this task already has the lock
// release the lock
suspended.compareAndSet(false, true);
if (canceled) {
// double-check, the other thread may have set canceled but failed the CAS check,
// so we'll try to reacquire the lock
if (suspended.compareAndSet(true, false)) {
result = true;
}
}
});
Future<?> f2 = executor.submit(() -> {
try {
latch.await();
} catch (Exception ex) {
ex.printStackTrace();
}
canceled = true;
// attempt to acquire the lock to set result.
// regardless of whether the CAS succeeds or not, the write to canceled should be published
if (suspended.compareAndSet(true, false)) {
result = true;
}
});
// signal threads to proceed
latch.countDown();
try {
// wait for tasks to complete
f1.get();
f2.get();
} catch (Exception ex) {
ex.printStackTrace();
}
// after both tasks have completed, result should be true
if (result != true) {
System.out.println("STUCK");
System.exit(0);
}
}
}
public static void main(String[] args) {
for (int i = 0; i < iterations; i++) {
Runner runner = new Runner();
runner.start();
}
System.exit(0);
}
}
@simonis
Copy link

simonis commented Nov 11, 2020

I don't think this program demonstrates any "memory barrier violation" on ARM. Printing "STUCK" is a valid execution path according to the current Java Memory Model and API specification.

The JMM was revised by JSR-133 starting with JDK 1.5. JSR 133 considerably strengthened the semantics of volatile fields. The following section from the JSR-133 FAQ is relevant for this discussion:

Writing to a volatile field has the same memory effect as a monitor release, and reading from a volatile field has the same memory effect as a monitor acquire. In effect, because the new memory model places stricter constraints on reordering of volatile field accesses with other field accesses, volatile or not, anything that was visible to thread A when it writes to volatile field f becomes visible to thread B when it reads f.

The following code example from the same source illustrates what this means:

class VolatileExample {
  int x = 0;
  volatile boolean v = false;
  public void writer() {
    x = 42;
    v = true;
  }

  public void reader() {
    if (v == true) {
      //uses x - guaranteed to see 42.
    }
  }
}

I.e. when v == true in reader() the JMM guarantees that x will have the value 42. This wasn't guaranteed before the JMM update by JSR-133.

Now lets have a look at the API specification of AtomicBoolean.compareAndSet(boolean expectedValue, boolean newValue):

Atomically sets the value to newValue if the current value == expectedValue, with memory effects as specified by VarHandle.compareAndSet(java.lang.Object...).

The if is important here. Let's see what are the memory effects specified by VarHandle.compareAndSet(java.lang.Object...):

Atomically sets the value of a variable to the newValue with the memory semantics of setVolatile(java.lang.Object...) if the variable's current value, referred to as the witness value, == the expectedValue, as accessed with the memory semantics of getVolatile(java.lang.Object...).

Checking the specification of VarHandle.getVolatile() and VarHandle.setVolatile we find that they "return the value of a variable, with memory semantics of reading" or "set the value of a variable to the newValue, with memory semantics of setting" as if "the variable was declared volatile".

This means that AtomicBoolean.compareAndSet() only behaves as a write to a volatile field if the preceding volatile read of that fieled returned a value equal to expectedValue. In other words, if the volatile read of the respective field isn't equal to expectedValue, AtomicBoolean.compareAndSet() will behave like a simple volatile read and won't exhibit the effects which are guaranteed for "volatile writes" by the JMM.

In the contect of your example, this means that in contrast to your comment, the following code sequence in Future f2:

canceled = true;
if (suspended.compareAndSet(true, false)) {
  result = true;
}

will not publish the write to canceled to other threads in cases where suspended.compareAndSet() fails because in such cases compareAndSet() degenerates to a simple volatile read which according to the JMM only has acquire but not release semantics.

It is therefore very well possible that Future f1, running in another thread, will successfully execute the suspended.compareAndSet(false, true) operation but still not see the updated value of canceled:

suspended.compareAndSet(false, true);
if (canceled) {
  ...
}

Implementation-wise I could see your test program printing "STUCK" when running on Graviton 1 but not on Graviton 2. The reason for this difference is that Graviton 1 doesn't support the Large System Extensions (LSE) introduced in ARMv8.1. Without LSE, a call to AtomicBoolean.compareAndSet() translates into the following machine code:

0x0000ffffa8b14090:   ldaxr	w8, [x10]              // Load-Acquire Exclusive from memory
0x0000ffffa8b14094:   cmp	w8, w11                // compare with expected value
0x0000ffffa8b14098:   b.ne	0x0000ffffa8b140a4     // jump out if not equal
0x0000ffffa8b1409c:   stlxr	w8, w13, [x10]         // Store-Release Exclusive to memory
0x0000ffffa8b140a0:   cbnz	w8, 0x0000ffffa8b14090 // Jump back and retry if exclusive store failed
0x0000ffffa8b140a4:   cset	x11, eq                // set result (i.e. did CAS succeed)

It uses a pair of LDAXR (Load-Acquire Exclusive)/STLXR (Store-Release Exclusive) instructions to implement the CAS. With this implementation, if the loaded value fails to equal to the expected value the store together with its release semantics is skipped (which is in total consensus with AtomicBoolean.compareAndSet()'s specification linked before).

On newer Graviton 2 CPU's which implement ARMv8.2 and support LSE, the following code will be generated instead:

0x0000ffff94b1403c:   casal	w8, w13, [x10]  // CAS with acquire/release semantics
0x0000ffff94b14040:   cmp	w8, w11         // compare with expected value
0x0000ffff94b14044:   cset	x11, eq         // set result (i.e. did CAS succeed?)

Here the LDAXR/STLXR pair is replaced by the new CASAL instruction which is faster and which has acquire/release semantics no matter if the CAS succeeds or not. That's the reason why your test seems to work on Graviton 2. But it can be easily made printing "STUCK" on Graviton 2 as well if we forcibly disable the usage of LSE instructions with the -XX:-UseLSE option.

On a side node, I can report that I've seen your test exting with "STUCK" on Power 8 as well. Power 8 also uses a pair of load/store exclusive instructions for implementing CAS and also has a weak memory model.

Update 2020-11-19: it seems that what I wrote before about the CASAL instruction is not totally true.

The "ARM Architecture Reference Manual ARMv8, for ARMv8-A architecture profile" notes the following about memory barriers in chapter B2.3.10 on page B2-139:

A Store-Release Exclusive instruction only has the release semantics if the store is successful.

Later, in section C3.2.12 about Atomic instructions on page C3-222 it notes the following about Compare and Swap:

The instructions are provided with ordering options, which map to the acquire and release definitions used in the
Armv8-A architecture. If a compare and swap instruction does not perform a store, then the instruction does not
have release semantics, regardless of the instruction ordering options.

I don't know why your example succeeds on Graviton 2 with CASAL. It might be that the actual chip implementation is stricter than the specification and still releases even if the store part of a CASAL wasn't executed. But it might be just as well that the "STUCK" output became just much more unlikely.

Notice that on x86_64 if you are prefixing a CMPXCHG with LOCK, you are guaranteed to get the release semantics of the write in any case:

This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically. To simplify the interface to the processor’s bus, the destination operand receives a write cycle without regard to the result of the comparison. The destination operand is written back if the comparison fails; otherwise, the source operand is written into the destination. (The processor never produces a locked read without also producing a locked write.)

@RaasAhsan
Copy link
Author

Thanks @simonis I'll take a deeper look later today, but the memory effects of atomics were precisely an assumption that we weren't sure was sound, and there seems to be a severe lack of elaboration on this point. I see many of your documentation links point to Java 11, but since we're mostly on Scala running Java 8, we've been referring to those docs.

In the atomics package documentation, it reads that "compareAndSet and all other read-and-update operations such as getAndIncrement have the memory effects of both reading and writing volatile variables." It's ambiguous whether the compareAndSet needs to succeed here for those memory effects to take place, but since it was left unspecified, we interpreted it to mean that that was always the case. It also lumps compareAndSet together with operations like getAndIncrement (which are CAS loops internally and so will eventually succeed), which suggested to us that the memory effects for every individual call should be consistent.

The Java 9 docs seem to make these details a bit more clear, so I'm more convinced now that our code is improperly synchronized

@mo-beck
Copy link

mo-beck commented Nov 11, 2020

+1 Volker (@simonis). You did an awesome job with your explanation. So, I won't do that in my reply here. But since I had already run these on my ThunderX2 system, I am attaching the below information for completeness. :)

Arm's supports LSE extensions since v8.1.
@RaasAhsan: I tried your minimized example as follows on a ThunderX2 system with and without LSE:

monica@c50-36-Ubun:~/projects/bmks/atomicbmk$ ../../jdks/jdk-16/bin/java -XX:-UseLSE Main
STUCK

the generated assembly here (without LSE) will look like this:
; - java.util.concurrent.atomic.AtomicBoolean::compareAndSet@22 (line 101)
0x0000ffff8a83dbd8: add x0, x1, #0xc
0x0000ffff8a83dbdc: ldaxr w8, [x0]
0x0000ffff8a83dbe0: cmp w8, w6
0x0000ffff8a83dbe4: b.ne 0x0000ffff8a83dbf0 // b.any
0x0000ffff8a83dbe8: stlxr w8, w7, [x0]
0x0000ffff8a83dbec: cbnz w8, 0x0000ffff8a83dbdc
0x0000ffff8a83dbf0: cset x8, ne // ne = any
0x0000ffff8a83dbf4: dmb ish``

And then you can try enabling LSE as shown here:
monica@c50-36-Ubun:~/projects/bmks/atomicbmk$ ../../jdks/jdk-16/bin/java -XX:+UseLSE Main
monica@c50-36-Ubun:~/projects/bmks/atomicbmk$
<note the lack of STUCK :)>

And the generated assembly again:

; - java.util.concurrent.atomic.AtomicBoolean::compareAndSet@22 (line 101)
0x0000ffff86847aa4: add x0, x1, #0xc
0x0000ffff86847aa8: mov x8, x6
0x0000ffff86847aac: casal w8, w7, [x0]
0x0000ffff86847ab0: cmp w8, w6
0x0000ffff86847ab4: cset x8, ne // ne = any
0x0000ffff86847ab8: dmb ish

@djspiewak
Copy link

Thank you all for the deep attention to this! You are all fantastic and this is incredibly helpful in so many ways.

@TheRealMDoerr
Copy link

canceled = true; if (suspended.compareAndSet(true, false))
is a Store-Load pattern. Nonvolatile Store is not ordered wrt. succeeding Volatile Load.
Should canceled not be volatile to fix this?

@djspiewak
Copy link

If canceled were volatile then this would work quite trivially. 😃 It would also be much much slower.

@TheRealMDoerr
Copy link

TheRealMDoerr commented Nov 12, 2020

If canceled were volatile then this would work quite trivially. 😃 It would also be much much slower.

Volatile has a price.
But without it, the JVM doesn't need to prevent reorderings as already explained above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment