lrytz/gist:a979bf2b410fc9058be317189856671c Secret

## gistfile1.md

      
    Raw
  

              gistfile1.md
            
          
    Dear Experts
I'm new on this list and not a concurrency expert. I'm the lead of the Scala team at Lightbend which maintains the compiler and standard library.
We recently had a bug report of a parallel collections operation being blocked on the termination of a Future started within the operation body. I managed to reproduce the behavior in a fairly small Java-only example using ForkJoinPool, the code is here: https://gist.github.com/lrytz/a622c541d91336030a8cb788273bdc66
The code starts a task, which forks two subtasks and joins them, the subtasks recursively do the same, until some depth. Each one of these tasks is fast (2ms). Additionally, the first task creates a subtask that is slow (200ms), but does not join on it. On Java 8, each top-level task returns quickly. On Java 9 and later, from time to time the top-level task gets blocked on the slow subtask and only returns after 200+ ms. I tested Java 1.8.0_261-b12 for 8, and 9+181, 11.0.6+10, 14.0.1+7, 15+36-1562.
The issue is more common when the pool's parallelism is lower. With a parallelism of 4, I did a 10 seconds flight recording, and there is a pretty obvious difference: on Java 8 there were > 20 pool threads, while on Java 11 there were only 5 of them. The screenshots show the long wait periods on Java 11: scala/bug#12106 (comment).
I did a short search of this list's archive and found one thread that looks somewhat related, though if I understand correctly that older discussion is about tasks that involve some sort of blocking, which is not the case in this example. http://altair.cs.oswego.edu/pipermail/concurrency-interest/2014-May/012669.html
I'd like to know whether the Java 9+ behavior is a bug or within spec - I hope it's an appropriate question for this list. https://docs.oracle.com/javase/9/docs/api/java/util/concurrent/ForkJoinPool.html says:

The pool attempts to maintain enough active (or available) threads by dynamically adding, suspending, or resuming internal worker threads, even if some tasks are stalled waiting to join others

Best,
Lukas

Olivier Peyrusse <...@...>
Hello Lukas,
I haven't found the exact lines of code that causes your problem, but basically, it comes from the change to the ForkJoinPool algorithm to create less additonal threads.
While previously calling #join() would create a new side thread to help completing the task being joined, the new algorithm tries in place to consume submitted tasks (I am still looking for the commit that changed that)
The thing is that calling #execute from within a ForkJoinThread, like you do within Test$T, is detected by the code [1], causing the task with the long sleep to be recorded in the same work queue as your other Test$T tasks. And it behaves differently than an ForkJoinPool#externalPush, as it could occurs when called from your main for example.
So when calling #join(), some workers may consider that executing the long-sleep task helps in the join. As a consequence, even if not joining explicitly on it, you do.
Changing Test$T#compute() to the following code highlights that it is what happens.
for (T t : subtasks) {
  final long begin = System.nanoTime();
  t.join();
  long duration = (System.nanoTime() - begin) / 1000_000;
  if (duration > 200) {
    System.out.println("Long sleep picked on a join");
  }
}
Not being a maintainer of the FJP, I cannot tell if this is an intended design or some edge-case.
Cheers
Olivier
[1] https://github.com/openjdk/jdk/blob/3c4e824aa5fd99337759fa5000f5673ab2430750/src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java#L1921-L1931