fwbrasil/doc.md

## doc.md

      
    Raw
  

              doc.md
            
          
    Scala Future optimizations


Flavio W. Brasil, July 2017

Problem


The Future implementation provided by the Scala Standard Library introduces considerable performance overhead when compared to other Future libraries. This cost is probably acceptable for low-scale systems but can be problematic at high scale.
Let’s consider this benchmark:

  
      private static final Function1 mapF = i -> "string";
  private static final ExecutionContext ec = scala.concurrent.ExecutionContext.global();
@Benchmark
public Future mapPromise() {
return Promise.apply().future().map(mapF, ec);
}
  

I’ve isolated the execution of this method in a single iteration after warmup and profiled it using yourkit’s allocation profiler. Creating a Promise and mapping it involves five object allocations and 120 bytes of memory footprint:

The jmh output confirms the profiling results:

  
    Benchmark
    Mode
    Threads
    Samples
    Score
    Score Error (99.9%)
    Unit
  
  
    mapPromise
    thrpt
    1
    3
    21879893.29
    1539495.68
    ops/s
  
  
    mapPromise:·gc.alloc.rate
    thrpt
    1
    3
    1668.95
    83.34
    MB/sec
  
  
    mapPromise:·gc.alloc.rate.norm
    thrpt
    1
    3
    120.00
    0.00
    B/op
  
  
    mapPromise:·gc.churn.PS_Eden_Space
    thrpt
    1
    3
    1582.11
    76.99
    MB/sec
  
  
    mapPromise:·gc.churn.PS_Eden_Space.norm
    thrpt
    1
    3
    113.76
    1.88
    B/op
  
  
    mapPromise:·gc.churn.PS_Survivor_Space
    thrpt
    1
    3
    0.05
    0.58
    MB/sec
  
  
    mapPromise:·gc.churn.PS_Survivor_Space.norm
    thrpt
    1
    3
    0.00
    0.04
    B/op
  
  
    mapPromise:·gc.count
    thrpt
    1
    3
    21.00
    NaN
    counts
  
  
    mapPromise:·gc.time
    thrpt
    1
    3
    12.00
    NaN
    ms
  

Solution


I’ve open sourced a new project that is a Future implementation in java inspired by Twitter’s Future and focused on performance. It supports most of the features provided by Twitter’s Future, including signals (cancellation) and locals.
The optimizations applied to this new project make the implementation diverge considerably from Twitter’s Future, but we were able to port some of the changes. The results were significant regarding of both CPU usage and allocation rate decrease.
For comparison, the same operation above described, when executed against the new project, requires only one object allocation for each operation (promise creation, mapping) and 40 bytes of memory:

The jmh output confirms the profiling results:

  
    Benchmark
    Mode
    Threads
    Samples
    Score
    Score Error (99.9%)
    Unit
  
  
    mapPromise
    thrpt
    1
    3
    59993938.44
    9740470.96
    ops/s
  
  
    mapPromise:·gc.alloc.rate
    thrpt
    1
    3
    1524.91
    261.10
    MB/sec
  
  
    mapPromise:·gc.alloc.rate.norm
    thrpt
    1
    3
    40.00
    0.00
    B/op
  
  
    mapPromise:·gc.churn.PS_Eden_Space
    thrpt
    1
    3
    1358.31
    20.65
    MB/sec
  
  
    mapPromise:·gc.churn.PS_Eden_Space.norm
    thrpt
    1
    3
    35.63
    5.83
    B/op
  
  
    mapPromise:·gc.churn.PS_Survivor_Space
    thrpt
    1
    3
    0.04
    0.38
    MB/sec
  
  
    mapPromise:·gc.churn.PS_Survivor_Space.norm
    thrpt
    1
    3
    0.00
    0.01
    B/op
  
  
    mapPromise:·gc.count
    thrpt
    1
    3
    18.00
    NaN
    counts
  
  
    mapPromise:·gc.time
    thrpt
    1
    3
    9.00
    NaN
    ms
  

The throughput of the new implementation (59,993,938.44) is 2.74 times higher than Scala’s Future (21,879,893.29) in this benchmark. It scores better in other scenarios as well.
The project has a set of benchmarks comparing Twitter’s, Scala’s, and Java’s Futures. I’ve compiled the jmh results into a spreadsheet with the throughput and gc scores.
Overview


From a general perspective, these are the techniques I’ve used to optimize the trane.io implementation:
Reduce closure allocations
Closures are so convenient and easy to create in Java 8 and Scala that it’s common to create one without thinking about the allocation of the closure object and references to the outer scope. For instance, this function (_ += _) is instantiated for each list element. It’d be possible to cache a single instance of it and reuse for all elements.
Merge interfaces
A typical pattern in the future implementation is creating a promise and other control structures to perform a transformation. It’s often possible to create a Promise new class that satisfies multiple interfaces and instantiate a single object. As an example, these three objects (p, completeFirst, foreach closure) can be merged into a single one.
Specialize implementations
Sometimes an abstraction can simplify the code but hurt performance. For example, Scala’s Future trait is a nice abstraction; it requires its subclasses to implement only a few methods. This approach comes with a performance penalty: the other methods require an additional closure allocation that adapts the user function to the transform function. Specializing methods often enables optimizations [1] [2] [3].
Keep methods small
JIT inlining is crucial to reduce CPU utilization. Some hot methods used by Scala’s Future can’t be inlined because they are too large or become large after inlining its nested methods. I’ve inspected the JIT log of the new project multiple times to make sure that the hot methods were being inlined. This work can be tricky with a scala codebase since the bytecode is often larger than the code we write.
The next sections introduce a few optimization ideas based on the revision 9ab72a2 from Jul 11 of the Scala repository. Please note that they are only speculative at this point and require proper validation through microbenchmarks if we decide to implement them.
Minor optimizations


Future.failed


Cache the transformation function as a companion object field since it doesn’t capture context


It might be worth throwing an exception without stack trace instead of a plain NoSuchElementException since the stack trace doesn’t add much information for Future execution and stack traces are expensive


Future.transform


The by-name parameter of Try.apply produces a closure with at least two fields (s and r). The closure could be avoided using a regular try/catch block that produces a Try


For failures, throw is probably more expensive than checking if the exception is fatal using NonFatal.apply.


Future.filter

Maybe it’s worth throwing a stackless exception if the predicate doesn’t hold

Future.collect


Cache the fallback function as a companion object field


Maybe use a stackless exception


Future.recoverWith

The applyOrElse function could be avoided using isDefined + apply. It’s not possible to cache the function since it captures context.

Future.fallbackTo

Instead of calling recoverWith twice, use tranformWith to avoid allocating two closures and avoid the double cost of calling recoverWith twice.

Future$.sequence

Cache the foldLeft, zipWith, and map functions as companion object vals

Future$.firstCompletedOf


Return the element if the collection has only one element


Create a Promise anon class that mixes the function trait that satisfies the foreach signature, avoiding the closure allocation.


Future$.find

Create a val for the transformWith function and reuse it

Future$.traverse

Reuse the zipWith and map cached functions created for Future$.sequence

impl.Promise.resolver

This method could take a Failure to avoid allocating a new instance if the exception doesn’t require special handling (last case)

impl.Promise.transform

Merge the function trait required by onComplete into the a new anon class that also extends DefaultPromise

impl.Promise.transformWith

Merge the function trait required by onComplete into a new anon class that also extends DefaultPromise

impl.DefaultPromise.result

Instead of calling value use get() directly and cast it to avoid the Option allocation

Medium optimizations


Future$.sequence

zipWith is called for each element of the collection and is too expensive. This method could collect results into an array and transform it to the result collection using the cbf. As reference, see the collect method of the trane.io impl.

Future$.traverse

This method could share some code with an optimized version of Future$.sequence.

Future.*

The kept promise classes already override methods that can be optimized based on the outcome of the promise, but other methods could also be optimized. For instance, Future.map could be implemented by KeptPromise.Successful and submit a lightweight Runnable to the ExecutionContext instead of going through the expensive path of calling transform. I suggest making the Future trait abstract, without method bodies, and reimplement all methods in an optimized manner in the concrete Promise classes.

Major optimizations


Introduce WaitQueue

Attaching new CallbackRunnables to Promises requires allocation of a new List. A new WaitQueue implementation that acts both as the callback and as the queue itself could avoid the List allocation. I’m not sure what’s the best approach: creating a pointer to the previous callback (immutable, but requires traversing to the root when executing) or to the next one (mutable, harder to manage).

Introduce ExecutionContext.immediate: Boolean


The Scala Future is an outlier when it comes to scheduling behavior. Most other implementations have a way to reuse the thread that is creating the composition to execute a transformation when possible, avoiding the high scheduling cost. Once the Future.* methods are implemented by the concrete classes, it’s possible to add an if condition to the appropriate methods that checks if immediate returns true and executes the transformation immediately.


This new method opens the path to implement an immediate-only ExecutionContext that could replace the internal executor, avoiding the Runnable allocation overhead


For recursive computations, an immediate and batching execution context similar to trane.io’s Tailrec (this class was inspired by Monix) would be a good addition


Non-optimization notes


Future$.firstCompletedOf

Shouldn’t this method fail if the collection is empty?

Future$.find

This method doesn’t seem to have the same behavior as deprecated version. It seems that, if the last element of the collection finishes first, it’ll never be the result since the implementation will wait for all other futures that come before it in the collection.

Conclusion


This document only gives an overview of the potential optimizations. If we decide to proceed, it will be much easier to discuss the details with code reviews and jmh results for each change.
Note that, even though the Scala Future currently has lower scores for most benchmarks, it has the potential to have better scores than trane.io and Twitter. It doesn’t implement cancellations and locals, so it doesn’t incur the performance overhead of these features.
License


This document is licensed under the Apache License 2.0.
Benchmark	Mode	Threads	Samples	Score	Score Error (99.9%)	Unit
mapPromise	thrpt	1	3	21879893.29	1539495.68	ops/s
mapPromise:·gc.alloc.rate	thrpt	1	3	1668.95	83.34	MB/sec
mapPromise:·gc.alloc.rate.norm	thrpt	1	3	120.00	0.00	B/op
mapPromise:·gc.churn.PS_Eden_Space	thrpt	1	3	1582.11	76.99	MB/sec
mapPromise:·gc.churn.PS_Eden_Space.norm	thrpt	1	3	113.76	1.88	B/op
mapPromise:·gc.churn.PS_Survivor_Space	thrpt	1	3	0.05	0.58	MB/sec
mapPromise:·gc.churn.PS_Survivor_Space.norm	thrpt	1	3	0.00	0.04	B/op
mapPromise:·gc.count	thrpt	1	3	21.00	NaN	counts
mapPromise:·gc.time	thrpt	1	3	12.00	NaN	ms