NQ >= 10'000
This is the rule of thumb for when you should even start thinking about using parallel streams over regular ones: the product of the number of data points (N) and the computation cost per data point (Q) should be above 10'000. Estimating Q is usually tricky; as a general rule, Q for addition is 1, and you can extrapolate from there (e.g., multiplication is quadratic in the input length). This rule actually comes from Brian Goetz himself, as you can see in his comment in this discussion on StackOverflow, and it is detailed in this description hosted by the State University of New York.
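As a rough sketch (the threshold and the Q values here are heuristics, not measurements; the helper name is my own invention), the rule can be encoded as a guard to think through before reaching for `parallelStream()`:

```java
public class NqHeuristic {
    // Hypothetical helper: apply the NQ rule of thumb. Q is a rough
    // relative cost per element, with addition taken as Q ~ 1.
    static boolean worthParallelising(long n, long q) {
        return n * q >= 10_000;
    }

    public static void main(String[] args) {
        // 1'000 cheap additions (Q ~ 1): stay sequential.
        System.out.println(worthParallelising(1_000, 1));   // false
        // 1'000 elements at Q ~ 100 (e.g., a gcd-like operation):
        // worth considering parallelism, per the rule.
        System.out.println(worthParallelising(1_000, 100)); // true
    }
}
```

Of course this only tells you when to start thinking about it, not that parallelism will actually pay off.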
You should also consider points 4 and 10 from this very good list about subtle mistakes when using the Streams API. Point 4 is about maxing out your CPU while digesting an infinite stream with a parallel stream (oops), while point 10 is actually a deadlock, which is easy to achieve with parallel streams in complex projects.
However, I would like to quote my favorite conclusion, which can be found at the end of this article:
> Parallel streams are unpredictable and complex to use correctly. Almost any use of parallel streams can affect the performance of other unrelated system components in an unpredictable way. I have no doubt that there are people who can manage to use them to their benefit, clearly and correctly. However, I’d think twice before typing stream.parallel() into my code and would look twice when reviewing the code containing it.
This actually leads back to a bottom line from Brian Goetz: use non-parallel streams. If you have a hunch that one of your streams is a bottleneck, first measure whether the bottleneck really is in the stream. Then think about whether the stream's job is parallelisable at all. If yes, think about how many problems it can cause. And then you usually realise that it is easier to refactor your process than to deal with the invisible and hard-to-control parallelism within the Stream API.
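A minimal sketch of the "measure first" step, using naive `System.nanoTime()` timing (good enough to confirm or kill a hunch; for real numbers use a harness like JMH, which handles warmup and dead-code elimination):

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MeasureFirst {
    static int sumSequential(List<Integer> data) {
        return data.stream().reduce(0, Integer::sum);
    }

    static int sumParallel(List<Integer> data) {
        return data.parallelStream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed for repeatability
        List<Integer> data = Stream.generate(() -> rnd.nextInt(1_000))
                .limit(100_000)
                .collect(Collectors.toList());

        long t0 = System.nanoTime();
        int seq = sumSequential(data);
        long t1 = System.nanoTime();
        int par = sumParallel(data);
        long t2 = System.nanoTime();

        // Sanity-check that both variants agree, then compare the timings.
        System.out.println("same result: " + (seq == par));
        System.out.println("sequential: " + (t1 - t0) / 1_000 + " µs");
        System.out.println("parallel:   " + (t2 - t1) / 1_000 + " µs");
    }
}
```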
Many people think this part of the API is actually for academic purposes only, not meant for production…
Nevertheless, I did some performance tests with JMH. I will attach the exact outputs as a 7z file (eventually, when I’m not too lazy to clone and push this gist); here is just a quick summary of the results. Disclosure: the results are not meant to be absolute, only relative to each other for comparison (e.g., I was using my PC during the measurement, so they are not at all "clean" numbers, just "good enough").
- Input data was 10 / 100 / 1'000 / 10'000 / 100'000 random `Integer` instances, generated statically before each test run.
- The tests were run with different `Collection`s, namely `ArrayList`, `LinkedList`, `HashSet`, `TreeSet` (from `java.util`) and `HashBag`, `MultiSet` (from `apache.commons.collections4`).
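For illustration, the setup can be reproduced along these lines (the seed and the factory-based helper are my own choices for the sketch, not the exact code from the actual runs):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TestData {
    // Generate n random Integers once up front and pour them into the
    // collection type under test, supplied via its constructor reference.
    static <C extends Collection<Integer>> C generate(int n, Supplier<C> factory) {
        Random rnd = new Random(123); // fixed seed for repeatable runs
        return Stream.generate(rnd::nextInt)
                .limit(n)
                .collect(Collectors.toCollection(factory));
    }

    public static void main(String[] args) {
        List<Integer> list = generate(10_000, ArrayList::new);
        Set<Integer> set = generate(10_000, HashSet::new);
        System.out.println(list.size()); // 10000
        // set.size() may be slightly below 10'000 if random duplicates collide
        System.out.println(set.size() <= 10_000);
    }
}
```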
- Three different (parallel) streams were tested:
  - Reducing the `Collection` with a trivial associative function: a simple `x + y`
  - Reducing the `Collection` with a non-trivial non-associative function: `sqrt(x^y mod MOD)`, where `MOD` is 1522605027 (the first 10 digits of RSA-100’s first half)
  - Reducing the `Collection` with a non-trivial associative function: `gcd(x, y)`
- As the NQ-rule suggests, it is not worth using parallel streams for additions on a small number of data points ⇒ 190 vs 11'933 ops/ms for 10 input data; 111 vs 17'474 for 100; 75 vs 181 for 1'000; 18 vs 16 for 10'000 (remember the rule about NQ >= 10'000? :D); 1.356 vs 0.580 for 100'000 (the numbers are for `ArrayList`s; every `Collection` behaves along similar lines).
- Never ever (ever, ever) use parallel streams for non-associative reductions. As becomes clear after some thinking, such a reduction can only be done in a linear fashion, and forcing it into a parallel framework just slows it down. Example numbers for `ArrayList`s (again, every other `Collection` behaved very similarly): 0.03-0.08 reductions per millisecond in parallel streams (item count does not matter) vs. 15'910.52 (for 10 items) to 24'181 (for 1'000 items) reductions per millisecond sequentially.
- Reduction with an associative function in parallel streams seems to be OK, keeping the NQ-rule in mind. For `gcd` the `ArrayList` scores were: 120 vs 1'096 for 10; 67 vs 62 for 100 (thus Q for `gcd` is about 100); 15 vs 9 for 1'000; 1.522 vs 0.598 for 10'000; 0.100 vs 0.046 for 100'000.
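To make the associativity point concrete, here is a sketch of the reduction shapes (plain subtraction stands in for the benchmark's `sqrt(x^y mod MOD)` as a simpler non-associative accumulator). The `Stream.reduce` contract requires an associative accumulator, so a non-associative one in parallel is not merely slow; its result is undefined:

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Reductions {
    static int gcd(int a, int b) {
        return b == 0 ? Math.abs(a) : gcd(b, a % b);
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        List<Integer> data = Stream.generate(() -> rnd.nextInt(1_000_000) + 1)
                .limit(100_000)
                .collect(Collectors.toList());

        // Associative reductions: sequential and parallel agree.
        int sumSeq = data.stream().reduce(0, Integer::sum);
        int sumPar = data.parallelStream().reduce(0, Integer::sum);
        int gcdSeq = data.stream().reduce(0, Reductions::gcd); // gcd(0, x) == x
        int gcdPar = data.parallelStream().reduce(0, Reductions::gcd);

        // Non-associative accumulator: parallel reduce splits the data into
        // chunks and recombines them, so the answer depends on the chunking.
        int subSeq = data.stream().reduce(0, (x, y) -> x - y);
        int subPar = data.parallelStream().reduce(0, (x, y) -> x - y);

        System.out.println(sumSeq == sumPar); // true
        System.out.println(gcdSeq == gcdPar); // true
        System.out.println(subSeq == subPar); // almost certainly false
    }
}
```

Note that `0` works as the identity for the `gcd` reduction because `gcd(0, x) == x`.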
- Two different (parallel) streams were tested:
  - Mapping the `Integer`s to their doubles as `map(x -> 2 * x)`, then collecting the stream into a list
  - Mapping the `Integer`s as `map(x -> x^EXP mod MOD)`, where `EXP` is 80688 (the first 5 digits of RSA-100’s second half) and `MOD` is 1522605027 (the first 10 digits of RSA-100’s first half), then collecting the stream into a list
    - Note that for parallel streams over unordered sources (e.g., a `HashSet`) the result list has an unpredictable order w.r.t. the input (the collector collects whatever comes out of any of the pipes of the parallel stream)
- As one would guess after all the above, doubling should not be parallel for small numbers of inputs: 145 vs 3'139 for 10; 82 vs 577 for 100; 41 vs 45 for 1'000; 6.166 vs 5.112 for 10'000; 0.904 vs 0.600 for 100'000.
- As one would guess after all the above, parallelising the modular exponentiation makes sense pretty early: 63 vs 81 for 10; 15 vs 6 for 100; 1.435 vs 1.004 for 1'000; 0.169 vs 0.088 for 10'000; 0.020 vs 0.012 for 100'000.
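A sketch of the second mapping, with the constants quoted above (modular exponentiation by squaring keeps every intermediate value of `x^EXP mod MOD` within a `long`, since `MOD^2` fits). It also illustrates the ordering point: for an ordered source (a `List`) collected with `toList()`, even the parallel result keeps the encounter order; the unpredictability applies to unordered sources such as `HashSet`:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MapCollect {
    static final long EXP = 80_688L;          // first 5 digits of RSA-100's second half
    static final long MOD = 1_522_605_027L;   // first 10 digits of RSA-100's first half

    // Modular exponentiation by squaring: O(log e) multiplications,
    // each intermediate product below MOD^2 and thus within a long.
    static long powMod(long x, long e, long m) {
        long result = 1, base = Math.floorMod(x, m);
        while (e > 0) {
            if ((e & 1) == 1) result = result * base % m;
            base = base * base % m;
            e >>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = IntStream.rangeClosed(1, 10_000).boxed()
                .collect(Collectors.toList());

        List<Long> seq = input.stream()
                .map(x -> powMod(x, EXP, MOD))
                .collect(Collectors.toList());
        List<Long> par = input.parallelStream()
                .map(x -> powMod(x, EXP, MOD))
                .collect(Collectors.toList());

        // Ordered source + toList() preserves encounter order even in parallel.
        System.out.println(seq.equals(par)); // true
    }
}
```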
- Keep in mind that we are talking about microbenchmarks here. The numbers above roughly mean that, for example, a non-parallel stream can do about 1'200 such modular exponentiations per millisecond. Everything is relative; this is just a performance benchmark. After a while the overhead of the parallel stream framework becomes smaller than the work done by the threads, but one should still consider whether this gain is worth the extra thought, time and risk (see the linked articles for the risks…).
- Also keep in mind that all the numbers listed above are for `ArrayList`s; for more complex data structures, the parallel streams take over earlier, or the ratios differ. E.g., for a `MultiSet` the modular exponentiation for 100 `Integer`s gets a score of 2.589 on a single thread and 10.705 on parallel streams; that’s a factor of about four for 100 numbers. That factor is 22 for 100'000 numbers… and here I refer to the point above: we are talking about 40 vs. 880 milliseconds here for digesting 100'000 data points.
Don’t even think about parallel streams unless you have proof that one of your streams is a real bottleneck. Even then, first look around whether you can streamline the surroundings of the stream so that it does not become a bottleneck. Only if both conditions hold should you consider using parallel streams in production. And even then, be careful.
And of course I close with D. Knuth:
> (For rookies) Don’t optimize.
> (For gurus) Don’t optimize, yet.
just programmers' folklore