NQ >= 10'000
This is the rule of thumb for when you should even start thinking about using parallel streams over regular ones: the product of the number of data points (N) and the computation cost per data point (Q) should be above 10'000. Estimating Q is usually tricky; as a general rule, Q for addition is 1, and you can extrapolate from there (e.g., multiplication is quadratic in the input length). This rule actually comes from Brian Goetz himself, as you can see in his comment in this discussion on StackOverflow, and it is detailed in this description hosted by the State University of New York.
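As a rough sketch (the threshold and the Q values here are heuristics, not measurements; the helper name is my own invention), the rule can be encoded as a guard to think through before reaching for `parallelStream()`:

```java
public class NqHeuristic {
    // Hypothetical helper: apply the NQ rule of thumb. Q is a rough
    // relative cost per element, with addition taken as Q ~ 1.
    static boolean worthParallelising(long n, long q) {
        return n * q >= 10_000;
    }

    public static void main(String[] args) {
        // 1'000 cheap additions (Q ~ 1): stay sequential.
        System.out.println(worthParallelising(1_000, 1));   // false
        // 1'000 elements at Q ~ 100 (e.g., a gcd-like operation):
        // worth considering parallelism, per the rule.
        System.out.println(worthParallelising(1_000, 100)); // true
    }
}
```

Of course this only tells you when to start thinking about it, not that parallelism will actually pay off.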
You should also consider points 4 and 10 from this very good list about subtle mistakes when using the Streams API. Point 4 is about maxing out your CPU while digesting an infinite stream with a parallel stream (oops), while point 10 is actually a deadlock, which is easy to achieve with parallel streams in complex projects.
However, I would like to quote my favorite conclusion, which can be found at the end of this article:
> Parallel streams are unpredictable and complex to use correctly. Almost any use of parallel streams can affect the performance of other unrelated system components in an unpredictable way. I have no doubt that there are people who can manage to use them to their benefit, clearly and correctly. However, I’d think twice before typing stream.parallel() into my code and would look twice when reviewing the code containing it.
This actually leads back to a bottom line from Brian Goetz: use non-parallel streams. If you have a hunch that one of your streams is a bottleneck, first measure whether the bottleneck really is in the stream. Then think about whether the stream's job is parallelisable at all. If yes, think about how many problems it can cause. And then you usually realise that it is easier to refactor your process than to deal with the invisible and hard-to-control parallelism within the Stream API.
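A minimal sketch of the "measure first" step, using naive `System.nanoTime()` timing (good enough to confirm or kill a hunch; for real numbers use a harness like JMH, which handles warmup and dead-code elimination):

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class MeasureFirst {
    static int sumSequential(List<Integer> data) {
        return data.stream().reduce(0, Integer::sum);
    }

    static int sumParallel(List<Integer> data) {
        return data.parallelStream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42); // fixed seed for repeatability
        List<Integer> data = Stream.generate(() -> rnd.nextInt(1_000))
                .limit(100_000)
                .collect(Collectors.toList());

        long t0 = System.nanoTime();
        int seq = sumSequential(data);
        long t1 = System.nanoTime();
        int par = sumParallel(data);
        long t2 = System.nanoTime();

        // Sanity-check that both variants agree, then compare the timings.
        System.out.println("same result: " + (seq == par));
        System.out.println("sequential: " + (t1 - t0) / 1_000 + " µs");
        System.out.println("parallel:   " + (t2 - t1) / 1_000 + " µs");
    }
}
```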
Many people think this part of the API is actually for academic purposes only, not meant for production…
Nevertheless, I did some performance tests with JMH. I will attach the exact outputs as a 7z file (eventually, when I’m not too lazy to clone and push this gist); here is just a quick summary of the results. Disclosure: the results are not meant to be absolute, only relative to each other for comparison (e.g., I was using my PC during the measurement, so they are not at all "clean" numbers, just "good enough").
- Input data was 10 / 100 / 1'000 / 10'000 / 100'000 random `Integer` instances, generated statically before each test run.
- The tests were run with different `Collection`s, namely `ArrayList`, `LinkedList`, `HashSet`, `TreeSet` (from `java.util`) and `HashBag`, `MultiSet` (from `apache.commons.collections4`).
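For illustration, the setup can be reproduced along these lines (the seed and the factory-based helper are my own choices for the sketch, not the exact code from the actual runs):

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TestData {
    // Generate n random Integers once up front and pour them into the
    // collection type under test, supplied via its constructor reference.
    static <C extends Collection<Integer>> C generate(int n, Supplier<C> factory) {
        Random rnd = new Random(123); // fixed seed for repeatable runs
        return Stream.generate(rnd::nextInt)
                .limit(n)
                .collect(Collectors.toCollection(factory));
    }

    public static void main(String[] args) {
        List<Integer> list = generate(10_000, ArrayList::new);
        Set<Integer> set = generate(10_000, HashSet::new);
        System.out.println(list.size()); // 10000
        // set.size() may be slightly below 10'000 if random duplicates collide
        System.out.println(set.size() <= 10_000);
    }
}
```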
- Three different (parallel) streams were tested:
  - Reducing the `Collection` with a trivial associative function: a simple `x + y`
  - Reducing the `Collection` with a non-trivial non-associative function: `sqrt(x^y mod MOD)`, where `MOD` is 1522605027 (the first 10 digits of RSA-100’s first half)
  - Reducing the `Collection` with a non-trivial associative function: `gcd(x, y)`
- As the NQ-rule suggests, it is not worth using parallel streams for additions on a small number of data points ⇒ 190 vs 11'933 ops/ms for 10 input data; 111 vs 17'474 for 100; 75 vs 181 for 1'000; 18 vs 16 for 10'000 (remember the rule about NQ >= 10'000? :D); 1.356 vs 0.580 for 100'000 (the numbers are for `ArrayList`s; every `Collection` behaves along similar lines).
- Never ever (ever, ever) use parallel streams for non-associative reductions. As becomes clear after some thinking, such a reduction can only be done in a linear fashion, and forcing it into a parallel framework just slows it down. Example numbers for `ArrayList`s (again, every other `Collection` behaved very similarly): 0.03-0.08 reductions per millisecond in parallel streams (item count does not matter) vs. 15'910.52 (for 10 items) to 24'181 (for 1'000 items) reductions per millisecond sequentially.
- Reduction with an associative function in parallel streams seems to be OK, keeping the NQ-rule in mind. For `gcd` the `ArrayList` scores were: 120 vs 1'096 for 10; 67 vs 62 for 100 (thus Q for `gcd` is about 100); 15 vs 9 for 1'000; 1.522 vs 0.598 for 10'000; 0.100 vs 0.046 for 100'000.
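To make the associativity point concrete, here is a sketch of the reduction shapes (plain subtraction stands in for the benchmark's `sqrt(x^y mod MOD)` as a simpler non-associative accumulator). The `Stream.reduce` contract requires an associative accumulator, so a non-associative one in parallel is not merely slow; its result is undefined:

```java
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class Reductions {
    static int gcd(int a, int b) {
        return b == 0 ? Math.abs(a) : gcd(b, a % b);
    }

    public static void main(String[] args) {
        Random rnd = new Random(7);
        List<Integer> data = Stream.generate(() -> rnd.nextInt(1_000_000) + 1)
                .limit(100_000)
                .collect(Collectors.toList());

        // Associative reductions: sequential and parallel agree.
        int sumSeq = data.stream().reduce(0, Integer::sum);
        int sumPar = data.parallelStream().reduce(0, Integer::sum);
        int gcdSeq = data.stream().reduce(0, Reductions::gcd); // gcd(0, x) == x
        int gcdPar = data.parallelStream().reduce(0, Reductions::gcd);

        // Non-associative accumulator: parallel reduce splits the data into
        // chunks and recombines them, so the answer depends on the chunking.
        int subSeq = data.stream().reduce(0, (x, y) -> x - y);
        int subPar = data.parallelStream().reduce(0, (x, y) -> x - y);

        System.out.println(sumSeq == sumPar); // true
        System.out.println(gcdSeq == gcdPar); // true
        System.out.println(subSeq == subPar); // almost certainly false
    }
}
```

Note that `0` works as the identity for the `gcd` reduction because `gcd(0, x) == x`.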
- Two different (parallel) streams were tested:
  - Mapping the `Integer`s to their doubles as `map(x -> 2 * x)`, then collecting the stream into a list
  - Mapping the `Integer`s as `map(x -> x^EXP mod MOD)`, where `EXP` is 80688 (the first 5 digits of RSA-100’s second half) and `MOD` is 1522605027 (the first 10 digits of RSA-100’s first half), then collecting the stream into a list
    - Note that for parallel streams over unordered sources (e.g., a `HashSet`) the result list has an unpredictable order w.r.t. the input (the collector collects whatever comes out of any of the pipes of the parallel stream)
- As one would guess after all the above, doubling should not be parallel for small numbers of inputs: 145 vs 3'139 for 10; 82 vs 577 for 100; 41 vs 45 for 1'000; 6.166 vs 5.112 for 10'000; 0.904 vs 0.600 for 100'000.
- As one would guess after all the above, parallelising the modular exponentiation makes sense pretty early: 63 vs 81 for 10; 15 vs 6 for 100; 1.435 vs 1.004 for 1'000; 0.169 vs 0.088 for 10'000; 0.020 vs 0.012 for 100'000.
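A sketch of the second mapping, with the constants quoted above (modular exponentiation by squaring keeps every intermediate value of `x^EXP mod MOD` within a `long`, since `MOD^2` fits). It also illustrates the ordering point: for an ordered source (a `List`) collected with `toList()`, even the parallel result keeps the encounter order; the unpredictability applies to unordered sources such as `HashSet`:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class MapCollect {
    static final long EXP = 80_688L;          // first 5 digits of RSA-100's second half
    static final long MOD = 1_522_605_027L;   // first 10 digits of RSA-100's first half

    // Modular exponentiation by squaring: O(log e) multiplications,
    // each intermediate product below MOD^2 and thus within a long.
    static long powMod(long x, long e, long m) {
        long result = 1, base = Math.floorMod(x, m);
        while (e > 0) {
            if ((e & 1) == 1) result = result * base % m;
            base = base * base % m;
            e >>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> input = IntStream.rangeClosed(1, 10_000).boxed()
                .collect(Collectors.toList());

        List<Long> seq = input.stream()
                .map(x -> powMod(x, EXP, MOD))
                .collect(Collectors.toList());
        List<Long> par = input.parallelStream()
                .map(x -> powMod(x, EXP, MOD))
                .collect(Collectors.toList());

        // Ordered source + toList() preserves encounter order even in parallel.
        System.out.println(seq.equals(par)); // true
    }
}
```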
- Keep in mind that we are talking about microbenchmarks here. The numbers above roughly mean that, for example, a non-parallel stream can do about 1'200 such modular exponentiations per millisecond. Everything is relative; this is just a performance benchmark. After a while the overhead of the parallel stream framework becomes smaller than the work done by the threads, but one should still consider whether this gain is worth the extra thought, time and risk (see the linked articles for the risks…).
- Also keep in mind that all the numbers listed above are for `ArrayList`s; for more complex data structures, the parallel streams take over earlier, or the ratios differ. E.g., for a `MultiSet` the modular exponentiation for 100 `Integer`s gets a score of 2.589 on a single thread and 10.705 on parallel streams; that’s a factor of about four for 100 numbers. That factor is 22 for 100'000 numbers… and here I refer to the point above: we are talking about 40 vs. 880 milliseconds here for digesting 100'000 data points.
Don’t even think about parallel streams unless you have proof that one of your streams is a real bottleneck. Even then, first look around whether you can streamline the surroundings of the stream so that it does not become a bottleneck. Only if both conditions hold should you consider using parallel streams in production. And even then, be careful.
And of course I close with D. Knuth:
> (For rookies) Don’t optimize.
> (For gurus) Don’t optimize, yet.
just programmers' folklore