This Trident topology emits the DRPC argument to a splitter, which then emits the argument 1,000 times to a family of 100 bolts; each bolt emits 5,000 results per tuple it receives, and a final aggregator increments a count for each result. With these numbers, the expected DRPC result is 1000 * 5000 = 5,000,000.
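For concreteness, here is the fan-out modeled in plain Java (not Storm code; the class and constant names are just illustrative) to confirm the arithmetic:

```java
// Plain-Java model of the topology's fan-out, not actual Storm/Trident code.
public class FanOutModel {
    static final int SPLITS = 1000;            // splitter emits the argument 1000 times
    static final int RESULTS_PER_SPLIT = 5000; // each downstream function emits 5000 results

    // The final aggregator simply counts every result it sees.
    public static long expectedCount() {
        long count = 0;
        for (int i = 0; i < SPLITS; i++) {             // one pass per split tuple
            for (int j = 0; j < RESULTS_PER_SPLIT; j++) {
                count++;                               // aggregator increment per result
            }
        }
        return count; // 1000 * 5000 = 5,000,000
    }

    public static void main(String[] args) {
        System.out.println(expectedCount()); // prints 5000000
    }
}
```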
While this program seems like it should be simple and fast, it's not: it spends 75% of its time waiting on the LMAX disruptor, so just emitting tuples becomes the clear bottleneck. Here's a snippet of logged times from trying to emit 5k tuples:
2013-08-19 14:31:01.056 [Thread-167-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=3,082.48
2013-08-19 14:31:01.057 [Thread-163-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=2,946.93
2013-08-19 14:31:01.057 [Thread-145-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=2,962.79
2013-08-19 14:31:01.057 [Thread-187-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=69.27
2013-08-19 14:31:01.059 [Thread-193-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=3,392.72
2013-08-19 14:31:01.061 [Thread-223-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=3,173.38
2013-08-19 14:31:01.064 [Thread-143-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=7.83
2013-08-19 14:31:01.065 [Thread-64-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=2,942.67
2013-08-19 14:31:01.066 [Thread-167-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=9.24
2013-08-19 14:31:01.066 [Thread-193-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=7.25
2013-08-19 14:31:01.068 [Thread-139-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=2,951.14
2013-08-19 14:31:01.068 [Thread-223-b-4] [DEBUG] i.k.s.d.b.PerformanceTestDrpcStream EachFunctionExecute, t=7.53
This results in a response time of 25 seconds to process 5m tuples, i.e. 200k tuples/second (local cluster, 1 worker, 8G heap).
The disruptor waiting gets worse with parallelism: with 10 EachFunctions the waits are usually in the hundreds of milliseconds, but with 100 they run to multiple seconds (up to 8 or 9).
If, instead of emitting 5000 tuples, I emit 1 tuple carrying a list of 5000 objects, the response time for the entire call drops to 1-2 seconds.
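That workaround can be sketched with a plain-Java stand-in for the downstream queue (NOT Storm API; `BatchedEmit` and the method names are hypothetical) to show why it helps: the number of queue handoffs, and therefore chances to wait on the disruptor, drops from n to 1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Plain-Java stand-in for the tuple handoff queue, not Storm code.
public class BatchedEmit {

    // One handoff per result: n queue operations, n chances to block.
    public static int emitIndividually(Queue<Object> queue, int n) {
        for (int i = 0; i < n; i++) {
            queue.offer("result-" + i);
        }
        return n; // number of handoffs
    }

    // Pack all n results into one list and hand it off once.
    public static int emitBatched(Queue<Object> queue, int n) {
        List<String> batch = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            batch.add("result-" + i);
        }
        queue.offer(batch);
        return 1; // a single handoff
    }

    public static void main(String[] args) {
        Queue<Object> queue = new ArrayDeque<>();
        System.out.println(emitIndividually(queue, 5000)); // prints 5000
        System.out.println(emitBatched(queue, 5000));      // prints 1
    }
}
```

The trade-off is that the receiving function has to unpack the list itself, and per-tuple acking/replay granularity is lost.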
I've tried changing the LMAX queue settings per http://www.michael-noll.com/blog/2013/06/21/understanding-storm-internal-message-buffers/ with little effect. I've also run this topology in plain Storm (not Trident), and the same problem occurs.
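For reference, these are the buffer overrides from that post, set via the Storm `Config` (Storm 0.9-era `backtype.storm.Config` keys; the values are just the post's tuning example, not a recommendation):

```java
import backtype.storm.Config;

Config conf = new Config();
// Per-executor disruptor buffers (sized in batches of tuples; ring sizes are powers of 2).
conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
// Worker-level queues.
conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);
conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
```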
Are my expectations about emit rates just way off (i.e. is it normal for 5k small strings to take multiple seconds)? If so, that's fine; I just want to make sure I'm not missing something obvious here.