amontalenti/Storm-Topology-Freeze-Issues.md

## client-closed.txt
Note: This exception happened after killing one of the nodes and restarting it.

java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
 at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.disruptor$consume_loop_STAR_$fn__751.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.util$async_loop$fn__450.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
 at java.lang.Thread.run(Thread.java:744) [na:1.7.0_55]
Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
 at backtype.storm.messaging.netty.Client.send(Client.java:194) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5035$fn__5036.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5035.invoke(worker.clj:320) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.disruptor$clojure_handler$reify__738.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
 ... 6 common frames omitted
2014-06-20 04:41:06 b.s.util [INFO] Halting process: ("Async loop died!")

## connection-timeout.txt
Note: This exception happened after the cluster was running for a little while; same stack trace happened for basically all workers:

2014-06-20 04:59:53 b.s.m.n.StormClientErrorHandler [INFO] Connection failed Netty-Client-ue1a-storm5.lan.cogtree.com/10.22.0.141:6711
java.io.IOException: Connection timed out
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
 at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_55]
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
 at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:322) ~[netty-3.2.2.Final.jar:na]
 at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:281) ~[netty-3.2.2.Final.jar:na]
 at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:201) ~[netty-3.2.2.Final.jar:na]
 at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46) [netty-3.2.2.Final.jar:na]
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
 at java.lang.Thread.run(Thread.java:744) [na:1.7.0_55]

## stacktraces.zip

      
    Raw
  

              stacktraces.zip
            
          
            View raw
        
    
## Storm-Topology-Freeze-Issues.md

      
    Raw
  

              Storm-Topology-Freeze-Issues.md
            
          
    Conditions

Two Storm supervisors running in a 3-node configuration: one "head" node
(nimbus) and two worker nodes.  Each supervisor node runs a supervisor under
supervision.
Topology

Multi-lang topology using Python and our streamparse library. Works fine when
running on a single machine. Spout reads off Kafka and inserts tuples into
topology, other bolts are doing various analysis with latencies from 10ms -
3,000ms. Spout parallelism of 4-8 and max.spout.pending value of 500-1000.
We have been testing the topology for months using a two-node cluster, single
supervisor plus nimbus. We have tweaked worker parallelism settings; on a
single supervisor node, we often have 8-12 worker JVM processes running.  And
the topology handled this fine.
The second you try to run the topology on multiple nodes, however, lots of
weird behavior starts happening. Instead of a steady stream of about 1-2k reads
per second from Kafka, we get this spiky behavior where 1k reads happen, but
then the topology enters a frozen state, and no more tuples get pushed through
the spout.
view image
Versions

This seems to happen in both 0.9.1-incubating and the new 0.9.2-incubating
release candidate. It seems to affect both ZeroMQ and Netty transports.
Stack Traces

Download the stack traces in stacktraces.zip below.
Included with this report are stack traces gathered from all the worker JVMs (+
supervisor) at two states on each supervisor node.
The first state (dump1) is when I see the spout has stopped sending tuples
and failures have started showing up in the Storm UI.
The second state (dump2) is when I decided to forcibly deactivate the Spout
via the Storm UI, wait about 10 seconds, and then run this.
I then experimented a little and killed the supervisor on the second node. I
then watched as new workers spun up on the first node. And then I proceeded to
see how the topology was still frozen. So, I took a stack trace of all the pids
on this node only (supervisor1).  That is dump3.
Machine characteristics

The most interesting thing about this is that the machines aren't at 0% CPU or
frozen. Instead, a few of the bolts are still doing a little work, and the JVM
worker processes are at 20% CPU usage each. But the load is nowhere near where
it should be if this thing were running full speed.
view image
The other odd part is that when I deactivated the spout, nothing about the
load of the machine changed.  Normally, when the cluster is healthy,
deactivating the spout has the effect of dropping the load dramatically,
especially on the JVM worker machines. But in this case, it basically had no
effect. I thought that odd enough to take the second set of stack trace dumps
(dump2).
It was at that point that I had the idea to force the workers to move to a
single node and that I snapshotted dump3 while noting that the machine load
had stayed pretty much the same (20% CPU per worker JVM and basically no CPU
usage at the spouts).
How Stack Traces Were Taken

I used a combination of jps and jstack. For example:
sudo jps -l | grep -v 'Jps' | cut -f 1 -d' ' >pids
for pid in `cat pids`; do
    sudo jstack -F $pid >$pid.dump;
done

UPDATE: FIXED

The issue had nothing to do with Storm and everything to do with Ubuntu 14.04
and its interaction with Xen network kernel drivers in EC2.
I was staring at the results of this research and thinking, "What could possibly
cause the network subsystem of Storm to just hang?"
My first impulse: firewalls. Maybe as the network was ramping up, I was hitting
up against a firewall rule?
I checked our munin monitoring graphs and noticed a bunch of eth0 errors correlated
with our topologies running. I checked our production Storm 0.8.2 cluster -- no errors.
Ah hah! It must be firewall rules or something!
That led me to run dmesg on the supervisor nodes. I found a bunch of entries like
this:
xen_netfront: xennet: skb rides the rocket: 20 slots
xen_netfront: xennet: skb rides the rocket: 19 slots

That's odd. I also saw some entries related to ufw (Ubuntu's firewall service). So,
I try running ufw disable. No change.
I then dig in more to these error messages and I come across this open bug on Launchpad:
https://bugs.launchpad.net/ubuntu/+source/linux-lts-raring/+bug/1195474
I dig in there and come across the current workaround, running:
sudo ethtool -K eth0 sg off

On the server. I issue that command, restart my topology, and VOILA, it's now running
at full performance.
Back in my earliest days as a professional programmer, I had a friend named Jimmy. I
once spent 3 days debugging a JVM garbage collection issue with him. We ran profilers,
did detailed code traces, extensive logging, etc. And in the end, the fix to the problem
was a single line of code change -- a mistaken allocation of an expensive object that was
happening in a tight loop. At that moment, I coined "Jimmy's Law", which is:
"The amount of time it takes to discover a bug's fix is inversely proportional
to the lines of code changed by the fix, with infinite time converging to one
line."
After hours of investigating and debugging this issue, that's certainly how I feel.
Shame on me for upgrading my Storm cluster and Ubuntu version simultaneously!
	Note: This exception happened after killing one of the nodes and restarting it.

	java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
	at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.disruptor$consume_loop_STAR_$fn__751.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.util$async_loop$fn__450.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_55]
	Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
	at backtype.storm.messaging.netty.Client.send(Client.java:194) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5035$fn__5036.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5035.invoke(worker.clj:320) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.disruptor$clojure_handler$reify__738.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
	... 6 common frames omitted
	2014-06-20 04:41:06 b.s.util [INFO] Halting process: ("Async loop died!")
	Note: This exception happened after the cluster was running for a little while; same stack trace happened for basically all workers:

	2014-06-20 04:59:53 b.s.m.n.StormClientErrorHandler [INFO] Connection failed Netty-Client-ue1a-storm5.lan.cogtree.com/10.22.0.141:6711
	java.io.IOException: Connection timed out
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
	at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_55]
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:322) ~[netty-3.2.2.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:281) ~[netty-3.2.2.Final.jar:na]
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:201) ~[netty-3.2.2.Final.jar:na]
	at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46) [netty-3.2.2.Final.jar:na]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
	at java.lang.Thread.run(Thread.java:744) [na:1.7.0_55]