Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Storm 0.9.2 exceptions related to multiple workers/supervisors
Note: This exception happened after killing one of the nodes and restarting it.
java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more
at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:128) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:99) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:80) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.disruptor$consume_loop_STAR_$fn__751.invoke(disruptor.clj:94) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.util$async_loop$fn__450.invoke(util.clj:431) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at clojure.lang.AFn.run(AFn.java:24) [clojure-1.5.1.jar:na]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_55]
Caused by: java.lang.RuntimeException: Client is being closed, and does not take requests any more
at backtype.storm.messaging.netty.Client.send(Client.java:194) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.utils.TransferDrainer.send(TransferDrainer.java:54) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5035$fn__5036.invoke(worker.clj:322) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.daemon.worker$mk_transfer_tuples_handler$fn__5035.invoke(worker.clj:320) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.disruptor$clojure_handler$reify__738.onEvent(disruptor.clj:58) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:125) ~[storm-core-0.9.2-incubating.jar:0.9.2-incubating]
... 6 common frames omitted
2014-06-20 04:41:06 b.s.util [INFO] Halting process: ("Async loop died!")
Note: This exception happened after the cluster was running for a little while; same stack trace happened for basically all workers:
2014-06-20 04:59:53 b.s.m.n.StormClientErrorHandler [INFO] Connection failed Netty-Client-ue1a-storm5.lan.cogtree.com/10.22.0.141:6711
java.io.IOException: Connection timed out
at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.7.0_55]
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.7.0_55]
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.7.0_55]
at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.7.0_55]
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379) ~[na:1.7.0_55]
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:322) ~[netty-3.2.2.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:281) ~[netty-3.2.2.Final.jar:na]
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:201) ~[netty-3.2.2.Final.jar:na]
at org.jboss.netty.util.internal.IoWorkerRunnable.run(IoWorkerRunnable.java:46) [netty-3.2.2.Final.jar:na]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
at java.lang.Thread.run(Thread.java:744) [na:1.7.0_55]

Conditions

Two Storm supervisors running in a 3-node configuration: one "head" node (nimbus) and two worker nodes. Each supervisor node runs a supervisor under supervision.

Topology

Multi-lang topology using Python and our streamparse library. Works fine when running on a single machine. Spout reads off Kafka and inserts tuples into topology, other bolts are doing various analysis with latencies from 10ms - 3,000ms. Spout parallelism of 4-8 and max.spout.pending value of 500-1000.

We have been testing the topology for months using a two-node cluster, single supervisor plus nimbus. We have tweaked worker parallelism settings; on a single supervisor node, we often have 8-12 worker JVM processes running. And the topology handled this fine.

The second you try to run the topology on multiple nodes, however, lots of weird behavior starts happening. Instead of a steady stream of about 1-2k reads per second from Kafka, we get this spiky behavior where 1k reads happen, but then the topology enters a frozen state, and no more tuples get pushed through the spout.

view image

Versions

This seems to happen in both 0.9.1-incubating and the new 0.9.2-incubating release candidate. It seems to affect both ZeroMQ and Netty transports.

Stack Traces

Download the stack traces in stacktraces.zip below.

Included with this report are stack traces gathered from all the worker JVMs (+ supervisor) at two states on each supervisor node.

The first state (dump1) is when I see the spout has stopped sending tuples and failures have started showing up in the Storm UI.

The second state (dump2) is when I decided to forcibly deactivate the Spout via the Storm UI, wait about 10 seconds, and then run this.

I then experimented a little and killed the supervisor on the second node. I then watched as new workers spun up on the first node. And then I proceeded to see how the topology was still frozen. So, I took a stack trace of all the pids on this node only (supervisor1). That is dump3.

Machine characteristics

The most interesting thing about this is that the machines aren't at 0% CPU or frozen. Instead, a few of the bolts are still doing a little work, and the JVM worker processes are at 20% CPU usage each. But the load is nowhere near where it should be if this thing were running full speed.

view image

The other odd part is that when I deactivated the spout, nothing about the load of the machine changed. Normally, when the cluster is healthy, deactivating the spout has the effect of dropping the load dramatically, especially on the JVM worker machines. But in this case, it basically had no effect. I thought that odd enough to take the second set of stack trace dumps (dump2).

It was at that point that I had the idea to force the workers to move to a single node and that I snapshotted dump3 while noting that the machine load had stayed pretty much the same (20% CPU per worker JVM and basically no CPU usage at the spouts).

How Stack Traces Were Taken

I used a combination of jps and jstack. For example:

sudo jps -l | grep -v 'Jps' | cut -f 1 -d' ' >pids
for pid in `cat pids`; do
    sudo jstack -F $pid >$pid.dump;
done

UPDATE: FIXED

The issue had nothing to do with Storm and everything to do with Ubuntu 14.04 and its interaction with Xen network kernel drivers in EC2.

I was staring at the results of this research and thinking, "What could possibly cause the network subsystem of Storm to just hang?"

My first impulse: firewalls. Maybe as the network was ramping up, I was hitting up against a firewall rule?

I checked our munin monitoring graphs and noticed a bunch of eth0 errors correlated with our topologies running. I checked our production Storm 0.8.2 cluster -- no errors. Ah hah! It must be firewall rules or something!

That led me to run dmesg on the supervisor nodes. I found a bunch of entries like this:

xen_netfront: xennet: skb rides the rocket: 20 slots
xen_netfront: xennet: skb rides the rocket: 19 slots

That's odd. I also saw some entries related to ufw (Ubuntu's firewall service). So, I try running ufw disable. No change.

I then dig in more to these error messages and I come across this open bug on Launchpad:

https://bugs.launchpad.net/ubuntu/+source/linux-lts-raring/+bug/1195474

I dig in there and come across the current workaround, running:

sudo ethtool -K eth0 sg off

On the server. I issue that command, restart my topology, and VOILA, it's now running at full performance.

Back in my earliest days as a professional programmer, I had a friend named Jimmy. I once spent 3 days debugging a JVM garbage collection issue with him. We ran profilers, did detailed code traces, extensive logging, etc. And in the end, the fix to the problem was a single line of code change -- a mistaken allocation of an expensive object that was happening in a tight loop. At that moment, I coined "Jimmy's Law", which is:

"The amount of time it takes to discover a bug's fix is inversely proportional to the lines of code changed by the fix, with infinite time converging to one line."

After hours of investigating and debugging this issue, that's certainly how I feel. Shame on me for upgrading my Storm cluster and Ubuntu version simultaneously!

@staslev
Copy link

staslev commented Aug 27, 2014

Nice catch.

We had the same exception on a cluster with NIC bonded nodes, where one of the bonded NICs was mistakenly configured to 100Mbps instead of 1000Mbps, which was getting it saturated and ultimately causing the infamous [ERROR] Async loop died! java.lang.RuntimeException: java.lang.RuntimeException: Remote address is not reachable. We will close this client Netty-Client**** and java.lang.RuntimeException: java.lang.RuntimeException: Client is being closed, and does not take requests any more exceptions, neither of which Storm recovers from.

@bartvercammen
Copy link

bartvercammen commented Oct 8, 2014

Shame on me for upgrading my Storm cluster and Ubuntu version simultaneously!

I made exactly the same mistake ;-)

Thanks for the provided work around and problem analysis.
This really saved my day.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment