@baontq
Last active August 29, 2015 14:08
(WIP) Going to production - Apache Storm
Some lessons we learnt while pushing our topologies to a production environment.
Apache Storm
http://storm.apache.org/
https://github.com/apache/storm.git
Some notes are specific to Trident on Storm 0.9.2.
---------------
Error handling
Excellent notes: http://svendvanderveken.wordpress.com/2014/02/05/error-handling-in-storm-trident-topologies/
This must be addressed in the upfront design; unfortunately that is not always the case. When an uncaught exception is thrown, Storm's default behavior is to kill the worker JVM and restart it. A bad message in a Kafka queue will then be replayed over and over. This can block the whole pipeline, and it is not straightforward to skip the bad message in the queue without a code change.
"worker died"
The rule is
```
All alternative cases must be handled.
```
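One way to read the rule above: no single malformed message should be able to throw out of the processing code and take the worker down. The sketch below is plain Java with no Storm dependency; `SafeMessageParser` and its `rejected` list are hypothetical names used only for illustration, and the "message is a number" format is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper: parses one raw message and never lets an exception escape. */
class SafeMessageParser {
    // Bad messages are set aside here; a real topology might write them
    // to a log or a separate queue instead (assumption, not from the notes).
    private final List<String> rejected = new ArrayList<>();

    /** Returns the parsed value, or empty if the message is malformed. */
    Optional<Long> parse(String raw) {
        try {
            return Optional.of(Long.parseLong(raw.trim()));
        } catch (RuntimeException e) {
            // Covers NumberFormatException and the NullPointerException
            // raised by a null message.
            rejected.add(raw == null ? "<null>" : raw);
            return Optional.empty();
        }
    }

    List<String> rejected() {
        return rejected;
    }
}
```

In a Trident function the same try/catch would presumably wrap the body of `execute`, emitting a tuple only when parsing succeeds, so the bad message is acknowledged and skipped rather than replayed forever.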
---------------
Progress/status checking
Counting and tracking progress is not as simple as it sounds in a distributed environment.
---------------
All functions/operations in a topology should be idempotent; this is another challenge.
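A minimal sketch of one common way to make a counter update idempotent: store the transaction id of the last applied batch next to the value, and skip the update when the same id arrives again. This assumes batches are applied in order and that a replayed batch carries the same id and the same tuples (the guarantee a transactional spout is meant to give). `TxCounterStore` is a hypothetical in-memory stand-in for a real state backend.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory store; a real topology would persist this externally. */
class TxCounterStore {
    private static final class Entry {
        final long txId;   // id of the last batch applied to this key
        final long count;  // counter value after that batch

        Entry(long txId, long count) {
            this.txId = txId;
            this.count = count;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Adds delta to the counter for key unless this batch was already applied. */
    void applyBatch(String key, long txId, long delta) {
        Entry current = entries.get(key);
        if (current != null && current.txId == txId) {
            return; // replay of the last applied batch: do nothing
        }
        long base = current == null ? 0L : current.count;
        entries.put(key, new Entry(txId, base + delta));
    }

    long count(String key) {
        Entry e = entries.get(key);
        return e == null ? 0L : e.count;
    }
}
```

Applying batch 1 twice and then batch 2 gives the same count as applying each once, which is also what makes progress counters trustworthy after a replay.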
---------------
Memory leak.
This is quite easy to catch with early planning. Running the topology in local mode, throwing millions of messages at it over and over, and monitoring the JVM's health should be good enough to identify the issue.
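The soak test described above can be approximated in plain Java with the standard `java.lang.management` API. In this sketch `HeapSoakTest` and its `work` callback are hypothetical; the callback stands in for whatever pushes one message through the code under test.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.function.IntConsumer;

/** Hypothetical harness: repeats a workload and samples used heap after each round. */
class HeapSoakTest {
    /** Returns used heap in bytes, sampled once after every round. */
    static long[] run(int rounds, int messagesPerRound, IntConsumer work) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        long[] usedAfterRound = new long[rounds];
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < messagesPerRound; i++) {
                work.accept(i);
            }
            // Request a GC so the samples are roughly comparable between rounds;
            // the JVM is free to ignore the request.
            memory.gc();
            usedAfterRound[r] = memory.getHeapMemoryUsage().getUsed();
        }
        return usedAfterRound;
    }
}
```

A series that keeps climbing round after round, instead of levelling off, is a hint that something is holding on to references. Tools such as jvisualvm or a heap dump would then be needed to find the culprit.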
---------------
Latency of external services
---------------
Doubts
The first time you run a Storm cluster in your own environment will raise a lot of doubts, especially in a complicated infrastructure with Zk, Kafka, C*, ES, Storm, and some external services. Tweaking the cluster configuration at the same time can make the situation even more complicated. Sometimes Storm cannot recover its workers and a bad-practice workaround has to be applied (stop the cluster, clear the queue, clear the state, ...).
Rule: understand batch size, the life cycle of TridentState/Functions/QueryFunction, and the anti-patterns (shared state, static variables, ...).
---------------
Storm performance tuning
Excellent notes: https://gist.github.com/mrflip/5958028