(WIP) Going to production - Apache Storm
Some lessons we learnt while pushing our topologies to a production environment.
Apache Storm
http://storm.apache.org/
https://github.com/apache/storm.git
Some notes are specific to Trident on Storm 0.9.2.
---------------
Error handling
Excellent notes: http://svendvanderveken.wordpress.com/2014/02/05/error-handling-in-storm-trident-topologies/
This must be addressed in the upfront design; unfortunately that is not always the case. By default, when an uncaught exception is thrown, Storm kills the worker JVM and restarts it. A bad message in a Kafka queue will then be replayed over and over. This can block the whole pipeline, and skipping the bad message in the queue is not straightforward without a code change.
The symptom to look for is "worker died".
The rule is
```
All alternative cases must be handled.
```
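One way to apply the rule, sketched in plain Java without any Storm dependency: wrap the per-message logic so that a failure sets the message aside instead of escaping as an uncaught exception. `SafeProcessor` and its in-memory `deadLetters` list are hypothetical names for illustration only; in a real topology the rejected messages would more likely go to a log or a separate queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical wrapper: runs per-message logic and sets bad messages aside instead of rethrowing. */
class SafeProcessor<I, O> {
    private final Function<I, O> logic;
    // Stand-in for a real dead-letter queue or error log (assumption for this sketch).
    private final List<I> deadLetters = new ArrayList<>();

    SafeProcessor(Function<I, O> logic) {
        this.logic = logic;
    }

    /** Returns the result, or an empty Optional when the message was rejected. */
    Optional<O> process(I message) {
        try {
            return Optional.ofNullable(logic.apply(message));
        } catch (RuntimeException e) {
            // Record the bad message so it is not replayed forever, then carry on.
            deadLetters.add(message);
            return Optional.empty();
        }
    }

    List<I> deadLetters() {
        return deadLetters;
    }
}
```

For example, `new SafeProcessor<String, Integer>(Integer::parseInt)` returns a value for `"42"` and records `"oops"` as a dead letter rather than throwing.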
---------------
Progress/status checking
Counting and tracking progress is not as simple as it sounds in a distributed environment.
---------------
All functions/operations in a topology should be idempotent, because messages can be replayed; this is another challenge.
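A minimal sketch of what idempotent can mean for a counter, in plain Java: remember the id of the last batch applied to each key and ignore a replay of that batch. `IdempotentCounter` is a hypothetical class, and the sketch assumes that batch ids only increase and that a replayed batch carries the same contents as the original.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical counter whose updates are safe to replay. */
class IdempotentCounter {
    private static final class Entry {
        long lastBatchId = -1; // no batch applied yet
        long count = 0;
    }

    private final Map<String, Entry> entries = new HashMap<>();

    /** Applies delta at most once per (key, batchId); a replayed or older batch is ignored. */
    void apply(String key, long batchId, long delta) {
        Entry e = entries.computeIfAbsent(key, k -> new Entry());
        if (batchId <= e.lastBatchId) {
            return; // already applied, so replaying changes nothing
        }
        e.count += delta;
        e.lastBatchId = batchId;
    }

    long get(String key) {
        Entry e = entries.get(key);
        return e == null ? 0 : e.count;
    }
}
```

Applying batch 1 twice and then batch 2 gives the same total as applying each batch exactly once.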
---------------
Memory leak
This is quite easy to catch with early planning. Running the topology in local mode, throwing a million messages at it over and over, and monitoring the JVM's health should be good enough to identify the issue.
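As a rough sketch of the monitoring part, the standard `java.lang.management` API can report heap usage from inside the JVM while the load test runs. `HeapTrend` and its `keepsGrowing` check are hypothetical helpers; a steadily rising series of samples is only a hint that something leaks, not proof.

```java
import java.lang.management.ManagementFactory;

/** Hypothetical helper for eyeballing heap growth during a local-mode load test. */
class HeapTrend {
    /** Heap bytes currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** True when every sample is larger than the one before it. */
    static boolean keepsGrowing(long[] samples) {
        if (samples.length < 2) {
            return false; // not enough data to call it a trend
        }
        for (int i = 1; i < samples.length; i++) {
            if (samples[i] <= samples[i - 1]) {
                return false;
            }
        }
        return true;
    }
}
```

One possible use is to record `usedHeapBytes()` after each round of messages and check `keepsGrowing` over the recorded samples.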
---------------
Latency of external services
---------------
Doubts
Running a Storm cluster in your own environment for the first time will raise a lot of doubts, especially in a complicated infrastructure with ZooKeeper (Zk), Kafka, Cassandra (C*), Elasticsearch (ES), Storm, and some external services. Tweaking the cluster configuration at the same time can complicate the situation further. Sometimes Storm cannot recover workers and a bad operational habit has to be applied (stop the cluster, clear the queue, clear the state, ...).
Rule: pay attention to batch size, the TridentState/Functions/QueryFunction life cycle, and anti-patterns (shared state, static variables, ...).
---------------
Storm performance tuning
Excellent notes: https://gist.github.com/mrflip/5958028