baontq/gist:a024d1377806ea10f7c9

## gistfile1.txt
Some lessons we learnt while pushing our topologies to a production environment.

Apache Storm
    http://storm.apache.org/
    https://github.com/apache/storm.git

Some notes are specific to Trident on Storm 0.9.2.

---------------
Error handling
    Excellent notes: http://svendvanderveken.wordpress.com/2014/02/05/error-handling-in-storm-trident-topologies/

    Error handling must be designed in up front; unfortunately that is not always the case. By default, when an uncaught exception is thrown, Storm kills the worker JVM and restarts it. A bad message in a Kafka queue will then be replayed over and over. This can block the whole pipeline, and it is not straightforward to skip the bad message in the queue without a code change.


    "worker died"

    The rule is
    ```
    All alternative cases must be handled.
    ```
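One way to read the rule above is that a bad message should be caught, recorded, and skipped rather than allowed to throw out of the processing code. The block below is a minimal sketch of that idea in plain Java, with no Storm dependency. `SafeProcessor` and its dead-letter list are hypothetical names invented for illustration; in a real topology the same try/catch would sit inside the function or bolt, and the bad message would go to a log or a separate queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper, not part of Storm: wraps per-message logic so that a
// bad message is recorded and skipped instead of propagating an exception.
final class SafeProcessor<I, O> {
    private final Function<I, O> logic;
    private final List<I> deadLetters = new ArrayList<>();

    SafeProcessor(Function<I, O> logic) {
        this.logic = logic;
    }

    // Returns the results for the good messages; bad ones are kept in deadLetters.
    List<O> processAll(List<I> messages) {
        List<O> results = new ArrayList<>();
        for (I message : messages) {
            try {
                results.add(logic.apply(message));
            } catch (RuntimeException e) {
                // Handle the "alternative case": keep the message for later
                // inspection and carry on with the rest of the batch.
                deadLetters.add(message);
            }
        }
        return results;
    }

    List<I> deadLetters() {
        return deadLetters;
    }
}
```

For example, a `SafeProcessor<String, Integer>` built around `Integer::parseInt` would return the parsed numbers and keep any non-numeric input in `deadLetters()`.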

---------------
Progress/status checking
    Counting and tracking progress is not as simple as it sounds in a distributed environment.

---------------
All functions/operations in a topology should be idempotent, because messages can be replayed; this is another challenge.
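A common way to make an update idempotent is to store the id of the last applied batch next to the value and ignore a batch that was already applied. The sketch below shows this in plain Java. `IdempotentCounter` is a hypothetical in-memory class, not a Storm API, and it assumes that batches for a key arrive in increasing transaction-id order. Trident's transactional state follows a similar idea, but this is not its implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store, for illustration only. Each key keeps its
// count together with the id of the last batch that was applied to it.
final class IdempotentCounter {
    private static final class Entry {
        long count;
        long lastTxId = -1;
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Applies delta at most once per (key, txId); replaying a batch is a no-op.
    // Assumption: batches for a key arrive in increasing txId order.
    void apply(String key, long txId, long delta) {
        Entry e = entries.computeIfAbsent(key, k -> new Entry());
        if (txId <= e.lastTxId) {
            return; // this batch was already applied
        }
        e.count += delta;
        e.lastTxId = txId;
    }

    long get(String key) {
        Entry e = entries.get(key);
        return e == null ? 0 : e.count;
    }
}
```

With this shape, replaying batch 1 after a worker restart leaves the count unchanged, while batch 2 is still applied normally.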

---------------
Memory leak.
    This is quite easy to catch with early planning. Running the topology in local mode, throwing millions of messages at it over and over, and monitoring the JVM's health should be good enough to identify the issue.
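As a rough illustration of "monitoring the JVM's health", the sketch below samples the used heap between rounds of messages and flags the case where it only ever grows. `HeapWatch` and its tolerance parameter are hypothetical; the check is a crude heuristic, and a profiler or the JVM's own tools (for example `jstat` or a heap dump) would give a more reliable picture.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for a local soak test; not part of Storm.
final class HeapWatch {
    private final List<Long> samples = new ArrayList<>();

    // Used heap right now, in bytes. A GC is requested first so that
    // collectable garbage is less likely to be counted (the JVM may ignore it).
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Call this after each round of test messages.
    void sample() {
        samples.add(usedHeapBytes());
    }

    List<Long> samples() {
        return samples;
    }

    // True if every sample exceeds the previous one by more than toleranceBytes.
    // Heap that grows on every round and never drops back suggests a leak.
    static boolean keepsGrowing(List<Long> samples, long toleranceBytes) {
        if (samples.size() < 2) {
            return false;
        }
        for (int i = 1; i < samples.size(); i++) {
            if (samples.get(i) - samples.get(i - 1) <= toleranceBytes) {
                return false;
            }
        }
        return true;
    }
}
```

In a local-mode run, one would call `sample()` after each round of messages and check `keepsGrowing(watch.samples(), tolerance)` at the end.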

---------------
Latency of external services

---------------
Doubts
Running a Storm cluster in your own environment for the first time will raise a lot of doubts, especially in a complicated infrastructure with Zk, Kafka, C*, ES, Storm, and some external services. Tweaking the cluster configuration at the same time can make the situation even more complicated. Sometimes Storm cannot recover its workers and a bad-habit operation has to be applied (stop the cluster, clear the queue, clear the state, ...).

Rule: understand batch size, the life cycle of TridentState/Functions/QueryFunction, and the anti-patterns (shared state, static variables, ...).
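The static-variable anti-pattern is easy to show in plain Java. Several instances of the same function can run inside one worker JVM, so a static field is silently shared between them, while an instance field set up in a prepare step is not. The two classes below are hypothetical stand-ins for a Trident function, written without any Storm dependency; the `prepare`/`execute` names only mirror the usual life cycle.

```java
// Anti-pattern: the counter is static, so every instance in the same JVM
// updates the same value (and without synchronization it is also racy).
final class StaticCounterFunction {
    static int seen = 0;

    void execute(String tuple) {
        seen++;
    }
}

// Preferred shape: state lives in the instance and is initialised in prepare(),
// so each instance only ever sees its own tuples.
final class InstanceCounterFunction {
    private int seen;

    void prepare() {
        seen = 0;
    }

    void execute(String tuple) {
        seen++;
    }

    int seen() {
        return seen;
    }
}
```

If two instances of each class process one tuple each, the static counter ends up at 2 while each instance counter reads 1.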

---------------
Storm performance tuning
Excellent notes: https://gist.github.com/mrflip/5958028