How to install and use Storm on CentOS 6.4
  • Download axel (a download accelerator) :)

  • Download and install Maven (mvn)

  • Download storm from https://storm.incubator.apache.org/downloads.html

  • cd apache-storm-0.9.2-incubating/examples/storm-starter

  • mvn compile exec:java -Dstorm.topology=storm.starter.WordCountTopology

  • Lots of output tuples of the form

    • Emitting: count default [nature, 49]

  • mvn compile exec:java -Dstorm.topology=storm.starter.BracketTopology -Dexec.args="arg1"

  • Hadoop MapReduce concepts <-> Storm concepts

  • A Storm cluster has a master node, worker nodes, and a co-ordination cluster (ZooKeeper)

  • JobTracker <-> Nimbus (the daemon on the master node)

  • TaskTracker <-> supervisor (each worker node runs a supervisor that manages multiple worker processes)

  • Submit a computation graph where the compute code lives in the vertices and the edges describe what data is sent and to whom

  • the graph is called a topology

  • nodes are spouts or bolts and contain user-defined logic

  • spout produces one stream

  • bolt consumes multiple streams and produces multiple streams

  • a stream can be subscribed to by multiple bolts

  • all nodes run in parallel

  • even if all machines go down and messages are dropped, no data is lost: failed tuples are replayed from the spout !!!

  • tuples can be emitted at any time from a bolt - in methods like prepare, execute, etc., or asynchronously in another thread

  • a stream grouping object tells storm how to partition a stream among the tasks of a downstream bolt (shuffle, fields, all, global, ...); see the sketch below
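
A minimal sketch of this model, assuming Storm 0.9.x and its backtype.storm API (the spout/bolt classes here are invented for illustration; storm-starter's WordCountTopology is the real thing): a spout is a source vertex, a bolt holds the user-defined logic, and the fields grouping routes tuples with the same word to the same task.

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountSketch {

  // Spout: a source vertex of the graph; produces an endless stream of words.
  public static class WordSpout extends BaseRichSpout {
    private static final String[] WORDS = {"nature", "storm", "stream"};
    private SpoutOutputCollector collector;
    private int i = 0;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(WORDS[i++ % WORDS.length]));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  // Bolt: consumes the spout's stream; the user-defined logic lives in execute().
  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      count = (count == null) ? 1 : count + 1;
      counts.put(word, count);
      collector.emit(new Values(word, count)); // -> "Emitting: count default [nature, 49]"
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout(), 2);
    // The stream grouping: fieldsGrouping on "word" sends every tuple with the
    // same word to the same CountBolt task, so each task's counts stay consistent.
    builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("words", new Fields("word"));
    new LocalCluster().submitTopology("word-count-sketch", new Config(), builder.createTopology());
  }
}
```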

  • Reliability

  • storm tracks all messages generated due to a single source message from a spout

  • a tuple is "fully processed" when the tuple dag it generated across the storm cluster has been fully processed and acked and every message has been drained from the system. This must happen within the message timeout (30 secs by default)

  • the ack/fail is delivered to the exact spout task that created the tuple (see the spout sketch below)

  • kestrel's model is:

    • an event starts in the new state
    • an event can be opened (put in the pending state, not sent to other clients, and assigned an event id)
    • once fully processed it can be acked
    • if a client disconnects, all its pending events are put back in the new state

  • this kestrel model and the event id are used by KestrelSpout to consume events from kestrel and ensure reliable processing
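
A sketch of how such a kestrel-style reliable spout could look, assuming Storm 0.9.x; an in-memory queue stands in for kestrel here, and the class and field names are made up. Emitting with a message id opts the tuple into tracking, and ack/fail come back to this same spout task.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class ReliableWordSpout extends BaseRichSpout {
  private final ConcurrentLinkedQueue<String> fresh = new ConcurrentLinkedQueue<String>(); // "new" state
  private final Map<String, String> pending = new ConcurrentHashMap<String, String>();     // "pending" state
  private SpoutOutputCollector collector;

  public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
    fresh.add("nature");
    fresh.add("storm");
  }

  public void nextTuple() {
    String word = fresh.poll();
    if (word == null) return;
    String msgId = UUID.randomUUID().toString();
    pending.put(msgId, word);                 // "opened": pending, not re-delivered
    collector.emit(new Values(word), msgId);  // the message id opts this tuple into tracking
  }

  // Called once the whole tuple dag rooted at msgId is fully processed.
  public void ack(Object msgId) {
    pending.remove(msgId);                    // safe to forget the event for good
  }

  // Called on explicit failure or after the message timeout (30s by default).
  public void fail(Object msgId) {
    String word = pending.remove(msgId);
    if (word != null) fresh.add(word);        // back to the "new" state for replay
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}
```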

  • an output tuple is anchored to one or more input tuples when it is emitted. this ensures its parentage can be tracked

  • multiple parents help when doing streaming joins or aggregations

  • an input tuple is acked when a bolt fully processes it

  • anchoring and acking happen automatically if the bolt implements IBasicBolt (e.g. extends BaseBasicBolt)

  • for aggregation it is better not to ack until a whole batch of tuples has been processed (see the bolt sketch below)
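
A sketch of explicit anchoring and acking in a rich bolt (Storm 0.9.x API assumed; BaseBasicBolt would do both steps automatically):

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class AnchoringBolt extends BaseRichBolt {
  private OutputCollector collector;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  public void execute(Tuple input) {
    // Anchor the output to the input with emit(anchor, values): the new tuple
    // becomes a child of the input in the tuple dag.
    collector.emit(input, new Values(input.getString(0) + "!"));
    // Ack the input once this bolt is done with it. An aggregating bolt would
    // instead hold its inputs in a list and ack them all only after emitting
    // the aggregate anchored to the whole batch.
    collector.ack(input);
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}
```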

  • tracking the tuple dag is really smart

    • there are a set of acker tasks (can be any number)
    • each acker maintains (spout task id, "ack val" = xor of all tuple ids seen so far) for a subset of spout tuples
    • mod hashing on the spout tuple id (the original tuple's id) picks which acker task tracks it
    • when the spout emits a tuple it sends (spout tuple id, spout task id) to the acker, so the acker knows which task to notify on completion
    • when a bolt acks a tuple it sends (spout tuple id, xor of the acked tuple's id with the ids of all new tuples anchored to it) to the acker
    • each tuple id is xored in once when created and once when acked, so the ack val reaches 0 exactly when the whole dag is processed; the acker then notifies the spout task
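
A plain-Java simulation of that bookkeeping (not Storm code; the ids and method names here are invented). Because x ^ x = 0, the ack val returns to 0 exactly when every tuple in the dag has been both created and acked, so tracking an arbitrarily large dag costs constant memory per spout tuple.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class AckerSketch {
  // spout tuple id -> xor of every tuple id reported so far ("ack val")
  private final Map<Long, Long> ackVals = new HashMap<Long, Long>();

  // Spouts and bolts report (spout tuple id, xor of created/acked tuple ids).
  void update(long spoutTupleId, long xorOfIds) {
    long v = ackVals.containsKey(spoutTupleId) ? ackVals.get(spoutTupleId) : 0L;
    v ^= xorOfIds;
    if (v == 0) {
      ackVals.remove(spoutTupleId);
      System.out.println("spout tuple " + spoutTupleId + " fully processed");
    } else {
      ackVals.put(spoutTupleId, v);
    }
  }

  public static void main(String[] args) {
    Random rng = new Random();
    AckerSketch acker = new AckerSketch();
    long root = rng.nextLong();            // spout emits the root tuple...
    acker.update(42, root);                // ...and reports its random 64-bit id
    long a = rng.nextLong(), b = rng.nextLong();
    acker.update(42, root ^ a ^ b);        // a bolt acks root, emitting children a and b
    acker.update(42, a);                   // a leaf bolt acks a (no new children)
    acker.update(42, b);                   // ...and b: ack val hits 0 -> done
  }
}
```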
  • Linear DRPC - a cool way of parallelizing and distributing a general computation and returning the results to the caller (see the sketch below)
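
A sketch along the lines of storm-starter's BasicDRPCTopology (Storm 0.9.x assumed): the DRPC spout feeds the caller's argument through the bolt chain, and the last bolt's (id, result) tuple is routed back to the waiting client.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExclaimDRPC {
  public static class ExclaimBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      Object id = tuple.getValue(0);       // request id: must be passed through
      String input = tuple.getString(1);   // the argument the client sent
      collector.emit(new Values(id, input + "!"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("id", "result"));
    }
  }

  public static void main(String[] args) {
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
    builder.addBolt(new ExclaimBolt(), 3); // each stage can be parallelized
    LocalDRPC drpc = new LocalDRPC();
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("drpc-sketch", new Config(), builder.createLocalTopology(drpc));
    System.out.println(drpc.execute("exclamation", "hello")); // prints "hello!"
    cluster.shutdown();
    drpc.shutdown();
  }
}
```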

  • CEP (complex event processing) - process multiple event streams and infer conclusions

  • Streaming joins (see the join bolt sketch below)

    • temporal (joining events that overlap in time)
    • historical (joining a stream against all data seen so far)
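
A sketch of the temporal case (Storm 0.9.x assumed; the field names "key"/"val" are invented and the two sides are treated symmetrically): both input streams must be fieldsGrouping'd on the join key so matching tuples land on the same task, one side is buffered until its partner arrives, and the joined output is anchored to both parents. A real join bolt would also evict buffered tuples older than the join window.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class JoinBolt extends BaseRichBolt {
  private final Map<String, Tuple> buffered = new HashMap<String, Tuple>();
  private OutputCollector collector;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  public void execute(Tuple t) {
    String key = t.getStringByField("key");
    Tuple other = buffered.remove(key);
    if (other == null) {
      // No partner yet: buffer the tuple and delay its ack until the join
      // completes (the "don't ack until the batch is done" point above).
      buffered.put(key, t);
    } else {
      // Multi-parent anchoring: the joined tuple descends from both inputs.
      collector.emit(Arrays.asList(t, other),
          new Values(key, t.getValueByField("val"), other.getValueByField("val")));
      collector.ack(t);
      collector.ack(other);
    }
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("key", "left", "right"));
  }
}
```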

  • Conviva uses Spark to speed up ad-hoc queries

  • computation as a graph v/s computation on a graph v/s computation as a pipeline
