How to install and use Storm on CentOS 6.4
  • Download axel (a download accelerator) :)

  • Download and install Maven (mvn)

  • Download storm from https://storm.incubator.apache.org/downloads.html

  • cd apache-storm-0.9.2-incubating/examples/storm-starter

  • mvn compile exec:java -Dstorm.topology=storm.starter.WordCountTopology

  • Lots of output tuples of the form

    • Emitting: count default [nature, 49]

  • mvn compile exec:java -Dstorm.topology=storm.starter.BracketTopology -Dexec.args="arg1"

  • Hadoop MapReduce concepts <-> Storm concepts

  • A Storm cluster has a master node, worker nodes, and a co-ordination cluster (ZooKeeper)

  • JobTracker <-> Nimbus (the daemon on the master node)

  • TaskTracker <-> supervisor (each worker node runs a supervisor that manages multiple worker processes)

  • Submit a computation graph where the compute code lives in the vertices and the edges describe what data is sent and to whom

  • the graph is called a topology

  • nodes are spouts or bolts and contain user-defined logic

  • spout produces one stream

  • bolt consumes multiple streams and produces multiple streams

  • a stream can be subscribed to by multiple bolts

  • all nodes run in parallel

  • even if all machines go down and messages are dropped, no data is lost: failed tuples are replayed from the spout !!!

  • tuples can be emitted at any time from a bolt - in methods like prepare, execute, etc., or asynchronously in another thread

  • a stream grouping object tells storm how to partition a stream among the tasks of a downstream bolt (shuffle, fields, all, global, ...); see the sketch below
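
A minimal sketch of this model, assuming Storm 0.9.x and its backtype.storm API (the spout/bolt classes here are invented for illustration; storm-starter's WordCountTopology is the real thing): a spout is a source vertex, a bolt holds the user-defined logic, and the fields grouping routes tuples with the same word to the same task.

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountSketch {

  // Spout: a source vertex of the graph; produces an endless stream of words.
  public static class WordSpout extends BaseRichSpout {
    private static final String[] WORDS = {"nature", "storm", "stream"};
    private SpoutOutputCollector collector;
    private int i = 0;

    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(100);
      collector.emit(new Values(WORDS[i++ % WORDS.length]));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word"));
    }
  }

  // Bolt: consumes the spout's stream; the user-defined logic lives in execute().
  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      Integer count = counts.get(word);
      count = (count == null) ? 1 : count + 1;
      counts.put(word, count);
      collector.emit(new Values(word, count)); // -> "Emitting: count default [nature, 49]"
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new WordSpout(), 2);
    // The stream grouping: fieldsGrouping on "word" sends every tuple with the
    // same word to the same CountBolt task, so each task's counts stay consistent.
    builder.setBolt("count", new CountBolt(), 4).fieldsGrouping("words", new Fields("word"));
    new LocalCluster().submitTopology("word-count-sketch", new Config(), builder.createTopology());
  }
}
```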

  • Reliability

  • storm tracks all messages generated due to a single source message from a spout

  • a tuple is "fully processed" when the tuple dag it generated across the storm cluster has been fully processed and acked and every message has been drained from the system. This must happen within the message timeout (30 secs by default)

  • the ack/fail is delivered to the exact spout task that created the tuple (see the spout sketch below)

  • kestrel's model is:

    • an event starts in the new state
    • an event can be opened (put in the pending state, not sent to other clients, and assigned an event id)
    • once fully processed it can be acked
    • if a client disconnects, all its pending events are put back in the new state

  • this kestrel model and the event id are used by KestrelSpout to consume events from kestrel and ensure reliable processing
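
A sketch of how such a kestrel-style reliable spout could look, assuming Storm 0.9.x; an in-memory queue stands in for kestrel here, and the class and field names are made up. Emitting with a message id opts the tuple into tracking, and ack/fail come back to this same spout task.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class ReliableWordSpout extends BaseRichSpout {
  private final ConcurrentLinkedQueue<String> fresh = new ConcurrentLinkedQueue<String>(); // "new" state
  private final Map<String, String> pending = new ConcurrentHashMap<String, String>();     // "pending" state
  private SpoutOutputCollector collector;

  public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
    this.collector = collector;
    fresh.add("nature");
    fresh.add("storm");
  }

  public void nextTuple() {
    String word = fresh.poll();
    if (word == null) return;
    String msgId = UUID.randomUUID().toString();
    pending.put(msgId, word);                 // "opened": pending, not re-delivered
    collector.emit(new Values(word), msgId);  // the message id opts this tuple into tracking
  }

  // Called once the whole tuple dag rooted at msgId is fully processed.
  public void ack(Object msgId) {
    pending.remove(msgId);                    // safe to forget the event for good
  }

  // Called on explicit failure or after the message timeout (30s by default).
  public void fail(Object msgId) {
    String word = pending.remove(msgId);
    if (word != null) fresh.add(word);        // back to the "new" state for replay
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}
```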

  • an output tuple is anchored to one or more input tuples when it is emitted. this ensures its parentage can be tracked

  • multiple parents help when doing streaming joins or aggregations

  • an input tuple is acked when a bolt fully processes it

  • anchoring and acking happen automatically if the bolt implements IBasicBolt (e.g. extends BaseBasicBolt)

  • for aggregation it is better not to ack until a whole batch of tuples has been processed (see the bolt sketch below)
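
A sketch of explicit anchoring and acking in a rich bolt (Storm 0.9.x API assumed; BaseBasicBolt would do both steps automatically):

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class AnchoringBolt extends BaseRichBolt {
  private OutputCollector collector;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  public void execute(Tuple input) {
    // Anchor the output to the input with emit(anchor, values): the new tuple
    // becomes a child of the input in the tuple dag.
    collector.emit(input, new Values(input.getString(0) + "!"));
    // Ack the input once this bolt is done with it. An aggregating bolt would
    // instead hold its inputs in a list and ack them all only after emitting
    // the aggregate anchored to the whole batch.
    collector.ack(input);
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("word"));
  }
}
```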

  • tracking the tuple dag is really smart

    • there are a set of acker tasks (can be any number)
    • each acker maintains (spout task id, "ack val" = xor of all tuple ids seen so far) for a subset of spout tuples
    • mod hashing on the spout tuple id (the original tuple's id) picks which acker task tracks it
    • when the spout emits a tuple it sends (spout tuple id, spout task id) to the acker, so the acker knows which task to notify on completion
    • when a bolt acks a tuple it sends (spout tuple id, xor of the acked tuple's id with the ids of all new tuples anchored to it) to the acker
    • each tuple id is xored in once when created and once when acked, so the ack val reaches 0 exactly when the whole dag is processed; the acker then notifies the spout task
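
A plain-Java simulation of that bookkeeping (not Storm code; the ids and method names here are invented). Because x ^ x = 0, the ack val returns to 0 exactly when every tuple in the dag has been both created and acked, so tracking an arbitrarily large dag costs constant memory per spout tuple.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

public class AckerSketch {
  // spout tuple id -> xor of every tuple id reported so far ("ack val")
  private final Map<Long, Long> ackVals = new HashMap<Long, Long>();

  // Spouts and bolts report (spout tuple id, xor of created/acked tuple ids).
  void update(long spoutTupleId, long xorOfIds) {
    long v = ackVals.containsKey(spoutTupleId) ? ackVals.get(spoutTupleId) : 0L;
    v ^= xorOfIds;
    if (v == 0) {
      ackVals.remove(spoutTupleId);
      System.out.println("spout tuple " + spoutTupleId + " fully processed");
    } else {
      ackVals.put(spoutTupleId, v);
    }
  }

  public static void main(String[] args) {
    Random rng = new Random();
    AckerSketch acker = new AckerSketch();
    long root = rng.nextLong();            // spout emits the root tuple...
    acker.update(42, root);                // ...and reports its random 64-bit id
    long a = rng.nextLong(), b = rng.nextLong();
    acker.update(42, root ^ a ^ b);        // a bolt acks root, emitting children a and b
    acker.update(42, a);                   // a leaf bolt acks a (no new children)
    acker.update(42, b);                   // ...and b: ack val hits 0 -> done
  }
}
```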
  • Linear DRPC - a cool way of parallelizing and distributing a general computation and returning the results to the caller (see the sketch below)
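
A sketch along the lines of storm-starter's BasicDRPCTopology (Storm 0.9.x assumed): the DRPC spout feeds the caller's argument through the bolt chain, and the last bolt's (id, result) tuple is routed back to the waiting client.

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class ExclaimDRPC {
  public static class ExclaimBolt extends BaseBasicBolt {
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      Object id = tuple.getValue(0);       // request id: must be passed through
      String input = tuple.getString(1);   // the argument the client sent
      collector.emit(new Values(id, input + "!"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("id", "result"));
    }
  }

  public static void main(String[] args) {
    LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("exclamation");
    builder.addBolt(new ExclaimBolt(), 3); // each stage can be parallelized
    LocalDRPC drpc = new LocalDRPC();
    LocalCluster cluster = new LocalCluster();
    cluster.submitTopology("drpc-sketch", new Config(), builder.createLocalTopology(drpc));
    System.out.println(drpc.execute("exclamation", "hello")); // prints "hello!"
    cluster.shutdown();
    drpc.shutdown();
  }
}
```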

  • CEP (complex event processing) - process multiple event streams and infer conclusions

  • Streaming joins (see the join bolt sketch below)

    • temporal (joining events that overlap in time)
    • historical (joining a stream against all data seen so far)
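
A sketch of the temporal case (Storm 0.9.x assumed; the field names "key"/"val" are invented and the two sides are treated symmetrically): both input streams must be fieldsGrouping'd on the join key so matching tuples land on the same task, one side is buffered until its partner arrives, and the joined output is anchored to both parents. A real join bolt would also evict buffered tuples older than the join window.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class JoinBolt extends BaseRichBolt {
  private final Map<String, Tuple> buffered = new HashMap<String, Tuple>();
  private OutputCollector collector;

  public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
    this.collector = collector;
  }

  public void execute(Tuple t) {
    String key = t.getStringByField("key");
    Tuple other = buffered.remove(key);
    if (other == null) {
      // No partner yet: buffer the tuple and delay its ack until the join
      // completes (the "don't ack until the batch is done" point above).
      buffered.put(key, t);
    } else {
      // Multi-parent anchoring: the joined tuple descends from both inputs.
      collector.emit(Arrays.asList(t, other),
          new Values(key, t.getValueByField("val"), other.getValueByField("val")));
      collector.ack(t);
      collector.ack(other);
    }
  }

  public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("key", "left", "right"));
  }
}
```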

  • Conviva uses Spark to speed up ad-hoc queries

  • computation as a graph v/s computation on a graph v/s computation as a pipeline
