beppu/realtime-analytics.org

## realtime-analytics.org

      
    https://www.youtube.com/watch?v=fZGlmVMdKDs
Building Realtime Analytics Systems

Finding a solution

(Initially you might think…) Load all your data into Hadoop.  Query it.  Done.

MapReduce/Hadoop has problems.  It can be slow.

Not optimized for query latency

Need to introduce a query layer to make queries faster.

Making Queries Faster

What types of queries to optimize for?

Initially just Hadoop

Then Hadoop for preprocessing and storage -> Various data stores

Existing data stores were not good enough, so they made their own called Druid.

http://druid.io/

Druid is optimized for reads

Druid is a column store
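
Why a column store helps reads can be sketched with a toy example. This is only an illustration of the general column-layout idea, not Druid's actual storage format; the event fields (timestamp, page, country, clicks) are made up.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The event fields here are hypothetical, chosen only for illustration.

rows = [
    {"timestamp": 1, "page": "/home", "country": "US", "clicks": 3},
    {"timestamp": 2, "page": "/docs", "country": "JP", "clicks": 5},
    {"timestamp": 3, "page": "/home", "country": "US", "clicks": 2},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per field."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited to sum a single field.
total_from_rows = sum(row["clicks"] for row in rows)

# Column layout: only the "clicks" column is scanned.
total_from_columns = sum(columns["clicks"])

# A filtered aggregate touches just the two columns involved.
us_clicks = sum(
    clicks
    for country, clicks in zip(columns["country"], columns["clicks"])
    if country == "US"
)
```

Both layouts give the same totals; the difference is how much data an aggregate query has to read to get there.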

Kafka (Loading): a distributed message queue that buffers incoming events

Storm (Processing)

Architecture

Kafka Brokers -> Storm Workers -> Druid Realtime Workers -> Druid Historical Cluster
                                            \                        /
                                               Druid Query Broker
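
The query broker's role in the architecture above can be sketched as a fan-out and merge. The `Node` and `Broker` classes below are hypothetical stand-ins, not Druid APIs, and the sketch assumes each hour bucket lives on exactly one node so nothing is counted twice.

```python
from collections import Counter


class Node:
    """Hypothetical stand-in for a realtime or historical node."""

    def __init__(self, segments):
        # segments maps an hour bucket to a list of (page, clicks) events
        self.segments = segments

    def query(self, hours):
        """Return per-page click totals for the requested hours held here."""
        totals = Counter()
        for hour in hours:
            for page, clicks in self.segments.get(hour, []):
                totals[page] += clicks
        return totals


class Broker:
    """Hypothetical broker: asks every node, then merges partial results."""

    def __init__(self, nodes):
        self.nodes = nodes

    def query(self, hours):
        merged = Counter()
        for node in self.nodes:
            merged.update(node.query(hours))  # Counter.update adds counts
        return dict(merged)


# Older hours sit on the historical node, the newest hour on the realtime node.
historical = Node({0: [("/home", 4), ("/docs", 1)], 1: [("/home", 2)]})
realtime = Node({2: [("/docs", 3), ("/home", 1)]})

broker = Broker([realtime, historical])
result = broker.query([0, 1, 2])
```

The caller sees one answer covering both fresh and old data without knowing which node held which hour.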
What we gained

No need for Hadoop

Faster

What we gave up

Difficult to handle corrections of existing data

Windows may be too small for fully accurate operations

Hadoop was actually good at these things

Hadoop might be brought back to rectify this.
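
One possible reading of the window problem, as a sketch: a realtime path that only accepts events arriving within a short window drops late data, while a batch pass over everything can recompute the correct numbers afterwards. The window size, bucket size, and events below are made-up values, and the notes do not say this is exactly how the talk's system behaves.

```python
from collections import defaultdict

WINDOW = 10  # seconds of lateness the realtime path tolerates (made-up value)

# (event_time, arrival_time) pairs; the third event arrives far too late.
events = [(100, 101), (105, 107), (102, 130), (121, 122)]


def bucket(event_time, size=60):
    """Start of the fixed-size time bucket containing event_time."""
    return event_time - event_time % size


def realtime_counts(events, window=WINDOW):
    """Count events per bucket, dropping any that arrive outside the window."""
    counts = defaultdict(int)
    for event_time, arrival_time in sorted(events, key=lambda e: e[1]):
        if arrival_time - event_time <= window:
            counts[bucket(event_time)] += 1
    return dict(counts)


def batch_counts(events):
    """Count every event per bucket, as a later batch job over all data could."""
    counts = defaultdict(int)
    for event_time, _arrival_time in events:
        counts[bucket(event_time)] += 1
    return dict(counts)


approximate = realtime_counts(events)
corrected = batch_counts(events)
```

Here the realtime count for the first bucket is short by one event, and the batch recount restores it, which is the kind of correction a batch layer could provide.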

What this might look like

          Storm
  Kafka           Druid
          Hadoop

Glue to connect these pieces together

storm-kafka (Kafka -> Storm)

Camus (Kafka -> Hadoop)

Tranquility (Storm -> Druid)

Druid already knows how to talk to Hadoop (Hadoop -> Druid)