Skip to content

Instantly share code, notes, and snippets.

@beppu
Created November 11, 2015 23:59
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save beppu/dbe352a8f143c2cd2aa9 to your computer and use it in GitHub Desktop.
Save beppu/dbe352a8f143c2cd2aa9 to your computer and use it in GitHub Desktop.

https://www.youtube.com/watch?v=fZGlmVMdKDs

Building Realtime Analytics Systems

Finding a solution

(Initially you might think…) Load all your data into Hadoop. Query it. Done

MapReduce/Hadoop has problems. It can be slow.

Not optimized for query latency

Need to introduce a query layer to make queries faster.

Making Queries Faster

What types of queries to optimize for?

Intially just Hadoop

Then Hadoop for preprocessing and storage -> Various data stores

Existing data stores were not good enough, so they made their own called Druid.

Druid is optimized for reads

Druid is a column store

Kafka (Loading?) (It’s a message queue of some sort)

Storm (Processing)

arch

Kafka Brokers -> Strom Workers -> Druid Realtime Workers -> Druid Historical Cluster \ / Druid Query Broker

What we gained

No need for hadoop

Faster

What we gave up

Difficult to handle corrections of existing data

Windows may be too small for fully accurate operations

Hadoop was actually good at these things

Hadoop might be brought back to rectify this.

What this might look like

Storm Kafka Druid Hadoop

Glue to connect these pieces together

storm-kafka Camus Tranquility Druid already knows how to talk to Hadoop

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment