https://www.youtube.com/watch?v=fZGlmVMdKDs
Building Realtime Analytics Systems
(Initially you might think…) Load all your data into Hadoop. Query it. Done
MapReduce/Hadoop has problems. It can be slow.
Not optimized for query latency
Need to introduce a query layer to make queries faster.
What types of queries to optimize for?
Then Hadoop for preprocessing and storage -> Various data stores
Existing data stores were not good enough, so they made their own called Druid.
Druid is optimized for reads
Kafka (Loading?) (It’s a message queue of some sort)
Kafka Brokers -> Strom Workers -> Druid Realtime Workers -> Druid Historical Cluster
\ /
Druid Query Broker
Difficult to handle corrections of existing data
Windows may be too small for fully accurate operations
Hadoop was actually good at these things
Hadoop might be brought back to rectify this.
What this might look like
Storm
Kafka Druid
Hadoop
Glue to connect these pieces together
storm-kafka
Camus
Tranquility
Druid already knows how to talk to Hadoop