10 November 2015, Jay Kreps
problem definition:
"let's get all the data in hadoop!"
- coverage
- heterogeneous systems
- data formats
- constant change
Step 1: learn how other people do data flow
== "scary in 2009"
Wanted user events, initially did everything in batches with rsync
messaging: app --> broker --> processor
Each system sort of worked on its own, but they didn't all scale equally (big snarl)
wanted an infrastructure solution "hit all your problems with a really good hammer"
1. Everything is an event (see the sketch after this list)
- DB change
- user action
- metrics
- logs
2. Data pipeline == Event Stream
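(not from the talk: a minimal Java sketch of "everything is an event" as one envelope type; the class and field names are made up for illustration)

    // one generic envelope: a DB change, user action, metric, or log line
    // all become instances of the same record type on the pipeline
    public final class Event {
        public final String type;      // e.g. "page_view", "db_change", "metric"
        public final long timestampMs; // when it happened
        public final String source;    // which app/host emitted it
        public final String payload;   // source-specific body, e.g. JSON

        public Event(String type, long timestampMs, String source, String payload) {
            this.type = type;
            this.timestampMs = timestampMs;
            this.source = source;
            this.payload = payload;
        }
    }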
kafka is an SDP: Stream Data Platform
less coordination required for APIs and partner orgs to be able to use the data
Many use cases that were not considered when the data were originally published
Had used ActiveMQ/RabbitMQ
problems:
- throughput was too low
- persistence was too short and unstable (spilling to disk was unpredictable)
- no partitioning or ordering for doing search/analytics
** producers --> kafka cluster --> consumers **
core abstraction is a log, not a queue
i.e., an ordered sequence of records
pub-sub model
past <--- [reader] [reader] [writer] ---> future
readers are basically just pointers
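(sketch, assuming Kafka's modern Java client, a local broker, and a made-up topic "user-events": the reader really is just an offset you can reposition)

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    public class ReadFromOffset {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                TopicPartition p0 = new TopicPartition("user-events", 0);
                consumer.assign(Collections.singletonList(p0));
                consumer.seek(p0, 42L); // the reader *is* this offset: move it to re-read history
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1)))
                    System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }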
Kafka topic: a partitioned set of logs
like a filesystem, but multi-writer and expected to continue forever
really optimized to make it efficient to always be checking the tail
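(sketch, same assumptions: writes append at the tail of a partition chosen by hashing the key, which is what gives per-key ordering)

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ProduceEvent {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // the key picks the partition (by hash), so all of user42's
                // events land in one partition, in order, appended at the tail
                producer.send(new ProducerRecord<>("user-events", "user42", "{\"action\":\"click\"}"));
            }
        }
    }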
Consumer groups: batches of processes
each topic is multi-subscriber
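(sketch, same assumptions: processes sharing a group.id divide the topic's partitions among themselves; a different group.id gets its own full copy of the stream)

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupReader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "analytics"); // processes sharing this id split the partitions
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("user-events"));
                // a second group with a different group.id would get its own
                // full copy of the stream -- that's the multi-subscriber part
                while (true)
                    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1)))
                        System.out.printf("partition=%d offset=%d %s%n", r.partition(), r.offset(), r.value());
            }
        }
    }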
hundreds of MB/sec throughput
many TB/server
O(1) writes: don't want it to slow down as you scale up
distributed by design:
- replication
- fault-tolerant
- partitioning
- elastic scale
3 paradigms to consider:
- request-response (1:1)
- batch (all:all)
- stream (some:some)
Stream processing
central copy of real-time data events
nowadays, stream processing is no longer transient, approximate, and lossy
batch data collection means you're doing batch processing
to do stream processing you have to stream the data
General idea is topic -> transform -> topic -> transform, etc. in a pipeline
still want to decouple the input & output from the logic
Storm/Samza/Spark Streaming + Kafka = Stream Processing
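(sketch of one topic -> transform -> topic stage using the plain Java clients, no framework; the topic names "raw-events"/"clean-events" and the lowercasing transform are made up)

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PipelineStage {
        public static void main(String[] args) {
            Properties c = new Properties();
            c.put("bootstrap.servers", "localhost:9092");
            c.put("group.id", "cleaner");
            c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            Properties p = new Properties();
            p.put("bootstrap.servers", "localhost:9092");
            p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaConsumer<String, String> in = new KafkaConsumer<>(c);
                 KafkaProducer<String, String> out = new KafkaProducer<>(p)) {
                in.subscribe(Collections.singletonList("raw-events"));
                while (true) {
                    for (ConsumerRecord<String, String> r : in.poll(Duration.ofSeconds(1))) {
                        // stand-in transform; the output is itself just another
                        // topic, so stages chain: topic -> transform -> topic -> ...
                        out.send(new ProducerRecord<>("clean-events", r.key(), r.value().toLowerCase()));
                    }
                }
            }
        }
    }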
RT analytics stack:
data --> kafka + stream processing --> OLAP/timeseries DB --> charting
Everyone started looking at the data, totally changed the culture
now have:
- 1.2 trillion messages/day written
- 3.4 trillion messages/day read
- ~1 PB stored in kafka clusters
data are deleted after a 7-day retention window from when they're initially collected, whether or not they've been read
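(for reference, a sketch of how that retention is configured, via Kafka's broker-level server.properties; 168 hours = 7 days)

    # delete log segments older than 7 days, read or not
    log.retention.hours=168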