apache kafka notes

10 November 2015, Jay Kreps

problem definition:

"let's get all the data in hadoop!"

  • coverage
  • heterogeneous systems
  • data formats
  • constant change

Step 1: learn how other people do data flow

== "scary in 2009"

Wanted user events, initially did everything in batches with rsync

messaging: app --> broker --> processor

Everything was sort of working on its own, but the pieces didn't scale equally (big snarl)

wanted an infrastructure solution "hit all your problems with a really good hammer"

1. Everything is an event

  • DB change
  • user action
  • metrics
  • logs

2. Data pipeline == Event Stream

kafka is an SDP: Stream Data Platform

less coordination required for APIs and partner orgs to be able to use the data

Many use cases that were not considered when the data were originally published

Had used ActiveMQ/RabbitMQ

problems:

  • throughput was too low
  • persistence was too short and unstable (spilling to disk was unpredictable)
  • no partitioning or ordering for doing search/analytics

**producers --> kafka cluster --> consumers**

core abstraction is a log, not a queue

i.e., an ordered sequence of records

pub-sub model

past <--- reader, reader --> writer ---> future

readers are basically just pointers

Kafka topic: a partitioned set of logs. Like a filesystem, but multi-writer and expected to continue forever. Really optimized to make it efficient to always be checking the tail.
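A minimal sketch of writing to a topic, using the kafka-python client (the talk doesn't prescribe a client library) against a hypothetical broker at localhost:9092; the topic name and key are made up. Records with the same key hash to the same partition, so each partition's log stays ordered and the broker only ever appends at the tail.

```python
from kafka import KafkaProducer

# assumption: a local broker at localhost:9092; kafka-python is just one
# client library, not the one used at LinkedIn
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# each send() appends a record at the tail of one partition's log;
# records sharing a key are hashed to the same partition, so their
# relative order is preserved for readers
for i in range(10):
    producer.send("user-events",                        # hypothetical topic
                  key=b"user-42",                        # hypothetical key
                  value="click {}".format(i).encode("utf-8"))

producer.flush()  # block until the broker has acknowledged the appends
```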

Groups: batches of consumer processes that share the work. Each topic is multi-subscriber.
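A hedged sketch of a consumer in a group, again with kafka-python and invented names: consumers that share a group_id divide the topic's partitions among themselves, a consumer with a different group_id gets its own full copy of the stream, and each reader's position is just an offset (a pointer) into the log.

```python
from kafka import KafkaConsumer

# assumption: same local broker; topic and group names are made up
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",          # consumers sharing this id split the partitions
    auto_offset_reset="earliest",  # a new group starts at the oldest retained record
)

for record in consumer:
    # record.offset is this reader's pointer into the partition's log
    print(record.partition, record.offset, record.key, record.value)
```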

  • hundreds of MB/sec throughput
  • many TB/server
  • O(1) writes
  • don't want it to slow down as you scale up

distributed by design:

  • replication
  • fault-tolerant
  • partitioning
  • elastic scale

3 paradigms to consider:

  1. request-response (1:1)

  2. batch (all:all)

  3. stream (some:some)

Stream processing

central copy of real-time data events

nowadays, stream processing no longer has to be transient, approximate, or lossy

batch data collection means you're doing batch processing

to do stream processing you have to stream the data

General idea is topic -> transform -> topic -> transform, etc. in a pipeline

still want to decouple the input & output from the logic
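A rough illustration of one topic -> transform -> topic stage using plain Kafka clients (the talk points to Storm/Samza/Spark Streaming for real stream processing; this just shows the shape). Topic names, the group id, and the transform are placeholders.

```python
from kafka import KafkaConsumer, KafkaProducer

# assumption: same local broker; topic names, group id, and the transform
# are placeholders for illustration
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         group_id="enricher")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for record in consumer:
    # the transform is ordinary application logic, decoupled from whoever
    # produced the input and whoever will consume the output topic
    transformed = record.value.upper()          # stand-in transform
    producer.send("page-views-enriched", transformed)
```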

Storm/Samza/Spark Streaming + Kafka = Stream Processing

RT analytics stack:

data --> kafka + stream processing --> OLAP/timeseries DB --> charting

Everyone started looking at the data, totally changed the culture

now have:

  • 1.2 trillion messages/day written
  • 3.4 trillion messages/day read
  • ~1 PB stored in kafka clusters

data are retained for 7 days after they're initially collected, then expired (whether or not they've been read)
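The 7-day figure presumably corresponds to Kafka's time-based retention; a broker sets it in server.properties, and 168 hours (7 days) is the shipped default:

```
# server.properties (broker config); 168 hours = 7 days is the default
log.retention.hours=168
```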
