@mathyourlife
Created September 17, 2014 13:45
kafka notes

Apache Kafka

"A high-throughput distributed messaging system." site

Notes taken from source

Overview

  • Created at LinkedIn (open sourced in 2011)
  • Implemented in scala and some java

Design Requirements

  • High throughput to support high volume event feeds
  • Support real-time processing of these feeds to create new, derived feeds
  • Support large data backlogs to handle periodic ingestion from offline systems
  • Support low-latency delivery to handle more traditional messaging use cases
  • Guarantee fault-tolerance in the presence of machine failure

Writes

  • writes go to the page cache of OS, i.e. RAM

Reads

  • direct transfer from page cache to network socket with sendfile, avoids copying data into the kafka application for sending (see the sketch below)
  • A healthy kafka cluster will show minimal read activity to disk as data is served primarily from cache.
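
The zero-copy path is the standard java.nio FileChannel.transferTo call, which on Linux maps to sendfile(2). A minimal sketch of the mechanism, assuming a placeholder log segment path and destination socket (this is not Kafka's actual log code):

import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        // Hypothetical segment file and destination; illustrates the mechanism only.
        FileChannel segment = new FileInputStream("/tmp/segment.log").getChannel();
        SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9092));

        long position = 0;
        long size = segment.size();
        while (position < size) {
            // transferTo lets the kernel move bytes from the page cache straight to
            // the socket (sendfile on Linux) without copying them into the JVM heap.
            position += segment.transferTo(position, size - position, socket);
        }

        segment.close();
        socket.close();
    }
}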

LinkedIn Usage Stats

  • 300+ brokers
  • 18K+ topics
  • 140k+ partitions
  • 220 bil messages per day
  • 40 TB in
  • 160 TB out
  • peak 3.25 mil msg/sec
  • 5.5 Gbit/sec in
  • 18 Gbit/sec out

General Use Case

  • kafka + (Storm, Samza, Spark Streaming, homegrown)

Who uses

  • LinkedIn: activity streams, ops metrics, data bus
  • Netflix: real-time monitoring, event processing
  • Twitter: w/ storm for real-time data pipelines
  • Spotify: log delivery to hadoop
  • loggly: log collection and processing
  • Mozilla: telemetry data

Architecture

  • kafka brokers handle reads and writes
  • Hierarchy (Topic -> Partition -> Replication)
  • ZooKeeper manages and shares state for brokers and consumers (brokers only as of kafka v0.9; see the ZooKeeper sketch after this list)
  • Topics/Partitions are append only and immutable sequence
  • Length of topics/partitions is governed by retention: age, max size, or key (compaction)
  • Offset determines location in the queue (monotonically increasing integer)
  • Consumers use the offset to track position in the queue
  • Replicas exist solely for fault tolerance (clients never read from or write to follower replicas directly)
  • Servers with large RAM used for serving data from cache (LinkedIn: 64 GB = 60 GB for page cache, 4 GB for the broker)
  • RAID10 with 14 spindles
    • more spindles -> higher disk throughput
    • Cache on RAID with battery backup
  • 1 Gig ethernet
  • zookeeper servers -> run only zookeeper, 1 zookeeper per host
  • recommend SSDs
  • kafka clusters don't span data centers (latency to zookeeper servers?)
  • a zookeeper ensemble of n nodes tolerates (n-1)/2 failures (rounded down; a majority must stay up)
    • LinkedIn runs 5 node ensembles
    • Twitter runs 13 node ensembles
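
A minimal sketch of inspecting that shared state with the plain ZooKeeper Java client; the /brokers/ids and /brokers/topics paths are the 0.8-era layout, and the ensemble address is a placeholder:

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        // Placeholder ensemble address; 3 s session timeout, no watcher.
        ZooKeeper zk = new ZooKeeper("1.2.3.4:2181", 3000, null);

        // Each live broker registers an ephemeral node under /brokers/ids.
        List<String> brokerIds = zk.getChildren("/brokers/ids", false);
        for (String id : brokerIds) {
            byte[] data = zk.getData("/brokers/ids/" + id, false, null);
            System.out.println("broker " + id + ": " + new String(data));
        }

        // Topic metadata (partitions, replica assignments) lives under /brokers/topics.
        System.out.println("topics: " + zk.getChildren("/brokers/topics", false));

        zk.close();
    }
}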

Interface

Topic Creation

cli

kafka-topics.sh --zookeeper 1.2.3.4:2181 --create --topic topic.name \
   --partitions 3 --replication-factor 2 --config x=y

auto create

auto.create.topics.enable = true

Current Topics

kafka-topics.sh --zookeeper 1.2.3.4:2181 --describe --topic topic.name

Producers

  • Synchronous or Asynchronous
  • Sync blocks client on send()
  • Async sends message in background
    • Allows for batching of messages
    • Pools of Sync producers
    • Possible to drop messages if queue is full
  • Write Message ack
    • message considered committed when "any required" in-sync replicas for that partition have applied the message to their data log
    • "any required" defined by producers in request.required.acks
      • 0 - fire and forget (no ack required)
      • 1 - wait for leader to ack
      • -1 - wait for all ISR (in-sync replicas) ack
    • only committed messages are given to consumers
  • if request.timeout.ms is too low, the client can receive an error even though a broker later acks the write
  • batch writes
    • higher risk for data loss if client dies before pending messages sent
  • default partitioner selects based on hash of key
    • so by default, kafka guarantees that all data for the same key goes to the same partition
  • if no key is provided, the client pushes to a partition and sticks with it for a random period of time, then switches to another random partition.
    • key is retained in the message in the broker
  • the current list of topic/partition leaders is provided to producers by the brokers listed in metadata.broker.list (see the producer sketch below)
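
A minimal producer sketch using the 0.8-era Java API of the old Scala client, with placeholder broker addresses, topic, and key; it shows the sync/async switch and the ack setting discussed above:

import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // "bootstrap" brokers used to discover topic/partition leaders.
        props.put("metadata.broker.list", "1.2.3.4:9092,1.2.3.5:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        // 0 = fire and forget, 1 = wait for leader ack, -1 = wait for all ISR acks.
        props.put("request.required.acks", "1");
        // "async" batches in the background; "sync" blocks the client on send().
        props.put("producer.type", "async");
        // Async messages can be dropped if this queue fills up.
        props.put("queue.buffering.max.messages", "10000");

        Producer<String, String> producer =
            new Producer<String, String>(new ProducerConfig(props));

        // With a key, the default partitioner hashes it, so all messages for
        // the same key land in the same partition.
        producer.send(new KeyedMessage<String, String>("topic.name", "some-key", "hello"));

        producer.close();
    }
}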

Brokers

  • Some brokers are designated for "bootstrap"
  • At least one "bootstrap" broker must be up, or producers will stop working
  • Recommended to use VIP for "bootstrap" brokers

Consumers

  • storm 0.9.2 uses simple consumer API to integrate well with storm's model of guaranteed message processing
  • Consumers pull from kafka
  • Consumers are responsible for tracking their offset position
  • High-level consumer: stores offsets in zookeeper (see the consumer sketch after this list)
  • Simple consumer: the application is responsible for tracking its own offset
  • Consumers can rewind, limited by the topic/partition's retention (max age or max size)
  • Consumer can select subset of a topic's partitions to consume
  • config:
    • group.id assigns consumer to a "group"
    • zookeeper.connect broker/topic/etc discovery
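
A minimal sketch of the 0.8-era high-level consumer (Java API of the old Scala client), assuming placeholder zookeeper address, group, and topic; offsets are tracked in zookeeper under the given group.id:

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

public class ConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("zookeeper.connect", "1.2.3.4:2181");  // broker/topic/offset discovery
        props.put("group.id", "example-group");          // consumers sharing this id split the partitions
        props.put("auto.offset.reset", "smallest");      // with no stored offset, start at the oldest retained message

        ConsumerConnector connector =
            Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

        // Ask for a single stream (thread) for the topic.
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(Collections.singletonMap("topic.name", 1));

        // Blocks waiting for messages; the connector commits offsets to zookeeper periodically.
        ConsumerIterator<byte[], byte[]> it = streams.get("topic.name").get(0).iterator();
        while (it.hasNext()) {
            System.out.println(new String(it.next().message()));
        }
    }
}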

Monitoring

  • System Tools
  • Replication tools
  • stormkafkamon
  • JMX (see the sketch after this list)
  • dropwizard/metrics
  • garbage collection time
  • data size on disk should be balanced across brokers/disks
  • data balance is more important than partition balance
  • Leader partition count should be balanced (avoid hot node since all reads/writes go against the associated leader topic/partition)
  • track network utilization
    • common cause for under-replicated partition
  • request latency (if on SSDs, should be essentially <1 ms)
  • outstanding requests (increasing means items backing up)
  • LinkedIn working on adding "Auditing" functionality
    • not yet released
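
A minimal sketch of pulling broker metrics over JMX, assuming the broker was started with JMX enabled on port 9999 (e.g. by exporting JMX_PORT=9999); the kafka* domain pattern is a broad guess rather than a specific metric name, since MBean names vary between kafka versions:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder broker host/port; standard RMI-based JMX URL.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://1.2.3.4:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbeans = connector.getMBeanServerConnection();

        // List every MBean the broker exposes in the kafka* domains.
        Set<ObjectName> names = mbeans.queryNames(new ObjectName("kafka*:*"), null);
        for (ObjectName name : names) {
            System.out.println(name);
        }

        connector.close();
    }
}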

Performance Tuning

  • vm.swappiness = 0
  • Allow more dirty pages but less dirty cache vm.dirty_*_ratio
  • Lengthen the flush interval (LinkedIn uses 120 s; they can tolerate 2 minutes' worth of unflushed data because replicas cover the loss)
  • Additional spindles (RAID10 with 14 disks)
  • num.io.threads: should be >= the number of disks (initially set it to the number of disks)
  • num.network.threads: adjust based on concurrent # producers, consumers, replication factor
  • Compression
    • snappy and gzip are supported; snappy is recommended: lower compression ratio, but much less CPU intensive (see the one-liner below)
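
Compression is configured on the producer; a one-line sketch using the 0.8-era property name, added to the producer Properties shown earlier:

// In the producer Properties from the sketch above:
props.put("compression.codec", "snappy");   // or "gzip"; "none" is the default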

Ops

Tutorials

Tales of Caution

General

  • Don't break up into separate topics unless data is truly independent
  • Keep time related messages in the same partition
  • Async producers can drop messages if the send queue is full (default queue.buffering.max.messages = 10,000)
  • if the producer's request.timeout.ms is too low, the client can receive an error even though a broker later acks the write
  • Use VIPs in the bootstrap broker list (metadata.broker.list) to cope with "bootstrap" brokers going up and down

Aphyr - Call Me Maybe blog

kafka post

Security

  • Kafka was not designed originally with security in mind
  • June 2014: security features being added (TLS, data encryption at rest)

Common Initial Issues

  • Garbage collection
    • the G1 ("garbage first") collector is suggested
  • Educating and coaching on kafka use
  • Expanding/reducing size of kafka cluster
  • Monitoring consumer apps ("My stuff stopped working, what did kafka do?")
  • Consumer lag
    • consumers too slow
    • too much GC
    • Loss of connection to zookeeper servers or brokers
    • bug or design flaw
  • Rebalancing
    • Multiple new nodes can trigger repeated rebalancing that hits kafka's rebalance limit rebalance.max.retries (default 4)
    • New script in v0.8.1 to balance data/partitions across brokers
  • Problems with partition healing (CAP)

Questions

  • Measure replication lag?
  • Does kafka fsync on each write? If not, what is the flush interval?
  • Election algorithm
  • How do you ensure unique broker ids in automated environment?
  • Measure health of zookeeper server
  • review puppet-kafka recipe
  • What happens to a consumer that restarts and loses its offset value: does it start at the beginning or the end of the queue?