@topher6345
Last active December 14, 2017 04:13

Apache Kafka

  • a new messaging-based log aggregator
  • a distributed messaging system
  • a horizontally scalable messaging system
  • memory-mapped files
  • kernel-space processing

http://sites.computer.org/debull/A12june/pipeline.pdf

https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying

http://research.microsoft.com/en-us/um/people/lamport/pubs/lamport-paxos.pdf

http://research.microsoft.com/en-us/um/people/lamport/pubs/paxos-simple.pdf

http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf

https://www.youtube.com/watch?v=JEpsBg0AO6o

http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/spanner-osdi2012.pdf

  • Why
  • How
  • Who

http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

Why

We needed to support most of the real-time applications mentioned above with delays of no more than a few seconds.

Every day, China Mobile collects 5–8TB of phone call records [11] and Facebook gathers almost 6TB of various user activity events [12].

Kafka shows superior performance compared to two popular messaging systems (ActiveMQ and RabbitMQ in the paper's experiments).

developed for collecting and delivering high volumes of log data with low latency.

Why is Kafka Interesting

Producers and consumers hold the business logic; the Kafka broker itself is deliberately simple and only stores and moves messages.

How

we find the “pull” model more suitable for our applications since each consumer can retrieve the messages at the maximum rate it can sustain and avoid being flooded by messages pushed faster than it can handle. The pull model also makes it easy to rewind a consumer...

A stream of messages of a particular type is defined by a topic. A producer can publish messages to a topic. The published messages are then stored at a set of servers called brokers. A consumer can subscribe to one or more topics from the brokers, and consume the subscribed messages by pulling data from the brokers.
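The publish/subscribe flow above can be sketched with a toy in-memory broker (illustrative only: the class and method names here are invented, and a real Kafka broker persists messages to disk across a cluster):

```python
from collections import defaultdict

class Broker:
    """Toy in-memory broker: stores messages per topic in an append-only log."""
    def __init__(self):
        self.logs = defaultdict(list)  # topic -> list of byte payloads

    def publish(self, topic, payload):
        """A producer publishes a message (just a payload of bytes) to a topic."""
        self.logs[topic].append(payload)

    def pull(self, topic, offset, max_messages=100):
        """A consumer pulls from its own offset, at its own pace."""
        return self.logs[topic][offset:offset + max_messages]

broker = Broker()
broker.publish("page-views", b"user=1 url=/home")
broker.publish("page-views", b"user=2 url=/cart")

# The consumer, not the broker, tracks how far it has read.
offset = 0
batch = broker.pull("page-views", offset)
offset += len(batch)
```

The pull model is what lets each consumer run at its maximum sustainable rate: the broker never pushes faster than the consumer asks.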

A message is defined to contain just a payload of bytes.

To subscribe to a topic, a consumer first creates one or more message streams for the topic

The messages published to that topic will be evenly distributed into these sub-streams.

To balance load, a topic is divided into multiple partitions and each broker stores one or more of those partitions.

Terminology

http://www.slideshare.net/chriscurtin/ajug-march-2013-kafka slide 18

  • Topics are the main grouping mechanism for messages
  • Brokers store the messages and take care of redundancy issues
  • Producers write messages to a broker for a specific topic
  • Consumers read from brokers for a specific topic
  • Topics can be further segmented by partitions
  • Consumers can read a specific partition from a topic

Simple Storage

Each partition of a topic corresponds to a logical log. Physically, a log is implemented as a set of segment files of approximately the same size

Every time a producer publishes a message to a partition, the broker simply appends the message to the last segment file.

A message is only exposed to the consumers after it is flushed.

a message stored in Kafka doesn’t have an explicit message id. Instead, each message is addressed by its logical offset in the log. This avoids the overhead of maintaining auxiliary, seek-intensive random-access index structures that map the message ids to the actual message locations. Note that our message ids are increasing but not consecutive.
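The offset-addressing idea can be shown with a toy byte log (a sketch: variable and function names are invented). Each message's id is its starting byte offset, so the next id is the current offset plus the message length, which is why ids increase but are not consecutive:

```python
# Toy append-only log where a message is addressed by its byte offset.
log = b""
offsets = []

def append(payload):
    """Append a message; its id is simply where it starts in the log."""
    global log
    offsets.append(len(log))  # logical offset = starting byte position
    log += payload

append(b"hello")    # id 0
append(b"kafka!!")  # id 5  (0 + 5 bytes)
append(b"bye")      # id 12 (5 + 7 bytes)

# Addressing a message needs no auxiliary index structure:
msg = log[offsets[1]:offsets[2]]  # the second message, b"kafka!!"
```

No id-to-location index is needed because the id *is* the location, which is exactly the overhead the paper avoids.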

consumer always consumes messages from a particular partition sequentially

Efficient Transfer

the end consumer API iterates one message at a time, under the covers, each pull request from a consumer also retrieves multiple messages up to a certain size, typically hundreds of kilobytes.
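The batching behavior could be sketched like this (an assumption-laden toy: the function name and the list-of-bytes representation are invented, and a real fetch works against segment files):

```python
def pull_batch(messages, start_index, max_bytes=300):
    """Return consecutive messages from start_index until the byte budget
    is exhausted; always return at least one message so progress is made."""
    batch, size = [], 0
    for msg in messages[start_index:]:
        if batch and size + len(msg) > max_bytes:
            break
        batch.append(msg)
        size += len(msg)
    return batch

msgs = [b"x" * 120, b"y" * 120, b"z" * 120]
batch = pull_batch(msgs, 0, max_bytes=300)  # 300-byte budget fits two messages
```

Fetching hundreds of kilobytes per request amortizes the network round trip even though the consumer API hands messages to the application one at a time.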

avoid explicitly caching messages in memory at the Kafka layer. Instead, we rely on the underlying file system page cache.

avoiding double buffering: messages are cached only in the page cache.

normal operating system caching heuristics are very effective.

both the production and the consumption have consistent performance linear to the data size, up to many terabytes of data.

optimize the network access for consumers. Kafka is a multi-subscriber system and a single message may be consumed multiple times by different consumer applications.

Stateless broker

Unlike most other messaging systems, in Kafka, the information about how much each consumer has consumed is not maintained by the broker, but by the consumer itself. Such a design reduces a lot of the complexity and the overhead on the broker.

How to delete messages?

a broker doesn’t know whether all subscribers have consumed the message.

time-based SLA for the retention policy

A message is automatically deleted if it has been retained in the broker longer than a certain period, typically 7 days.
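Time-based retention can be sketched as dropping whole segments once they age past the SLA (a sketch under invented assumptions: segments here are `(created_at, data)` tuples, which is not Kafka's real on-disk format):

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # the typical 7-day retention period

def expire_segments(segments, now=None):
    """Keep only segments younger than the retention window."""
    now = time.time() if now is None else now
    return [(ts, data) for ts, data in segments if now - ts <= RETENTION_SECONDS]

now = 1_700_000_000
segments = [
    (now - 8 * 24 * 3600, b"old segment"),  # 8 days old -> deleted
    (now - 1 * 24 * 3600, b"new segment"),  # 1 day old  -> kept
]
kept = expire_segments(segments, now=now)
```

Because deletion is purely time-driven, the broker never needs to know which consumers have read what.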

There is an important side benefit of this design. A consumer can deliberately rewind back to an old offset and re-consume data. This violates the common contract of a queue, but proves to be an essential feature for many consumers.

If the consumer crashes, the unflushed data is lost

rewinding a consumer is much easier to support in the pull model than the push model.

Distributed Coordination

Each producer can publish a message to either a randomly selected partition or a partition semantically determined by a partitioning key and a partitioning function.
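A minimal partitioning function might look like this (illustrative only: the hash choice is an assumption, and Kafka's actual default partitioner differs):

```python
import hashlib
import random

def choose_partition(key, num_partitions):
    """Semantic partitioning: the same key always maps to the same partition.
    Without a key, fall back to a random partition."""
    if key is None:
        return random.randrange(num_partitions)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

p1 = choose_partition("user-42", 8)
p2 = choose_partition("user-42", 8)
# same key -> same partition, so all of user-42's events stay ordered
# within one partition
```

Keying by, say, user id guarantees per-user ordering, since Kafka only orders messages within a partition.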

Consumer Groups

Our goal is to divide the messages stored in the brokers evenly among the consumers, without introducing too much coordination overhead.

Each consumer group consists of one or more consumers that jointly consume a set of subscribed topics, i.e., each message is delivered to only one of the consumers within the group.

Our first decision is to make a partition within a topic the smallest unit of parallelism. This means that at any given time, all messages from one partition are consumed only by a single consumer within each consumer group. Had we allowed multiple consumers to simultaneously consume a single partition, they would have to coordinate who consumes what messages, which necessitates locking and state maintenance overhead. In contrast, in our design consuming processes only need to coordinate when the consumers rebalance the load, an infrequent event.
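A rebalance that gives each partition exactly one owner within a group can be sketched as a range-style split (names and the assignment scheme are invented for illustration; with fewer partitions than consumers, some consumers receive nothing):

```python
def assign_partitions(partitions, consumers):
    """Divide sorted partitions among sorted consumers as evenly as possible,
    so every partition has exactly one owner in the group."""
    partitions, consumers = sorted(partitions), sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, i = {}, 0
    for idx, consumer in enumerate(consumers):
        take = per + (1 if idx < extra else 0)  # earlier consumers absorb the remainder
        assignment[consumer] = partitions[i:i + take]
        i += take
    return assignment

a = assign_partitions(["t-p0", "t-p1", "t-p2"], ["c1", "c2"])
# -> {"c1": ["t-p0", "t-p1"], "c2": ["t-p2"]}
```

Because each consumer can compute the same deterministic assignment from the shared membership list, no locking is needed during normal consumption, only at rebalance time.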

Zookeeper - a highly available consensus service

not have a central “master” node, but instead let consumers coordinate among themselves in a decentralized fashion. Adding a master can complicate the system since we have to further worry about master failures.

Kafka uses Zookeeper for the following tasks:

  1. detecting the addition and the removal of brokers and consumers,
  2. triggering a rebalance process in each consumer when the above events happen
  3. maintaining the consumption relationship and keeping track of the consumed offset of each partition

Delivery Guarantees

Kafka only guarantees at-least-once delivery. Exactly-once delivery typically requires two-phase commits and is not necessary for our applications.

in the case when a consumer process crashes without a clean shutdown, the consumer process that takes over those partitions owned by the failed consumer may get some duplicate messages that are after the last offset successfully committed to ZooKeeper.

If an application cares about duplicates, it must add its own deduplication logic, either using the offsets that we return to the consumer or some unique key within the message
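Application-level deduplication by offset might look like this (a sketch; the data shapes and names are invented, and a real application would persist the seen set):

```python
def deduplicate(messages, seen_offsets):
    """Filter out redelivered messages using their offsets.
    At-least-once delivery can repeat messages after a crash; offsets
    (or a unique key inside the payload) identify the repeats."""
    fresh = []
    for offset, payload in messages:
        if offset in seen_offsets:
            continue  # already processed before the crash
        seen_offsets.add(offset)
        fresh.append(payload)
    return fresh

seen = {0, 5}  # offsets already processed before the crash
redelivered = [(0, b"a"), (5, b"b"), (12, b"c")]
out = deduplicate(redelivered, seen)  # only the unseen message survives
```

This shifts exactly-once semantics from the broker to the application, which is the trade the paper explicitly makes.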

If the storage system on a broker is permanently damaged, any unconsumed message is lost forever.

Consumption Management

http://www.slideshare.net/chriscurtin/ajug-march-2013-kafka slide 16

  • Kafka leaves management of what was consumed up to the business logic
  • Each message has a unique identifier (within the topic and partition)
  • Consumers can ask for messages by identifier, even if the message is days old
  • Identifiers are sequential within a topic and partition

Who

https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
