This is controlled by the partition assignment strategy configuration (`partition.assignment.strategy`). An important note: this is a consumer-side setting, and it describes how partitions are assigned to the members of a specific consumer group.
The default strategy is the RangeAssignor. For each topic, it divides the number of partitions by the number of consumer instances and assigns a contiguous range of partitions to each consumer. If the division does not come out even, the first few consumer instances (in sorted order) are assigned one extra partition each.
Suppose there are two consumers, C0 and C1, and two topics, t0 and t1, each with 3 partitions. This gives us the following partitions: t0p0, t0p1, t0p2, t1p0, t1p1, and t1p2.
The assignment for these will be:
C0: [t0p0, t0p1, t1p0, t1p1]
C1: [t0p2, t1p2]
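The per-topic range logic above can be sketched as a small simulation. This is not the actual Kafka client code, just an illustration of the arithmetic: each topic's partitions are split into contiguous ranges, and the first `remainder` consumers get one extra partition.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeAssign sketches the RangeAssignor logic for a single topic:
// partitions are split into contiguous ranges, and when the division
// is uneven, the first consumers (in sorted order) get one extra.
func rangeAssign(consumers []string, numPartitions int) map[string][]int {
	sort.Strings(consumers)
	assignment := make(map[string][]int)
	perConsumer := numPartitions / len(consumers)
	extra := numPartitions % len(consumers)
	p := 0
	for i, c := range consumers {
		n := perConsumer
		if i < extra {
			n++ // the first `extra` consumers get one more partition
		}
		for j := 0; j < n; j++ {
			assignment[c] = append(assignment[c], p)
			p++
		}
	}
	return assignment
}

func main() {
	// Two consumers and two topics with 3 partitions each, as above.
	// Each topic is assigned independently, so C0 ends up with the
	// first two partitions of both t0 and t1, and C1 with the last.
	for _, topic := range []string{"t0", "t1"} {
		fmt.Println(topic, rangeAssign([]string{"C0", "C1"}, 3))
	}
}
```

Running this prints `[0 1]` for C0 and `[2]` for C1 on both topics, matching the assignment shown above.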
A consumer group may subscribe to multiple topics. If two topics share the same key-partitioning scheme and the same number of partitions, we can join data across them.
In other words, if we had two topics partitioned by a UserID, with the same number of partitions, we can ensure that the same consumer instance consumes all events for any specific user (ordering, of course, is still only guaranteed per partition, and thereby per topic in this scenario). Adapting the previous example, suppose t0 and t1 each have only two partitions; the assignment will then be:
Ci: [t0p0, t1p0]
Cj: [t0p1, t1p1]
And the result of PartitionerFunc(UserID, len(partitions)) will be the same for both topics, so for any particular UserID, our messages would be consumed by the same consumer.
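A hash-based partitioner along these lines can be sketched as follows. This is a simplified stand-in for the PartitionerFunc mentioned above, not any specific client's implementation (though Go clients such as Sarama use an FNV-1a hash similarly by default). The key point is that the partition index depends only on the key and the partition count, never on the topic name:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor sketches a hash-based partitioner: hash the key,
// then take the result modulo the number of partitions.
func partitionFor(key string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	// userID is a hypothetical key; both topics have 2 partitions.
	// Because the topic name is not an input, the same key maps to
	// the same partition index in every topic with that count.
	userID := "user-42"
	fmt.Println(partitionFor(userID, 2) == partitionFor(userID, 2))
	fmt.Println("partition:", partitionFor(userID, 2))
}
```

Combined with range assignment, this is what guarantees that one consumer instance sees all of a given user's events across both topics.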
For example, you might be listening to a central stream of events and have a separate retry topic (for events that failed to be handled). Both topics are partitioned by UserID, and you want the same consumer to consume all of a particular user's events, perhaps so that each consumer instance can maintain a user-based cache.
This feature also enables us to do "joins" of data across multiple topics partitioned by the same key. This is the feature that Kafka Streams is built on.
Using the above example: if the partition counts were mismatched, events with the same key could end up being consumed by different consumer instances in the group.
- A more detailed overview of the other available partition assignor strategies: https://medium.com/streamthoughts/understanding-kafka-partition-assignment-strategies-and-how-to-write-your-own-custom-assignor-ebeda1fc06f3