For the following architectures we assume that you are using the acks=all producer property.
Note: even if you did not explicitly set acks=all, or set it to something other than all, the producer will set it to all in the background if you are using EOS, idempotence or transactions.
Note: when choosing anything other than acks=all, you have no durability guarantee: messages can be lost on any kind of failure.
Note: when using more brokers than the replication factor, you reduce the possibility of data loss, but you are less available when using a quorum min.insync.replicas, as partitions are scattered across brokers. Please set the replication factor and min.insync.replicas adequately.
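As a sketch, the relevant settings might look like the fragment below. The values are illustrative, not a recommendation for your cluster:

```properties
# Producer side: durability-oriented settings
acks=all
enable.idempotence=true

# Broker / topic side (broker defaults; replication factor can also
# be set per topic at creation time)
default.replication.factor=3
min.insync.replicas=2
```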
The rack awareness feature spreads replicas of the same partition across different racks. This extends the guarantees Kafka provides for broker failures to cover rack failures, limiting the risk of data loss should all the brokers on a rack fail at once. The feature can also be applied to other broker groupings, such as DCs.
You can specify that a broker belongs to a particular DC by adding a property to the broker config: broker.rack=dc1
When a topic is created or modified, or when replicas are redistributed, the rack constraint will be honoured, ensuring replicas span as many racks as they can (a partition will span min(#racks, replication-factor) different racks).
For more information please refer to https://docs.confluent.io/current/kafka/post-deployment.html .
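To illustrate the min(#racks, replication-factor) property, here is a toy round-robin sketch of rack-aware assignment. This is not Kafka's actual algorithm (which also balances leader counts and randomizes start positions); it only shows how interleaving racks makes each partition's replicas span as many racks as possible:

```python
def assign_replicas(brokers_by_rack, num_partitions, replication_factor):
    """Toy rack-aware assignment: interleave brokers by rack, then take
    `replication_factor` consecutive brokers per partition, so replicas
    land on as many distinct racks as possible."""
    racks = sorted(brokers_by_rack)
    # Flatten alternating racks: rack0[0], rack1[0], ..., rack0[1], rack1[1], ...
    ordered = []
    max_len = max(len(brokers_by_rack[r]) for r in racks)
    for i in range(max_len):
        for r in racks:
            if i < len(brokers_by_rack[r]):
                ordered.append((r, brokers_by_rack[r][i]))
    assignment = {}
    for p in range(num_partitions):
        assignment[p] = [ordered[(p + i) % len(ordered)] for i in range(replication_factor)]
    return assignment

# 2 DCs acting as racks (broker.rack=dc1 / broker.rack=dc2), 2 brokers each
assignment = assign_replicas({"dc1": [0, 1], "dc2": [2, 3]},
                             num_partitions=4, replication_factor=3)
for p, replicas in assignment.items():
    # replication_factor (3) exceeds #racks (2), so every partition spans both DCs
    assert len({rack for rack, _ in replicas}) == min(2, 3)
```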
Failure | Replication | min.isr | Available | No Data Loss |
---|---|---|---|---|
1 Broker failure | 3 | 1 | ✓ | ✗ |
Network partition | 3 | 1 | ✓ | ✗ |
DC1 is dead | 3 | 1 | ✓ | ✗ |
DC2 is dead | 3 | 1 | ✓ | ✗ |
1 Broker failure | 3 | 2 | ✓ | ✓ |
Network partition | 3 | 2 | ✓ | ✓ |
DC1 is dead | 3 | 2 | ✗ | ✗ |
DC2 is dead | 3 | 2 | ✓ | ✓ |
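The broker-failure rows above follow a simple rule as long as the ZK/controller quorum stays intact: with acks=all, a partition accepts writes while at least min.insync.replicas replicas survive, and an acked write can only be lost if every replica that acked it fails. A hedged sketch of that model, treating a dead DC as losing the replicas it hosts (it deliberately ignores quorum loss and leader-election details):

```python
def partition_state(replication, min_isr, failed_replicas):
    """Simplified model of one partition under acks=all, assuming the
    ZK/controller quorum is still reachable."""
    # Writable only while the in-sync set can still reach min.isr
    available = replication - failed_replicas >= min_isr
    # An acked write sits on >= min_isr replicas, so losing fewer than
    # min_isr replicas cannot destroy every copy of it
    no_data_loss = failed_replicas < min_isr
    return available, no_data_loss

# 1 broker failure, replication=3
assert partition_state(3, 1, 1) == (True, False)   # available, may lose data
assert partition_state(3, 2, 1) == (True, True)    # available, no data loss
# DC1 (2 brokers) is dead, min.isr=2
assert partition_state(3, 2, 2) == (False, False)
```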
Only DC 2 can die if you want to stay available.
If DC1 dies, you can use the observer on DC2 to recreate a quorum. If you know that DC1 will take a long time to recover from its failure, you need to stop Kafka and accept data loss when DC1 is able to rejoin the cluster.
This design should not be chosen, as it gives you fewer benefits than the previous one.
Upon a network partition, when the leader was on a DC1 broker, a new leader will be elected on DC2. When the network is restored, the log will have diverged on both DCs.
In order to get back to a predictable state, Kafka will truncate the log of the stale leader (the one with the older leader epoch), thus effectively removing data that has been acked on DC1.
For more information, please read https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation[KIP-101 - Alter Replication Protocol to use Leader Epoch rather than High Watermark for Truncation]
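The leader-epoch mechanism from KIP-101 can be sketched as follows: each log records the start offset of every leader epoch; on reconnect, the stale leader asks the new leader for the end offset of its own latest epoch and truncates everything past that point. A simplified illustration (not Kafka's actual code; epoch maps are hypothetical examples):

```python
def end_offset_for_epoch(epoch_starts, log_end, epoch):
    """End offset (exclusive) of `epoch` in a log described by an
    {epoch: start_offset} map plus the log end offset."""
    later_starts = [start for e, start in epoch_starts.items() if e > epoch]
    return min(later_starts) if later_starts else log_end

def truncate_diverged_follower(follower_epochs, follower_end, leader_epochs, leader_end):
    """The rejoining replica asks the new leader how far its latest epoch
    is valid, then truncates everything written past that offset."""
    latest = max(follower_epochs)
    valid_until = end_offset_for_epoch(leader_epochs, leader_end, latest)
    return min(follower_end, valid_until)

# DC1's old leader wrote offsets 100-109 in epoch 1 that DC2 never saw;
# after the partition, DC2's new leader started epoch 2 at offset 100.
old_leader = {0: 0, 1: 50}           # epoch -> start offset
new_leader = {0: 0, 1: 50, 2: 100}
# Old leader rejoins as follower and truncates back to offset 100,
# discarding the acked-but-unreplicated DC1 writes.
assert truncate_diverged_follower(old_leader, 110, new_leader, 120) == 100
```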
Please use the DC 1: 2 Brokers - 2 ZK | DC2 : 1 Broker - 1 ZK design.
Failure | Replication | min.isr | Available | No Data Loss |
---|---|---|---|---|
1 Broker failure | 3 | 1 | ✓ | ✗ |
Network partition | 3 | 1 | ✓ | ✗ |
DC1 is dead | 3 | 1 | ✓ | ✗ |
DC2 is dead | 3 | 1 | ✓ | ✗ |
1 Broker failure | 3 | 2 | ✓ | ✓ |
Network partition | 3 | 2 | ✓ | ✗ |
DC1 is dead | 3 | 2 | ✓ | ✗ |
DC2 is dead | 3 | 2 | ✓ | ✗ |
Failure | Replication | min.isr | Available | No Data Loss |
---|---|---|---|---|
1 Broker failure | 4 | 1 | ✓ | ✗ |
Network partition | 4 | 1 | ✓ | ✗ |
DC1 is dead | 4 | 1 | ✓ | ✗ |
DC2 is dead | 4 | 1 | ✓ | ✗ |
1 Broker failure | 4 | 2 | ✓ | ✓ |
Network partition | 4 | 2 | ✓ | ✗ |
DC1 is dead | 4 | 2 | ✓ | ✗ |
DC2 is dead | 4 | 2 | ✓ | ✗ |
1 Broker failure | 4 | 3 | ✓ | ✓ |
Network partition | 4 | 3 | ✗ | ✓ |
DC1 is dead | 4 | 3 | ✗ | ✗ |
DC2 is dead | 4 | 3 | ✗ | ✗ |
DC 1 or DC 2 die: same consequences.
If DC1 dies, you can use an observer on DC2 to recreate a quorum by updating its type from observer to participant.
If you know that the partition will take a long time to heal, you need to stop Kafka and accept data loss when the isolated side is able to rejoin the cluster.
Same properties as the DC 1: 2 Brokers - 2 ZK | DC2 : 2 Brokers - 1 ZK infrastructure, except you do not need to perform the observer-to-participant update.
If DC1 dies, you can use the observer on DC2 to recreate a quorum.
If you know that DC1 will take a long time to recover from its failure, you need to stop Kafka and accept data loss when DC1 is able to rejoin the cluster.
Failure | Replication | min.isr | Available | No Data Loss |
---|---|---|---|---|
1 Broker failure | 3 | 1 | ✓ | ✗ |
1 Network partition | 3 | 1 | ✓ | ✗ |
DC1 is dead | 3 | 1 | ✓ | ✗ |
DC2 is dead | 3 | 1 | ✓ | ✗ |
DC3 is dead | 3 | 1 | ✓ | ✗ |
1 Broker failure | 3 | 2 | ✓ | ✓ |
1 Network partition | 3 | 2 | ✓ | ✓ |
DC1 is dead | 3 | 2 | ✓ | ✓ |
DC2 is dead | 3 | 2 | ✓ | ✓ |
DC3 is dead | 3 | 2 | ✓ | ✓ |
When using rack awareness to balance data across multiple DCs, you need to be aware that Kafka will only use the broker.rack information for replica assignment.
Thus, if you want to be sure that data is in both DCs, you need to set min.isr = floor(replication / 2) + 1.
The obvious consequence of choosing a min.insync.replicas that guarantees the data to be in both DCs is that in the event of a network partition or a DC failure, you will NOT be able to write.
The non-obvious consequence is that you will not be able to read either, as reading with a consumer group, or via Kafka Connect, requires writing offsets back to Kafka.
For more information, please read https://cwiki.apache.org/confluence/display/KAFKA/KIP-36+Rack+aware+replica+assignment[KIP-36 Rack aware replica assignment].
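The formula above works out as a strict majority. A quick check, assuming replicas are split evenly across the two DCs (for an even replication factor):

```python
import math

def quorum_min_isr(replication):
    """min.isr = floor(replication / 2) + 1: a strict majority of replicas,
    which cannot fit entirely inside one DC's half of an even split."""
    return math.floor(replication / 2) + 1

# With 4 replicas split 2/2 across the DCs, min.isr = 3 means any
# in-sync set must include at least one replica from each DC.
assert quorum_min_isr(4) == 3
assert quorum_min_isr(3) == 2
```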