@framiere
Last active August 9, 2018 11:24

Architecture trade-offs

For the following architectures, we assume that you are using the acks=all producer property.

Note: whether you explicitly set acks=all or set something different than all, if you are using EOS, idempotence, or transactions, the producer will set it to all in the background.
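As a sketch, here is what the corresponding producer configuration fragment might look like (values are illustrative):

```
# producer.properties sketch: enabling idempotence implies acks=all;
# the client rejects an explicitly conflicting acks setting
enable.idempotence=true
acks=all
```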

Note: when choosing anything different than acks=all, you can lose messages under any of the failures below.

Note: when using more brokers than the replication factor, data loss is less likely, but you are less available when relying on the min.insync.replicas quorum, as partitions are scattered across brokers. Please set the replication factor and min.insync.replicas adequately.
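For example, the replication factor and min.insync.replicas can be set per topic at creation time (topic name, partition count, and ZooKeeper address are placeholders):

```
kafka-topics --create \
  --zookeeper zk1:2181 \
  --topic my-topic \
  --partitions 6 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```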

Rack awareness

The rack awareness feature spreads replicas of the same partition across different racks. This extends the guarantees Kafka provides for broker-failure to cover rack-failure, limiting the risk of data loss should all the brokers on a rack fail at once. The feature can also be applied to other broker groupings such as DC.

You can specify that a broker belongs to a particular DC by adding a property to the broker config: broker.rack=dc1
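A minimal sketch of the broker-side setting (the broker id is illustrative):

```
# server.properties on every broker located in DC 1
broker.id=1
broker.rack=dc1
```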

When a topic is created, modified or replicas are redistributed, the rack constraint will be honoured, ensuring replicas span as many racks as they can (a partition will span min(#racks, replication-factor) different racks).

For more information please refer to https://docs.confluent.io/current/kafka/post-deployment.html .

DC 1: 2 Brokers - 2 ZK | DC2 : 1 Broker - 1 ZK

| Failure | Replication | min.isr | Available | No Data Loss |
| --- | --- | --- | --- | --- |
| 1 Broker failure | 3 | 1 | | |
| Network partition | 3 | 1 | | |
| DC1 is dead | 3 | 1 | | |
| DC2 is dead | 3 | 1 | | |
| 1 Broker failure | 3 | 2 | | |
| Network partition | 3 | 2 | | |
| DC1 is dead | 3 | 2 | | |
| DC2 is dead | 3 | 2 | | |

Only DC 2 can die if you want availability

Observer shuffle:

If DC1 dies, you can use the observer on DC2 to recreate a quorum. If you know that DC1 will take a long time to recover from its failure, you need to stop Kafka and accept data loss once DC1 is able to join back the cluster.
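A sketch of what the ZooKeeper side of this might look like (host names are hypothetical). The DC2 node runs as a non-voting observer; the shuffle consists of changing its role to participant and restarting it:

```
# zoo.cfg fragment shared by all ZK nodes
server.1=zk1-dc1:2888:3888
server.2=zk2-dc1:2888:3888
server.3=zk1-dc2:2888:3888:observer

# additionally, on zk1-dc2 only:
peerType=observer
```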

DC 1: 2 Brokers - 1 ZK | DC 2: 1 Broker - 2 ZK

This design should not be chosen, as it gives you fewer benefits than the previous one.

Upon a network partition, when a leader was on a DC 1 broker, a new leader will be elected on DC 2. When the network is restored, the logs will have diverged between both DCs.

In order to get back to a predictable state, Kafka will truncate the log of the replica with the oldest leader epoch, thus effectively removing data that had been acked on DC 1.

For more information, please read KIP-101 - Alter Replication Protocol to use Leader Epoch rather than High Watermark for Truncation: https://cwiki.apache.org/confluence/display/KAFKA/KIP-101+-+Alter+Replication+Protocol+to+use+Leader+Epoch+rather+than+High+Watermark+for+Truncation

Please use the DC 1: 2 Brokers - 2 ZK | DC 2: 1 Broker - 1 ZK design instead.

DC 1: 2 Brokers - 2 ZK | DC 2: 2 Brokers - 1 ZK

| Failure | Replication | min.isr | Available | No Data Loss |
| --- | --- | --- | --- | --- |
| 1 Broker failure | 3 | 1 | | |
| Network partition | 3 | 1 | | |
| DC1 is dead | 3 | 1 | | |
| DC2 is dead | 3 | 1 | | |
| 1 Broker failure | 3 | 2 | | |
| Network partition | 3 | 2 | | |
| DC1 is dead | 3 | 2 | | |
| DC2 is dead | 3 | 2 | | |

| Failure | Replication | min.isr | Available | No Data Loss |
| --- | --- | --- | --- | --- |
| 1 Broker failure | 4 | 1 | ✓ | ✗ |
| Network partition | 4 | 1 | ✓ | ✗ |
| DC1 is dead | 4 | 1 | ✓ | ✗ |
| DC2 is dead | 4 | 1 | ✓ | ✗ |
| 1 Broker failure | 4 | 2 | ✓ | ✓ |
| Network partition | 4 | 2 | ✓ | ✗ |
| DC1 is dead | 4 | 2 | ✓ | ✗ |
| DC2 is dead | 4 | 2 | ✓ | ✗ |
| 1 Broker failure | 4 | 3 | ✓ | ✓ |
| Network partition | 4 | 3 | ✗ | ✓ |
| DC1 is dead | 4 | 3 | ✗ | ✗ |
| DC2 is dead | 4 | 3 | ✗ | ✗ |

If DC 1 or DC 2 dies, the consequences are the same.

Observer to participant update:

If DC1 dies, you can use an observer on DC2 to recreate a quorum by updating its type from observer to participant.

If you know that the partition will take a long time to recover, you need to stop Kafka and accept data loss once the failed DC is able to join back the cluster.

DC 1: 2 Brokers - 1 ZK | DC 2: 2 Brokers - 1 ZK | DC 3: 0 Broker - 1 ZK

Same properties as the DC 1: 2 Brokers - 2 ZK | DC 2: 2 Brokers - 1 ZK infrastructure, except that you do not need to perform the observer to participant update.

Observer shuffle:

If DC1 dies, you can use the observer on DC2 to recreate a quorum.

If you know that DC1 will take a long time to recover from its failure, you need to stop Kafka and accept data loss once DC1 is able to join back the cluster.

DC 1: 1 Broker - 1 ZK | DC 2: 1 Broker - 1 ZK | DC 3: 1 Broker - 1 ZK

| Failure | Replication | min.isr | Available | No Data Loss |
| --- | --- | --- | --- | --- |
| 1 Broker failure | 3 | 1 | | |
| Network partition | 3 | 1 | | |
| DC1 is dead | 3 | 1 | | |
| DC2 is dead | 3 | 1 | | |
| DC3 is dead | 3 | 1 | | |
| 1 Broker failure | 3 | 2 | | |
| Network partition | 3 | 2 | | |
| DC1 is dead | 3 | 2 | | |
| DC2 is dead | 3 | 2 | | |
| DC3 is dead | 3 | 2 | | |

Rack awareness on a 2-DC stretch cluster

When using rack awareness to balance data across multiple DCs, you need to be aware that Kafka only uses the broker.rack information for replica assignment. Thus, if you want to be sure that the data lands in both DCs, you need min.isr = floor((replication / 2) + 1).
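The formula can be sanity-checked with a few lines of plain Python (no Kafka required); the function name is ours:

```python
import math

def min_isr_spanning_two_dcs(replication_factor: int) -> int:
    """Smallest min.insync.replicas forcing every acked write to reach
    both DCs, assuming replicas are balanced across the 2 DCs."""
    return math.floor((replication_factor / 2) + 1)

# With 4 replicas (2 per DC), any 3 in-sync replicas must include
# at least one broker from each DC.
print(min_isr_spanning_two_dcs(4))  # -> 3
print(min_isr_spanning_two_dcs(3))  # -> 2
```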

The obvious consequence of choosing a min.insync.replicas that guarantees the data to be in both DCs is that, in the event of a network partition or a DC failure, you will NOT be able to write.

The non-obvious consequence is that you will not be able to read either, as reading with a consumer group, or via Kafka Connect, requires writing offsets back to Kafka.

For more information, please read KIP-36 Rack aware replica assignment: https://cwiki.apache.org/confluence/display/KAFKA/KIP-36+Rack+aware+replica+assignment
