Event Hubs provides recovery and replication features to ensure that your data is safe and available in the case of a disaster. Data replication to a secondary region introduces some problems because the secondary region will not have the same offsets as the primary region. We cannot use offset because the offset within the secondary region will not match the offset in the primary region. The suggested solution is to use an epoch to denote the number of times the primary region changes.
- Sequence number: Long denoting the order in which events were sent to the Event Hub.
- Replication segment: Integer denoting the replication configuration.
- In service design documentation, they call this "replication epoch". Choosing replication segment will avoid confusion for the definition of "epoch".
- Offset: Long denoting the position of an event within a partition of a region, same event will have different offsets once replicated
- Add
Integer getReplicationSegment()
to EventData. - Add
Integer getLastEnqueuedReplicationSegment()
toPartitionProperties
- Add
Integer getBeginningReplicationSegment()
toPartitionProperties
- Add
Integer getReplicationSegment()
toCheckpoint
. - Add
Integer getReplicationSegment()
toLastEnqueuedEventProperties
.- NOTE: Unsure about this one. Do we fully name it? The class already has the name. Is this populated when geo-DR is enabled?
- Add overloads to
EventPosition
to create a position using epoch.EventPosition fromSequenceNumber(long sequenceNumber, int replicationSegment)
EventPosition fromSequenceNumber(long sequenceNumber, int replicationSegment, boolean isInclusive)
- Add "com.microsoft:georeplication" to DesiredCapabilities when creating a receive or send link.
- There is no current scenario regarding the producer, but ZZ mentions it may allow for flexibility.
- Update receiver filter to include both replication segment and sequence number in the form "amqp.annotation.x-opt-sequence-number" '<ReplicationSegment>:<SequenceNumber>'.
- Example: Replication segment is 10, sequence number is 1555. The field "amqp.annotation.x-opt-sequence-number" contains 10:1555
- Extract replication segment to populate
PartitionProperties
. Keys arebegin_sequence_number_epoch
andlast_enqueued_sequence_number_epoch
in the map. - Extract replication segment to populate
EventData.getReplicationSegment()
andSystemProperties
. Key isx-opt-sequence-number-epoch
. - Use both replication segment and sequence number to consume events.
- When checkpointing, include the replication segment in the checkpoint.
- Metadata key: "replicationsegment"
- Prioritise looking for sequence number and replication segment rather than offset in our internal checkpoint logic.
- IIRC, our current logic checks for the Checkpoint's offset, then sequence number.
- When
trackLastEnqueuedEventProperties
is set to true, extract replication segment value from AMQP message annotation section.
Scenario for existing checkpoints:
- Customer was using v1.0 of the library to process their events and persisted checkpoints.
- Customer upgrades to v1.1 with GeoDR support.
- Starts up their app and begins processing the same Event Hub.
Scenario for non-existing checkpoints:
- Customer updates to v.1.1 with GeoDR support.
- Starts processing events on a new Event Hub.
What the SDK would do:
- The client sends the replication capability (automatically, no explicit opt-in).
- A reader requests a starting position with sequence number, but no epoch specified.
- Client library can either pass in -1 or "" (empty string). Service will assume a default.
- FYI. No implementation on our end.
- Old client library does not send the "com.microsoft:georeplication" property.
- Service will either assume an epoch by default or have an opt-in to fail.
- The client sends the replication capability (automatically, no explicit opt-in)
- Service recognises it is not GeoDR, and ignores the capability.
- Works as it normally does:
- A reader requests a starting position with sequence number, but no epoch specified
- A reader requests a starting position with offset – the service accepts this as valid
- A reader requests a starting position with an enqueue time – the service accepts this as valid
- All the same behaviour as existing library.
Looks good to me, a couple of small thoughts:
Replication Segment
as our term for this newepoch
overload.