@rajibkuet07
Last active April 21, 2024 13:30

Retry Mechanism in Kafka

  • Create a database table for the retry events with these columns:
    • id
    • topic — topic name of the failed message
    • partition — partition number of the failed message
    • message — the original message payload
    • hash — md5 of the unique identifier for the failed message
    • offset — offset of the failed message
    • counter — number of retries so far for this message
    • nextEventTime — DateTime; gradually increases based on the counter
    • published — true/false; whether a retry message has been produced to the topic again
    • handled — true/false; whether the retried message has been handled again
    • status — success/error; outcome of the event when handled again
  • When an error happens in any consumer, emit an event to the retry topic. The retry message consists of:
    • hash — md5 of the unique identifier for the failed message
    • topic — topic name of the failed message
    • partition — partition number of the failed message
    • message — actual payload of the failed message
    • error — error data or string describing why the message failed
    • fallback — optional; {topic: "fallback topic name", value: "fallback message value"}; if the max retry count is exceeded, produce this value to the fallback topic
    • configs — optional; per-event settings such as the initial retry interval, max retry time, and max retry count
  • The retry topic consumer stores a retry event row for the failed message in the database
  • Add an API endpoint so that the failed messages stored in the database can be replayed from outside. For example, a server-side cron job can hit this endpoint periodically to produce the failed messages again
  • In the API endpoint: if any events in the database have nextEventTime less than or equal to the current time, produce them to their topics again with the same stored messages. After producing the retry events, update the published status immediately to avoid the race condition of producing the same events multiple times
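The table schema and the replay endpoint's selection step can be sketched in TypeScript (the reference implementation linked below is TypeScript). The `RetryEvent` shape, `md5` helper, and `dueEvents` function are illustrative names, not taken from the actual code:

```typescript
import { createHash } from "crypto";

// Hypothetical shape of one row in the retry-events table described above.
interface RetryEvent {
  id: number;
  topic: string;
  partition: number;
  message: string;
  hash: string;        // md5 of the failed message's unique identifier
  offset: number;
  counter: number;     // retries so far for this message
  nextEventTime: Date; // when the event becomes due for republishing
  published: boolean;  // set once a retry message is produced again
  handled: boolean;    // set once the retried message is handled again
  status: "success" | "error" | null;
}

// md5 helper for the hash column.
function md5(uniqueId: string): string {
  return createHash("md5").update(uniqueId).digest("hex");
}

// Selection step of the replay endpoint: events that are due, not yet
// republished, and not yet handled. The endpoint would produce these and
// immediately set published = true to avoid the double-produce race.
function dueEvents(rows: RetryEvent[], now: Date): RetryEvent[] {
  return rows.filter(
    (r) => !r.published && !r.handled && r.nextEventTime <= now
  );
}
```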
  • Steps In Consumer That Could Fail
    • Consumer on Success
      • Check the db for a row with handled = false and hash = md5(the message's unique identifier)
      • If not found then do nothing
      • If found then update handled = true and status = success
    • Consumer on Fail
      • Emit event to the retry topic with specific data in the message
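The success and failure paths above can be sketched as follows, using an in-memory stand-in for the database; `RetryRow`, `markRetrySucceeded`, and `buildRetryMessage` are hypothetical names introduced for illustration:

```typescript
import { createHash } from "crypto";

// Minimal stand-in for a retry-events row (full schema is listed above).
interface RetryRow {
  hash: string;
  handled: boolean;
  status: "success" | "error" | null;
}

// Message shape emitted to the retry topic on failure, mirroring the notes.
interface RetryMessage {
  hash: string;
  topic: string;
  partition: number;
  message: string;
  error: string;
  fallback?: { topic: string; value: unknown };
  configs?: { initialIntervalMs?: number; maxRetryCount?: number };
}

// Consumer on success: if an unhandled retry row exists for this hash,
// close it out; otherwise do nothing.
function markRetrySucceeded(rows: RetryRow[], hash: string): void {
  const row = rows.find((r) => !r.handled && r.hash === hash);
  if (row) {
    row.handled = true;
    row.status = "success";
  }
}

// Consumer on failure: build the event to emit to the retry topic.
function buildRetryMessage(
  topic: string,
  partition: number,
  message: string,
  uniqueId: string,
  error: string
): RetryMessage {
  const hash = createHash("md5").update(uniqueId).digest("hex");
  return { hash, topic, partition, message, error };
}
```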
  • Steps In The Retry Topic Consumer
    • Check db for handled = false and hash = md5(unique identifier for the failed message)
    • If found, update the found row with handled = true and status = error
    • Add a new row with the next retry event if the counter does not exceed the max retry count
    • In the new row, set counter = (previous counter || 0) + 1 and nextEventTime = currentTime + 2^counter * initial interval (e.g. 60000 ms, from configs or env)
    • If the counter of the found row equals the max retry count from the configs or env, check whether the retry topic message carries fallback details (fallback = {topic: string, value: any})
    • If fallback details are found, emit an event to the fallback topic with the value from the fallback object
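The backoff and fallback decision above can be sketched as a pure function. The `MAX_RETRY_COUNT` of 5 and the 60000 ms initial interval are assumed example values that would normally come from configs or env:

```typescript
const INITIAL_INTERVAL_MS = 60_000; // initial retry interval (assumed; from configs/env)
const MAX_RETRY_COUNT = 5;          // max retry count (assumed; from configs/env)

// nextEventTime = currentTime + 2^counter * initial interval.
function nextEventTime(now: Date, counter: number): Date {
  return new Date(now.getTime() + Math.pow(2, counter) * INITIAL_INTERVAL_MS);
}

// What the retry-topic consumer does after closing out the found row.
type Decision =
  | { kind: "schedule"; counter: number; nextEventTime: Date } // insert a new retry row
  | { kind: "fallback"; topic: string; value: unknown }        // produce to the fallback topic
  | { kind: "drop" };                                          // no fallback given: give up

function decide(
  prevCounter: number,
  now: Date,
  fallback?: { topic: string; value: unknown }
): Decision {
  if (prevCounter < MAX_RETRY_COUNT) {
    const counter = prevCounter + 1; // (previous counter || 0) + 1
    return { kind: "schedule", counter, nextEventTime: nextEventTime(now, counter) };
  }
  return fallback
    ? { kind: "fallback", topic: fallback.topic, value: fallback.value }
    : { kind: "drop" };
}
```

The doubling interval means a message retried 5 times waits 2, 4, 8, 16, then 32 minutes between attempts before the fallback (if any) fires.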

help - https://github.com/uptool/gmb-site-backend/blob/master/src/global/kafka/retry-event-consumer.service.ts
