@sheac
Last active April 2, 2018 23:20
PA-13238 :: Result of Kafka Failed Message Investigation

Question

We are dropping events <1MB. Investigate frequency and identify any possible patterns to determine cause.

Jira Task: https://parsable.atlassian.net/browse/PA-13238


Results

Cause of dropped messages

Every case where we dropped messages under the 1MB limit can likely be attributed to the Kafka cluster temporarily being in a non-functional state.

Splunk shows two distinct episodes where sub-threshold messages were dropped:

  1. 2018-Jan-30 T 07:30-08:15 UTC

  2. 2018-Feb-25 T 06:00-06:30 UTC

Both of these episodes correspond to unusual and problematic cluster state. Looking at them together as a pair of data points gives us more confidence in this conclusion than we could derive from either episode alone.

Episode 1.

The first episode takes place in the direct aftermath of a deploy that likely put the cluster in a non-operational state.

(Note: "the cluster" here refers to the EU cluster)

The change being deployed was a Kafka broker configuration update that required a server restart. In a rolling restart, each host is taken down and restored one at a time, and care is taken to ensure that the next host is not pulled out until the cluster has fully re-integrated the previous one.

By contrast, this deploy was rapid and didn't ensure that the cluster had recovered to a "normal" state between host restarts. It's reasonable to suppose that, at some point, two or even three hosts were non-functional simultaneously. That would have made it impossible for the cluster to elect a leader, and under that circumstance we would expect to see exactly the "Leader not found" errors that we did.

If necessary, we can read through the logs on each machine to determine exactly what happened and fully validate this theory.

Episode 2.

The second episode corresponds to what Nathan Grennan described on Slack as "zoo3 going down".

This was also a cluster issue on EU. In this case, zoo3 "mysteriously stopped", causing a "flood of alerts".

Again, we can go back to the logs to determine the exact nature of the failure (for instance, whether zoo3 was the leader at the time). But given that enough alerts were triggered to wake Nathan up at 01:00 with a "flood", it's likely this failure explains the "Leader not found" errors.

Are our producers brittle in the face of cluster changes?

For example, if a node is taken out of the cluster, is a Sarama producer able to continue? Mark suspects not...


Recommendation

Increasing time spent retrying

Since these dropped messages were the result of an unavailable cluster, preventing message loss entails more than our current retry support (default Sarama config: 3 retries with 100ms backoff between them).

That means working on https://parsable.atlassian.net/browse/PA-14363.
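For context, the retry knobs in question live on Sarama's producer configuration. The sketch below shows roughly what a PA-14363-style change would touch; the broker address and the commented-out "longer retry" values are placeholders, not decided numbers.

```go
package main

import (
	"time"

	"github.com/Shopify/sarama"
)

func newProducer() (sarama.SyncProducer, error) {
	config := sarama.NewConfig()
	config.Producer.Return.Successes = true // required by the sync producer

	// Sarama's defaults: give up after 3 retries, waiting 100ms between attempts.
	config.Producer.Retry.Max = 3
	config.Producer.Retry.Backoff = 100 * time.Millisecond

	// The kind of change PA-14363 contemplates: retry long enough to ride
	// out a temporarily leaderless cluster. Values are illustrative only.
	// config.Producer.Retry.Max = 60
	// config.Producer.Retry.Backoff = 30 * time.Second

	brokers := []string{"kafka-1:9092"} // placeholder broker address
	return sarama.NewSyncProducer(brokers, config)
}
```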

Actually, I don't recommend this anymore.

The prior plan was more or less:

When the cluster is in a bad state, reduce lost messages by increasing time spent retrying

But it turns out that this could cause significantly increased latency and reduced throughput.

Suppose we did in fact change the retry frequency and count such that the retry state would persist until the cluster returned to normal. This would likely involve significantly increasing the backoff between retries, otherwise we run the risk of overloading our brokers. At present the backoff is 250 milliseconds. Let's suppose we think a "safe" value is something like 10 to 60 seconds.

It turns out that our production Kafka producers often enter states where they retry messages at a regular rate for periods of minutes to hours. During these periods, log messages stating "state change to [retrying]" appear at a rate on the order of one per minute. I don't believe these spells are severe enough to cause a bad user experience or to trigger alarms, but I haven't taken the time to correlate them with PagerDuty alerts, so I can't be certain.

On about 10% of the last 90 days, producers fell into one of these spells.

At present, the backoff between retry attempts is 250ms. That rapid retry rate may be what keeps this phenomenon from having an effect outside the cluster. If we were to multiply that backoff by 100 (to 25 seconds, close to the middle of the 10-60 second range cited above), it could lead to noticeable issues.
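As a back-of-the-envelope illustration of that concern (the retry counts below are illustrative only, and the true cost also includes per-request timeouts):

```go
package main

import (
	"fmt"
	"time"
)

// worstCaseRetryDelay approximates the time a single message can spend
// waiting between retries: roughly Retry.Max * Retry.Backoff.
func worstCaseRetryDelay(maxRetries int, backoff time.Duration) time.Duration {
	return time.Duration(maxRetries) * backoff
}

func main() {
	// Today: 3 retries x 250ms adds at most ~0.75s before we give up.
	fmt.Println(worstCaseRetryDelay(3, 250*time.Millisecond)) // 750ms

	// With a 100x backoff (25s) and enough retries to ride out a
	// 45-minute outage, one message could be in flight for ~45 minutes.
	fmt.Println(worstCaseRetryDelay(108, 25*time.Second)) // 45m0s
}
```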

Before I issue another recommendation, I want to be certain that the apparently dropped messages weren't properly delivered on subsequent retries.

I wasn't able to find a way to tell which messages had or hadn't been successfully accepted by the brokers.
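One way we could make that observable going forward (a sketch only; none of this is in place today, and the broker address and topic are placeholders) is to drain the async producer's Successes and Errors channels and log both outcomes, so Splunk could tell accepted messages apart from dropped ones:

```go
package main

import (
	"log"
	"sync"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	config.Producer.Return.Successes = true // off by default; needed to read Successes()
	config.Producer.Return.Errors = true    // on by default; shown for clarity

	producer, err := sarama.NewAsyncProducer([]string{"kafka-1:9092"}, config) // placeholder broker
	if err != nil {
		log.Fatal(err)
	}

	var wg sync.WaitGroup
	wg.Add(2)

	// Positive record of delivery: partition/offset are only assigned once
	// a broker has actually accepted the message.
	go func() {
		defer wg.Done()
		for msg := range producer.Successes() {
			log.Printf("delivered topic=%s partition=%d offset=%d",
				msg.Topic, msg.Partition, msg.Offset)
		}
	}()

	// Negative record: anything that exhausted its retries.
	go func() {
		defer wg.Done()
		for perr := range producer.Errors() {
			log.Printf("dropped topic=%s err=%v", perr.Msg.Topic, perr.Err)
		}
	}()

	producer.Input() <- &sarama.ProducerMessage{
		Topic: "events", // placeholder topic
		Value: sarama.StringEncoder("example payload"),
	}

	// Flush and shut down; the result channels close once fully drained.
	producer.AsyncClose()
	wg.Wait()
}
```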


Misc. Notes

  • All error messages returned by Sarama (the Kafka client library) carried error code 5, which corresponds to "Leader not found" (Sarama's ErrLeaderNotAvailable); see the sketch below.
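For reference, that code maps onto Sarama's error constants, so error-code-specific handling could look roughly like this (whether the bare KError value surfaces directly depends on the code path, so treat it as illustrative):

```go
package producer // illustrative package name

import (
	"log"

	"github.com/Shopify/sarama"
)

// LogErrors drains an async producer's error channel and calls out Kafka
// error code 5, which Sarama exposes as ErrLeaderNotAvailable.
func LogErrors(p sarama.AsyncProducer) {
	for perr := range p.Errors() {
		if kerr, ok := perr.Err.(sarama.KError); ok && kerr == sarama.ErrLeaderNotAvailable {
			log.Printf("code %d (leader not available) topic=%s", int16(kerr), perr.Msg.Topic)
			continue
		}
		log.Printf("producer error: topic=%s err=%v", perr.Msg.Topic, perr.Err)
	}
}
```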