@Aaronontheweb
Created March 29, 2021 14:15
Akka.Cluster troubleshooting issues

Larger deployments (more than 15 nodes booting up at once) produce a large number of quarantines / pod reboots. These should be unnecessary in a clean deployment environment that isn't experiencing extremely high CPU utilization or bandwidth saturation.

So what could be causing this?

  1. Default /system messages into Akka.Remote / Akka.Cluster are getting dropped during the joining process;
  2. The failure-detector settings in Akka.Cluster are too sensitive. This creates a large number of false-positive unreachable events, which in turn break TCP connectivity, which creates quarantines;
  3. Bug in the failure-detector implementation itself that only becomes exposed once the node count greatly exceeds the monitored-by-nr-of-members value;
  4. Problems with K8s pod readiness / liveness probes that prematurely cut off service-level DNS support for starting pods; and
  5. Bug in the Akka.Cluster code that appears to trigger quarantines - as far as I know, there shouldn't be any /system messages sent at startup using vanilla Akka.Remote / Akka.Cluster.

Things that should be fixed anyway:

  1. Need to clarify Akka.Remote disassociation reasons (e.g. TCP connection refused, TCP connection blown up, failure detector hit, etc.) - they are extremely cryptic right now;
  2. Add Unreachability reasons - this might be something we can clean up via logging - ideally we shouldn't have to touch the gossip.
@wesselkranenborg
wesselkranenborg commented Mar 29, 2021

akka {
    remote {
        dot-netty.tcp {
            maximum-frame-size = 2MB
            send-buffer-size = 2MB
            receive-buffer-size = 2MB
        }
    }
    cluster {
        failure-detector {
            # How often keep-alive heartbeat messages should be sent to each connection.
            heartbeat-interval = 1 s # Recommendation is to keep this low but increase the 'acceptable-heartbeat-pause' so that you get more retries before it fails.

            # Defines the failure detector threshold.
            # A low threshold is prone to generate many wrong suspicions but ensures
            # a quick detection in the event of a real crash. Conversely, a high
            # threshold generates fewer mistakes but needs more time to detect
            # actual crashes.
            threshold = 16 #Default: 8.0

            # Number of the samples of inter-heartbeat arrival times to adaptively
            # calculate the failure timeout for connections.
            max-sample-size = 1000

            # Minimum standard deviation to use for the normal distribution in
            # AccrualFailureDetector. Too low standard deviation might result in
            # too much sensitivity for sudden, but normal, deviations in heartbeat
            # inter arrival times.
            min-std-deviation = 100 ms

            # Number of potentially lost/delayed heartbeats that will be
            # accepted before considering it to be an anomaly.
            # This margin is important to be able to survive sudden, occasional
            # pauses in heartbeat arrivals, due to, for example, garbage collection
            # or network drops.
            acceptable-heartbeat-pause = 30 s #Default: 3 s

            # Number of member nodes that each member will send heartbeat messages to,
            # i.e. each node will be monitored by this number of other nodes.
            monitored-by-nr-of-members = 9 #Default: 9

            # After the heartbeat request has been sent the first failure detection
            # will start after this period, even though no heartbeat message has
            # been received.
            expected-response-after = 15 s #Default: 1 s
        }
    }
}
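
To get a feel for how much slack these values buy compared to the defaults, here is a rough sketch of the phi accrual calculation in Python. It assumes the logistic approximation of the normal CDF used in Akka's accrual failure detector, and it assumes perfectly regular heartbeats, so the `min-std-deviation` floor is the effective standard deviation. The function names are made up for illustration and this is not the Akka.NET code, so check it against the actual source before relying on the numbers.

```python
import math


def phi(time_since_last_ms, mean_ms, std_dev_ms):
    """Suspicion level for a given silence, using a logistic
    approximation of the normal CDF (assumed to match Akka's)."""
    y = (time_since_last_ms - mean_ms) / std_dev_ms
    exponent = -y * (1.5976 + 0.070566 * y * y)
    try:
        e = math.exp(exponent)
    except OverflowError:
        # Silence is far below the mean: suspicion is effectively zero.
        e = math.inf
    if time_since_last_ms > mean_ms:
        if e == 0.0:
            return math.inf  # exp() underflowed: suspicion is unbounded
        return -math.log10(e / (1.0 + e))
    return -math.log10(1.0 - 1.0 / (1.0 + e))


def silence_until_unreachable_ms(threshold, heartbeat_interval_ms,
                                 acceptable_pause_ms, min_std_dev_ms,
                                 step_ms=10):
    """Hypothetical helper: how long a node can stay silent before phi
    crosses the threshold. acceptable-heartbeat-pause is added to the
    observed mean inter-arrival time."""
    mean_ms = heartbeat_interval_ms + acceptable_pause_ms
    t = 0
    while phi(t, mean_ms, min_std_dev_ms) < threshold:
        t += step_ms
    return t


# Defaults quoted in the config comments: threshold 8, pause 3 s.
default_ms = silence_until_unreachable_ms(8.0, 1000, 3000, 100)
# Values from the config above: threshold 16, pause 30 s.
tuned_ms = silence_until_unreachable_ms(16.0, 1000, 30000, 100)
print(default_ms, tuned_ms)
```

Under these assumptions, almost all of the extra tolerance comes from `acceptable-heartbeat-pause`, since it shifts the mean directly. Doubling `threshold` only adds a few multiples of the standard deviation on top of that.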