Larger deployments (more than 15 nodes booting up at once) create a large number of quarantines / pod restarts that should be unnecessary in a clean deployment environment that isn't experiencing extremely high CPU utilization or bandwidth saturation.
So what could be causing this?
- Default /system messages sent over Akka.Remote / Akka.Cluster are getting dropped during the joining process;
- The failure-detector settings in Akka.Cluster are too sensitive; this creates a large number of false-positive unreachable events, which in turn breaks TCP connectivity, which creates quarantines (the relevant settings are sketched after this list);
- Bug in the failure-detector implementation itself that only becomes exposed once the node count greatly exceeds the nr-monitored-by values;
- Could be problems with K8s pod readiness / liveness probes that prematurely cut out service-level DNS support for starting pods (see the hypothetical Service manifest sketched after this list); and
- Bug in the Akka.Cluster code that appears to trigger quarantines - as far as I know, there shouldn't be any /system messages sent at startup using vanilla Akka.Remote / Akka.Cluster.
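
For reference, these are the Akka.Cluster failure-detector knobs the second and third bullets refer to. This is only a sketch of where that tuning lives, not a recommended configuration - the values below are illustrative placeholders, and the shipped reference.conf is the source of truth for the actual defaults:

```hocon
akka.cluster.failure-detector {
  heartbeat-interval = 1s            # how often each node heartbeats the peers it monitors
  threshold = 8.0                    # phi accrual threshold; lower values = more sensitive = more false unreachables
  acceptable-heartbeat-pause = 3s    # margin for missed / late heartbeats before phi climbs past the threshold
  monitored-by-nr-of-members = 9     # the "nr-monitored-by" cap mentioned above: how many peers heartbeat-monitor each node
}
```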
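
On the readiness-probe theory: if pods discover each other through a headless service, Kubernetes withholds the DNS records for pods that haven't passed their readiness probe yet, which can starve a forming cluster of peers. A hypothetical manifest showing the usual mitigation (the service name, labels, and port are made up for illustration):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: akka-cluster                # hypothetical service name
spec:
  clusterIP: None                   # headless service used for peer discovery
  publishNotReadyAddresses: true    # publish pod DNS records even before readiness probes pass
  selector:
    app: akka-cluster               # hypothetical pod label
  ports:
    - name: akka-remote
      port: 4053                    # hypothetical Akka.Remote port
```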
Things that should be fixed anyway:
- Need to clarify Akka.Remote disassociation reasons (e.g. TCP connection refused, TCP connection blown up, failure detector hit, etc.) - they are extremely cryptic right now;
- Add Unreachability reasons - this might be something we can clean up via logging (a minimal subscription sketch follows below) - ideally we shouldn't have to touch the gossip.
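
Related to that last item: until richer reasons are baked in, the information that is available today can at least be surfaced by subscribing to the cluster event stream. A minimal sketch, assuming the standard Akka.NET ClusterEvent subscription API (the actor name and log wording are made up):

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

// Hypothetical listener actor that logs reachability changes as they are observed.
public class ReachabilityLogger : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public ReachabilityLogger()
    {
        Receive<ClusterEvent.UnreachableMember>(e =>
            _log.Warning("Member {0} marked unreachable (roles: [{1}])",
                e.Member.Address, string.Join(",", e.Member.Roles)));

        Receive<ClusterEvent.ReachableMember>(e =>
            _log.Info("Member {0} is reachable again", e.Member.Address));
    }

    protected override void PreStart()
    {
        // InitialStateAsEvents replays the current cluster state as individual events on subscribe.
        Cluster.Get(Context.System).Subscribe(
            Self,
            ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.UnreachableMember),
            typeof(ClusterEvent.ReachableMember));
    }

    protected override void PostStop()
    {
        Cluster.Get(Context.System).Unsubscribe(Self);
    }
}
```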