Larger deployments (more than 15 nodes booting up at once) produce a large number of quarantines / pod restarts that should be unnecessary in a clean deployment environment that isn't experiencing extremely high CPU utilization or bandwidth saturation.
So what could be causing this?
- Default /system messages sent via Akka.Remote / Akka.Cluster are getting dropped during the joining process;
- The failure-detector settings in Akka.Cluster are too sensitive; this creates a large number of false-positive unreachable events, which in turn break TCP connectivity, which creates quarantines;
- A bug in the failure-detector implementation itself that only becomes exposed once the node count greatly exceeds the `nr-monitored-by` value;
- Problems with K8s pod readiness / liveness probes that prematurely cut off service-level DNS support for starting pods; and
- A bug in the Akka.Cluster code that appears to trigger quarantines - as far as I know, there shouldn't be any /system messages sent at startup using van
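If the oversensitive failure detector turns out to be the culprit, one mitigation worth testing is relaxing the phi-accrual detector's tolerance during large rollouts. A minimal HOCON sketch, assuming the standard Akka.NET cluster failure-detector settings; the specific values here are illustrative guesses for a 15+ node boot, not tested recommendations:

```hocon
akka.cluster {
  failure-detector {
    # Default threshold is 8.0; raising it makes the phi-accrual
    # detector slower to declare a busy, still-booting pod unreachable.
    threshold = 12.0

    # Tolerate longer gaps between heartbeats during mass boot-up
    # (default is 3s), at the cost of slower detection of real crashes.
    acceptable-heartbeat-pause = 10s
  }
}
```

The trade-off is that genuinely dead nodes take longer to be marked unreachable, so these values should only be loosened as far as the startup window actually requires.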