Larger deployments (more than 15 nodes booting up at once) create a large number of quarantines / pod restarts that should be unnecessary in a clean deployment environment that isn't experiencing extremely high CPU utilization or bandwidth saturation.
So what could be causing this?
- Default /system messages sent over Akka.Remote / Akka.Cluster are getting dropped during the joining process;
- The failure-detector settings in Akka.Cluster are too sensitive; this creates a large number of false-positive unreachable events, which in turn breaks TCP connectivity, which creates quarantines (the relevant settings are sketched after this list);
- Bug in the failure-detector implementation itself that only becomes exposed once the node count greatly exceeds the nr-monitored-by values;
- Could be problems with K8s pod readiness / liveness probes that prematurely cut out service-level DNS support for starting pods (see the hypothetical Service manifest sketched after this list); and
- Bug in the Akka.Cluster code that appears to trigger quarantines - as far as I know, there shouldn't be any /system messages sent at startup using vanilla Akka.Remote / Akka.Cluster.
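
For reference, these are the Akka.Cluster failure-detector knobs the second and third bullets refer to. This is only a sketch of where that tuning lives, not a recommended configuration - the values below are illustrative placeholders, and the shipped reference.conf is the source of truth for the actual defaults:

```hocon
akka.cluster.failure-detector {
  heartbeat-interval = 1s            # how often each node heartbeats the peers it monitors
  threshold = 8.0                    # phi accrual threshold; lower values = more sensitive = more false unreachables
  acceptable-heartbeat-pause = 3s    # margin for missed / late heartbeats before phi climbs past the threshold
  monitored-by-nr-of-members = 9     # the "nr-monitored-by" cap mentioned above: how many peers heartbeat-monitor each node
}
```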
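
On the readiness-probe theory: if pods discover each other through a headless service, Kubernetes withholds the DNS records for pods that haven't passed their readiness probe yet, which can starve a forming cluster of peers. A hypothetical manifest showing the usual mitigation (the service name, labels, and port are made up for illustration):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: akka-cluster                # hypothetical service name
spec:
  clusterIP: None                   # headless service used for peer discovery
  publishNotReadyAddresses: true    # publish pod DNS records even before readiness probes pass
  selector:
    app: akka-cluster               # hypothetical pod label
  ports:
    - name: akka-remote
      port: 4053                    # hypothetical Akka.Remote port
```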
Things that should be fixed anyway:
- Need to clarify Akka.Remote disassociation reasons (e.g. TCP connection refused, TCP connection blown up, failure detector hit, etc.) - they are extremely cryptic right now;
- Add Unreachability reasons - this might be something we can clean up via logging (a minimal subscription sketch follows below) - ideally we shouldn't have to touch the gossip.
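
Related to that last item: until richer reasons are baked in, the information that is available today can at least be surfaced by subscribing to the cluster event stream. A minimal sketch, assuming the standard Akka.NET ClusterEvent subscription API (the actor name and log wording are made up):

```csharp
using Akka.Actor;
using Akka.Cluster;
using Akka.Event;

// Hypothetical listener actor that logs reachability changes as they are observed.
public class ReachabilityLogger : ReceiveActor
{
    private readonly ILoggingAdapter _log = Context.GetLogger();

    public ReachabilityLogger()
    {
        Receive<ClusterEvent.UnreachableMember>(e =>
            _log.Warning("Member {0} marked unreachable (roles: [{1}])",
                e.Member.Address, string.Join(",", e.Member.Roles)));

        Receive<ClusterEvent.ReachableMember>(e =>
            _log.Info("Member {0} is reachable again", e.Member.Address));
    }

    protected override void PreStart()
    {
        // InitialStateAsEvents replays the current cluster state as individual events on subscribe.
        Cluster.Get(Context.System).Subscribe(
            Self,
            ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.UnreachableMember),
            typeof(ClusterEvent.ReachableMember));
    }

    protected override void PostStop()
    {
        Cluster.Get(Context.System).Unsubscribe(Self);
    }
}
```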