This morning, one of my colleagues asked me the following interesting question:
So, I have been wondering about this question (which is not necessarily applicable to Kudu): when certain software says they use a single hot replica for failover, how does it handle confusion caused by transient network failure which breaks the communication between the primary and second ? In other words, how can such configuration guarantee that both nodes won't think of itself as the primary? Do they usually fall back to some third party as an arbitrator ? However, that third party may itself suffer random network partition with one or either of the nodes
This led to my brain-dumping for 20 minutes on Slack, which I then figured I'd copy-paste into a "blog post" gist.