An interesting question

This morning, one of my colleagues asked me the following interesting question:

So, I have been wondering about this question (which is not necessarily applicable to Kudu): when certain software says it uses a single hot replica for failover, how does it handle the confusion caused by a transient network failure which breaks the communication between the primary and the secondary? In other words, how can such a configuration guarantee that both nodes won't think of themselves as the primary? Do they usually fall back to some third party as an arbitrator? However, that third party may itself suffer a random network partition with one or both of the nodes.

This led to my brain-dumping for 20 minutes on Slack, which I then figured I'd copy-paste into a "blog post" gist.

My brain-dump

Yeah, that's a general problem -- split brain. Generally I think there are a couple of approaches:

(1) cross your fingers that it doesn't happen

If it does, provide some way for the administrator to investigate which transactions/operations might have happened on both sides of the partition and manually reconcile/adjust. For example, if you can tell from the logs that the standby took over at txnid 12345, you can dump WALs starting at that transaction from both sides, and do a bunch of manual cleanup. This might be reasonable for low-throughput or low-criticality services.
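To make that reconciliation step a bit more concrete, here's a minimal sketch that diffs two WAL dumps starting at the takeover point. The `txnid<TAB>payload` line format is purely hypothetical -- substitute whatever your WAL dump tool actually emits:

```python
# Hedged sketch: compare WAL dumps from both sides of the partition,
# starting at the transaction where the standby took over.
def load_wal(path, start_txn):
    """Return {txnid: payload} for every transaction >= start_txn."""
    txns = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            txnid, _, payload = line.rstrip("\n").partition("\t")
            txnid = int(txnid)
            if txnid >= start_txn:
                txns[txnid] = payload
    return txns

def find_divergence(old_active_wal, new_active_wal, takeover_txn):
    a = load_wal(old_active_wal, takeover_txn)
    b = load_wal(new_active_wal, takeover_txn)
    for txnid in sorted(set(a) | set(b)):
        if a.get(txnid) != b.get(txnid):
            print(f"txn {txnid} differs: old={a.get(txnid)!r} new={b.get(txnid)!r}")

# Example (hypothetical file names):
# find_divergence("old_active.wal", "new_active.wal", takeover_txn=12345)
```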

(2) resource fencing

e.g. if you have two DB servers, you can have them both mount a SAN device which supports a fence operation. The fence operation says "from this point on, remove access from the other guy". If the DB itself is on that SAN, that would prevent the old active from writing any operations.
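Here's a minimal sketch of that ordering. `SanVolume` and its `fence()`/`write()` calls are hypothetical stand-ins for whatever your shared storage actually exposes; the point is simply that the new active fences first and only then starts writing:

```python
# Hedged sketch of the fence-before-takeover pattern (hypothetical API).
class SanVolume:
    def fence(self, old_owner):   # hypothetical: revoke old_owner's access
        raise NotImplementedError
    def write(self, record):      # hypothetical: append a record
        raise NotImplementedError

def become_active(volume: SanVolume, old_owner: str, pending_writes):
    volume.fence(old_owner)   # from this point on, the old active can't write
    for record in pending_writes:
        volume.write(record)  # safe: only the new active can reach the volume
```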

For preventing stale reads you either rely on a time-based lease, or make reads actually write to the shared resource.
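A hedged sketch of the time-based-lease option -- the lease length and the `Replica` class are made up, and in a real system the renewal would be granted by the shared resource or arbitrator rather than by the node itself:

```python
import time

LEASE_SECONDS = 10  # assumed lease length; tune against your failover timeout

class Replica:
    def __init__(self):
        self.lease_expiry = 0.0

    def renew_lease(self):
        # In practice this renewal comes from the arbitrator, not the node.
        self.lease_expiry = time.monotonic() + LEASE_SECONDS

    def serve_read(self, key, data):
        if time.monotonic() >= self.lease_expiry:
            raise RuntimeError("lease expired: refusing a possibly-stale read")
        return data[key]
```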

Note that the "resource" here can itself be implemented by a more robust system that avoids split brain. This is the solution we use in HDFS, for example -- the QuorumJournalManager nodes assign writers an epoch number, and once a writer with epoch N starts writing, we prohibit writes from anyone with epoch < N.
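In the same spirit (this is a toy sketch, not the actual QuorumJournalManager code), the journal side of epoch fencing looks roughly like:

```python
# Hedged sketch of epoch-based write fencing.
class Journal:
    def __init__(self):
        self.promised_epoch = 0
        self.entries = []

    def new_epoch(self, epoch):
        """A writer asks to start writing with `epoch`."""
        if epoch <= self.promised_epoch:
            raise PermissionError(f"epoch {epoch} <= promised {self.promised_epoch}")
        self.promised_epoch = epoch

    def append(self, epoch, entry):
        """Writes tagged with a fenced (older) epoch are rejected."""
        if epoch < self.promised_epoch:
            raise PermissionError(f"stale writer: epoch {epoch} has been fenced")
        self.entries.append((epoch, entry))
```

A stale active that was partitioned away still carries its old epoch, so its appends start failing as soon as the new active has bumped `promised_epoch`.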

Your idea of a "third party as arbitrator" is more or less the "resource fencing" approach -- e.g. you can use ZK as that third party, but it can be a little dicey unless you put ZK itself in the path of every operation.

(3) STONITH (shoot the other node in the head)

When a node becomes active, it has a record of the previous active, and it uses some external functionality to ensure that the old node is "shot in the head". For example, you could ssh into the node and kill a daemon, or you could use IPMI or something to power-cycle the machine. In the cloud, you could ask EC2 to terminate the instance, etc.
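A hedged sketch of such a STONITH hook, run by the node that is about to become active. The hostnames, daemon name, BMC address, and instance id are all hypothetical -- use whichever fencing mechanism your environment actually supports:

```python
import subprocess

def shoot_old_active(host, method="ssh"):
    if method == "ssh":
        # Kill the old daemon over ssh (assumes passwordless ssh; daemon name is made up).
        cmd = ["ssh", host, "pkill", "-f", "mydaemon"]
    elif method == "ipmi":
        # Power-cycle the machine via its BMC (assumes ipmitool is set up; "-bmc" naming is made up).
        cmd = ["ipmitool", "-H", f"{host}-bmc", "chassis", "power", "cycle"]
    elif method == "ec2":
        # Terminate the instance entirely (assumes the AWS CLI; instance id is a placeholder).
        cmd = ["aws", "ec2", "terminate-instances", "--instance-ids", "i-0123456789abcdef0"]
    else:
        raise ValueError(method)
    subprocess.run(cmd, check=True)  # only proceed to become active if fencing succeeded
```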

Before we implemented QuorumJournalManager in HDFS, we used this approach, along with putting the edit log itself on an NFS mount.

Note: sometimes the "external functionality" can be "tell an admin to shoot the other node".

General thoughts

On a more general note, I think it's always worth looking at the operations the system supports and being realistic about what the actual required guarantees are. Is it OK if there are two nodes that think they are active for some window of time? Perhaps it's OK to serve stale reads for some amount of time but not to commit writes? Or there might be some types of writes which would be OK to lose and others that need stronger guarantees, etc.

For example:

  • the HDFS NN can serve stale reads for some number of seconds before it realizes it has been superseded
  • the Kudu master can similarly perform various cluster operations like re-replicating out-of-date replicas, and we've just made sure that the operations themselves are robust to conflicts. If you have two active masters at the same time, they won't conflict in a way that produces system-wide issues like lost data. You might have some weirdness where things get out of balance, or maybe even flapping, like one node trying to add a replica while the other thinks it's not necessary. But if you assume that the split brain is temporary in nature, then those kinds of transient "weirdnesses" are fine so long as they don't introduce outages/data loss/whatever

I think Impala admission control is a good example -- if you have multiple coordinators, they can over-admit because they aren't aware of the most recent admissions of the other ones. If we moved to a "single admission controller with failover", it would probably still be fine to have split brain regress back to our current state, so long as we thought it was relatively temporary in nature.

Again following that example, this is one of those cases where the "resource itself" (the impalad executors) can be in charge of enforcing single-master. When an admission controller becomes active, it can be assigned an epoch number, and executors can enforce monotonicity when the actual fragment is executed. If an impalad gets assigned work by the admission controller with epoch=10, then sees a request that was scheduled by epoch 9, it could probably reject it and force it to retry.
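The executor-side check might look something like the sketch below -- the class and method names are made up for illustration, not Impala's actual code:

```python
# Hedged sketch of executors enforcing epoch monotonicity on admitted work.
class Executor:
    def __init__(self):
        self.highest_epoch_seen = 0

    def execute_fragment(self, admission_epoch, fragment):
        if admission_epoch < self.highest_epoch_seen:
            # Scheduled by a superseded admission controller: reject, so the
            # request is retried through the current one.
            raise RuntimeError(f"rejecting fragment from stale epoch {admission_epoch}")
        self.highest_epoch_seen = max(self.highest_epoch_seen, admission_epoch)
        fragment()  # run the actual work
```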

The datanodes in HDFS do something like this when processing commands like the deletion/replication work assigned by NNs -- once a datanode has handled a command from an NN with epoch N, it won't listen to any commands from an NN with an earlier epoch.

... and that's all I know about HA.
