@markrmiller
Last active October 11, 2021 19:33
High level leader sync

Overseer yuck

  • This type of thing should not even be necessary here, but that's out of scope. A conceptual sketch of what the message actually asks for follows the excerpt below.
  • ShardLeaderElectionContext.java:115
if (zkController.getClusterState().getCollection(collection).getSlice(shardId).getReplicas().size() > 1) {
  // Clear the leader in clusterstate. We only need to worry about this if there is actually more than one replica.
  ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION, OverseerAction.LEADER.toLower(),
      ZkStateReader.SHARD_ID_PROP, shardId, ZkStateReader.COLLECTION_PROP, collection);

  if (distributedClusterStateUpdater.isDistributedStateUpdate()) {
    distributedClusterStateUpdater.doSingleStateUpdate(DistributedClusterStateUpdater.MutatingCommand.SliceSetShardLeader, m,
        zkController.getSolrCloudManager(), zkStateReader);
  } else {
    zkController.getOverseer().getStateUpdateQueue().offer(Utils.toJSON(m));
  }
}
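
For context, all that message asks the state updater to do is clear the slice's leader until the election finishes. Conceptually it boils down to something like this simplified, Map-based sketch (not the real SliceMutator code):

import java.util.HashMap;
import java.util.Map;

// Conceptual sketch only: an illustration of what the SliceSetShardLeader message
// ultimately asks the state updater to do -- drop the "leader" flag from every
// replica in the slice so no replica is marked leader while the election runs.
public class ClearShardLeaderSketch {

  // Each replica is modeled as a simple property map here (an assumption for the sketch).
  static void clearLeader(Map<String, Map<String, Object>> replicasOfSlice) {
    for (Map<String, Object> replicaProps : replicasOfSlice.values()) {
      replicaProps.remove("leader");
    }
  }

  public static void main(String[] args) {
    Map<String, Map<String, Object>> replicas = new HashMap<>();
    Map<String, Object> r1 = new HashMap<>();
    r1.put("leader", "true");
    replicas.put("core_node1", r1);
    replicas.put("core_node2", new HashMap<>());

    clearLeader(replicas);
    System.out.println(replicas); // neither replica carries a leader flag now
  }
}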

Leader Process

  • Here we wait, somewhat pointlessly, for all the replicas in a shard to be involved. Why? At the start we didn't know who should be the leader or who had the most up-to-date data; replicas could be started in a staggered sequence, and you can't let the wrong one become leader if the replica with the most up-to-date data is going to come up 15 seconds later. Why does it still happen currently? No clue. A rough sketch of what the wait amounts to follows the excerpt below.
  • ShardLeaderElectionContext.java:129
if (!weAreReplacement) {
  waitForReplicasToComeUp(leaderVoteWait);
} else {
  areAllReplicasParticipating();
}
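
For readers without the source handy, waitForReplicasToComeUp is essentially a poll-until-everyone-is-live-or-timeout loop. A rough sketch, with hypothetical helper types rather than the real ElectionContext internals:

import java.util.concurrent.TimeUnit;

// Rough sketch of what waitForReplicasToComeUp amounts to. The ShardView helper is a
// hypothetical stand-in; the real method reads live cluster state from ZkStateReader.
public class WaitForReplicasSketch {

  interface ShardView {
    int expectedReplicaCount();   // replicas registered for the shard in cluster state
    int liveReplicaCount();       // replicas whose nodes are currently live
  }

  // Block until every known replica is up, or until leaderVoteWait elapses.
  static void waitForReplicasToComeUp(ShardView shard, long leaderVoteWaitMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(leaderVoteWaitMs);
    while (shard.liveReplicaCount() < shard.expectedReplicaCount()) {
      if (System.nanoTime() > deadline) {
        // give up waiting and let the election proceed with whoever showed up
        return;
      }
      Thread.sleep(500L); // the real code polls in a similar loop
    }
  }
}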

LIR

  • Here we check whether LIR says we should be the leader, and if not ... we wait again? A sketch of what the terms check boils down to follows the excerpt below.
  • ShardLeaderElectionContext.java:153
// should I be leader?
ZkShardTerms zkShardTerms = zkController.getShardTerms(collection, shardId);
if (zkShardTerms.registered(coreNodeName) && !zkShardTerms.canBecomeLeader(coreNodeName)) {
  if (!waitForEligibleBecomeLeaderAfterTimeout(zkShardTerms, coreNodeName, leaderVoteWait)) {
    rejoinLeaderElection(core);
    return;
  } else {
    // only log an error if this replica wins the election
    setTermToMax = true;
  }
}
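
As a reminder of what that canBecomeLeader check amounts to, here's a minimal sketch of the shard-terms idea (simplified; the real ZkShardTerms also tracks a per-replica recovering flag and is backed by a ZK node):

import java.util.Map;

// Sketch of the idea behind ZkShardTerms.canBecomeLeader: a replica is an eligible
// leader only if its term matches the highest term recorded for the shard, i.e. it
// has seen every acknowledged update.
public class ShardTermsSketch {

  private final Map<String, Long> terms; // coreNodeName -> term

  ShardTermsSketch(Map<String, Long> terms) {
    this.terms = terms;
  }

  long highestTerm() {
    return terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
  }

  boolean canBecomeLeader(String coreNodeName) {
    return terms.getOrDefault(coreNodeName, 0L) == highestTerm();
  }

  public static void main(String[] args) {
    ShardTermsSketch sketch = new ShardTermsSketch(Map.of("core_node1", 3L, "core_node2", 5L));
    System.out.println(sketch.canBecomeLeader("core_node1")); // false: missing updates
    System.out.println(sketch.canBecomeLeader("core_node2")); // true: up to date
  }
}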

Wait to see if a better leader candidate shows up.

  • Why are we polling with this half-second sleep? You should never blind-poll for ZK state; use zkStateReader.waitFor* or some kind of watcher, the same way you would use wait/notify with a lock rather than a blind poll. A sketch of the watcher-style alternative follows the excerpt below.
  • How long should we wait for Joe User to remember they'd better start the leader? Give them 5 minutes? 10? Will they beat the deadline? Do they even know we are waiting on them while the data-loss clock is ticking? Or is the reality that 99.999% of the time the leader either starts within seconds or isn't coming, and the user sits there until the timeout wondering why things are slow, after which things move on with the next-best leader anyway?
  • We know who the leader should be; it's in LIR. If that node, or an equally up-to-date node, is not live in ZK within a very short time, then 99.99% of the time it is not coming. Move on. We also know the next most up-to-date candidates after that.
  • ShardLeaderElectionContext.java:324
private boolean waitForEligibleBecomeLeaderAfterTimeout(ZkShardTerms zkShardTerms, String coreNodeName, int timeout) throws InterruptedException {
  long timeoutAt = System.nanoTime() + TimeUnit.NANOSECONDS.convert(timeout, TimeUnit.MILLISECONDS);
  while (!isClosed && !cc.isShutDown()) {
    if (System.nanoTime() > timeoutAt) {
      log.warn("After waiting for {}ms, no other potential leader was found, {} try to become leader anyway (core_term:{}, highest_term:{})",
          timeout, coreNodeName, zkShardTerms.getTerm(coreNodeName), zkShardTerms.getHighestTerm());
      return true;
    }
    if (replicasWithHigherTermParticipated(zkShardTerms, coreNodeName)) {
      log.info("Can't become leader, other replicas with higher term participated in leader election");
      return false;
    }
    Thread.sleep(500L);
  }
  return false;
}
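
Here is a sketch of the watcher-style alternative argued for above: block on a monitor that a ZK watch (or a ZkStateReader listener) pokes whenever the terms change, instead of sleeping 500ms in a loop. The Terms interface and listener hookup are hypothetical stand-ins, not Solr's actual API:

import java.util.concurrent.TimeUnit;

// Wait-for-a-better-candidate driven by notifications rather than a blind poll.
public class WaitWithWatcherSketch {

  interface Terms {
    boolean replicaWithHigherTermParticipated(String coreNodeName);
  }

  private final Object termsChanged = new Object();

  // A ZK watcher (or ZkStateReader listener) would call this whenever the terms node changes.
  void onTermsChanged() {
    synchronized (termsChanged) {
      termsChanged.notifyAll();
    }
  }

  boolean waitUntilEligibleOrTimeout(Terms terms, String coreNodeName, long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    synchronized (termsChanged) {
      while (!terms.replicaWithHigherTermParticipated(coreNodeName)) {
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
        if (remainingMs <= 0) {
          return true; // nobody better showed up in time; proceed as leader
        }
        termsChanged.wait(remainingMs); // wake on a real change, not a 500ms sleep
      }
    }
    return false; // a more up-to-date replica is participating; stand down
  }
}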

Sync up with replicas

  • We do this every time; on day one we had no choice. Now, I suppose you could consider it a super-cautious safety measure. I think I've kept doing it because I've made it ridiculously efficient, but if LIR is solid, we already know whether we are missing data. We know whether someone with more data is around. If they are around, we have already bailed on becoming leader. If they are not around, we technically still know from LIR whether we have the most up-to-date state and can be leader. We really only have to sync as a precaution at most, perhaps in the case where someone more up to date appears to be live but is not becoming leader. A sketch of that narrower policy follows the excerpt below.
  • ShardLeaderElectionContext.java:188
boolean success = false;
try {
  result = syncStrategy.sync(zkController, core, leaderProps, weAreReplacement);
  success = result.isSuccess();
} catch (Exception e) {
  SolrException.log(log, "Exception while trying to sync", e);
  result = PeerSync.PeerSyncResult.failure();
}
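
A sketch of the narrower policy described above: skip the sync when terms already prove we are up to date, and only fall back to the expensive PeerSync-style exchange in the edge case where somebody more up to date is live but not stepping up. The interfaces here are hypothetical, not Solr's real signatures:

// Only sync as a precaution, and only when terms say it could actually buy something.
public class PrecautionarySyncSketch {

  interface Terms {
    boolean isHighestTerm(String coreNodeName);              // we already have all acked updates
    boolean moreUpToDateReplicaIsLive(String coreNodeName);  // a better candidate exists and is live
  }

  interface Sync {
    boolean syncWithReplicas(); // the expensive PeerSync-style exchange
  }

  static boolean becomeLeader(Terms terms, Sync sync, String coreNodeName) {
    if (terms.isHighestTerm(coreNodeName)) {
      // Terms already prove we are up to date; no sync needed, just take leadership.
      return true;
    }
    if (terms.moreUpToDateReplicaIsLive(coreNodeName)) {
      // Only here does a precautionary sync buy anything: try to pull whatever the
      // better candidate has before reluctantly taking over.
      return sync.syncWithReplicas();
    }
    // Nobody better is around: we are the best remaining candidate by definition.
    return true;
  }
}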

Tell replicas to recover

  • This is the list of replicas that didn't succeed when we asked them to sync with the leader, so we tell them to enter recovery. It comes out of the sync step where the potential leader peer-syncs with each replica and then asks each replica to peer-sync back; if the peer-sync back failed, we'd better put that replica into recovery. A sketch of what the recovery requests amount to follows the excerpt below.
    A very cautious dance with implementation problems, the result of overly cautious hacking from when parts were missing. First, syncing up the shard is only even potentially useful, as a cautious move, when LIR indicates the leader cannot be found or has trouble becoming the leader and you are falling back to the next-best candidate. Even then it's a cautious move to retain every bit of data that might possibly be saved when that data may not exist on more than one replica. It shouldn't be strictly necessary, and a user should never have been given the warm and fuzzy feeling that an update which only ever hit one replica won't be lost. This whole process could be dramatically more stable and efficient, but if you're the practical type: the leader sync itself is an almost always unnecessary cautious move, a holdover from the days before LIR and from attempts to retain every scrap of data even if it was only ack'd by a single replica, in a premature world where a user had no way to know whether their updates reached more than one replica when they got a success. And if LIR actually works, the sync itself should not matter, because LIR tells us which nodes are the most up to date: the most up-to-date replica becomes leader and the other replicas recover against it. The sync is unnecessary, at best super-conservative bug protection for the case where an update hit a single replica and the expectation is that we never lose it if at all possible. The replica sync-back and this put-into-recovery business? Silly and unnecessary.
  • ShardLeaderElectionContext.java: 277
if (log.isInfoEnabled()) {
  log.info("I am the new leader: {} {}", ZkCoreNodeProps.getCoreUrl(leaderProps), shardId);
}

// we made it as leader - send any recovery requests we need to
syncStrategy.requestRecoveries();
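
For completeness, requestRecoveries amounts to iterating the replicas that failed the sync-back and firing a "recover from the leader" request at each. A sketch with hypothetical placeholder types, not the real SyncStrategy API:

import java.util.List;

// For every replica that failed the sync-back, ask it to recover from the new leader.
public class RequestRecoveriesSketch {

  interface Replica {
    String coreUrl();
  }

  interface RecoveryClient {
    // Conceptually a CoreAdmin-style request asking the core to recover from its leader.
    void requestRecovery(String replicaCoreUrl);
  }

  static void requestRecoveries(List<Replica> failedToSync, RecoveryClient client) {
    for (Replica replica : failedToSync) {
      // Each of these replicas could not prove it matches the new leader, so it must
      // replicate or peer-sync from the leader before serving again.
      client.requestRecovery(replica.coreUrl());
    }
  }
}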