@markrmiller
Last active October 11, 2021 19:33
High level leader sync

Overseer yuck

  • This type of thing should not even be necessary here, but that's out of scope. A conceptual sketch of what the message actually asks for follows the excerpt below.
  • ShardLeaderElectionContext.java:115
if (zkController.getClusterState().getCollection(collection).getSlice(shardId).getReplicas().size() > 1) {
  // Clear the leader in clusterstate. We only need to worry about this if there is actually more than one replica.
  ZkNodeProps m = new ZkNodeProps(Overseer.QUEUE_OPERATION, OverseerAction.LEADER.toLower(),
      ZkStateReader.SHARD_ID_PROP, shardId, ZkStateReader.COLLECTION_PROP, collection);

  if (distributedClusterStateUpdater.isDistributedStateUpdate()) {
    distributedClusterStateUpdater.doSingleStateUpdate(DistributedClusterStateUpdater.MutatingCommand.SliceSetShardLeader, m,
        zkController.getSolrCloudManager(), zkStateReader);
  } else {
    zkController.getOverseer().getStateUpdateQueue().offer(Utils.toJSON(m));
  }
}
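
For context, all that message asks the state updater to do is clear the slice's leader until the election finishes. Conceptually it boils down to something like this simplified, Map-based sketch (not the real SliceMutator code):

import java.util.HashMap;
import java.util.Map;

// Conceptual sketch only: an illustration of what the SliceSetShardLeader message
// ultimately asks the state updater to do -- drop the "leader" flag from every
// replica in the slice so no replica is marked leader while the election runs.
public class ClearShardLeaderSketch {

  // Each replica is modeled as a simple property map here (an assumption for the sketch).
  static void clearLeader(Map<String, Map<String, Object>> replicasOfSlice) {
    for (Map<String, Object> replicaProps : replicasOfSlice.values()) {
      replicaProps.remove("leader");
    }
  }

  public static void main(String[] args) {
    Map<String, Map<String, Object>> replicas = new HashMap<>();
    Map<String, Object> r1 = new HashMap<>();
    r1.put("leader", "true");
    replicas.put("core_node1", r1);
    replicas.put("core_node2", new HashMap<>());

    clearLeader(replicas);
    System.out.println(replicas); // neither replica carries a leader flag now
  }
}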

Leader Process

  • Here we wait, somewhat pointlessly, for all the replicas in a shard to be involved. Why? At the start we didn't know who should be the leader or who had the most up-to-date data; replicas could be started in a staggered sequence, and you can't let the wrong one become leader if the replica with the most up-to-date data is going to come up 15 seconds later. Why does it still happen currently? No clue. A rough sketch of what the wait amounts to follows the excerpt below.
  • ShardLeaderElectionContext.java:129
if (!weAreReplacement) {
  waitForReplicasToComeUp(leaderVoteWait);
} else {
  areAllReplicasParticipating();
}
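
For readers without the source handy, waitForReplicasToComeUp is essentially a poll-until-everyone-is-live-or-timeout loop. A rough sketch, with hypothetical helper types rather than the real ElectionContext internals:

import java.util.concurrent.TimeUnit;

// Rough sketch of what waitForReplicasToComeUp amounts to. The ShardView helper is a
// hypothetical stand-in; the real method reads live cluster state from ZkStateReader.
public class WaitForReplicasSketch {

  interface ShardView {
    int expectedReplicaCount();   // replicas registered for the shard in cluster state
    int liveReplicaCount();       // replicas whose nodes are currently live
  }

  // Block until every known replica is up, or until leaderVoteWait elapses.
  static void waitForReplicasToComeUp(ShardView shard, long leaderVoteWaitMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(leaderVoteWaitMs);
    while (shard.liveReplicaCount() < shard.expectedReplicaCount()) {
      if (System.nanoTime() > deadline) {
        // give up waiting and let the election proceed with whoever showed up
        return;
      }
      Thread.sleep(500L); // the real code polls in a similar loop
    }
  }
}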

LIR

  • Here we check whether LIR says we should be the leader, and if not ... we wait again? A sketch of what the terms check boils down to follows the excerpt below.
  • ShardLeaderElectionContext.java:153
// should I be leader?
ZkShardTerms zkShardTerms = zkController.getShardTerms(collection, shardId);
if (zkShardTerms.registered(coreNodeName) && !zkShardTerms.canBecomeLeader(coreNodeName)) {
  if (!waitForEligibleBecomeLeaderAfterTimeout(zkShardTerms, coreNodeName, leaderVoteWait)) {
    rejoinLeaderElection(core);
    return;
  } else {
    // only log an error if this replica wins the election
    setTermToMax = true;
  }
}
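
As a reminder of what that canBecomeLeader check amounts to, here's a minimal sketch of the shard-terms idea (simplified; the real ZkShardTerms also tracks a per-replica recovering flag and is backed by a ZK node):

import java.util.Map;

// Sketch of the idea behind ZkShardTerms.canBecomeLeader: a replica is an eligible
// leader only if its term matches the highest term recorded for the shard, i.e. it
// has seen every acknowledged update.
public class ShardTermsSketch {

  private final Map<String, Long> terms; // coreNodeName -> term

  ShardTermsSketch(Map<String, Long> terms) {
    this.terms = terms;
  }

  long highestTerm() {
    return terms.values().stream().mapToLong(Long::longValue).max().orElse(0L);
  }

  boolean canBecomeLeader(String coreNodeName) {
    return terms.getOrDefault(coreNodeName, 0L) == highestTerm();
  }

  public static void main(String[] args) {
    ShardTermsSketch sketch = new ShardTermsSketch(Map.of("core_node1", 3L, "core_node2", 5L));
    System.out.println(sketch.canBecomeLeader("core_node1")); // false: missing updates
    System.out.println(sketch.canBecomeLeader("core_node2")); // true: up to date
  }
}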

Wait to see if a better leader candidate shows up.

  • Why are we polling with this half-second sleep? You should never blind-poll for ZK state; use zkStateReader.waitFor* or some kind of watcher, the same way you would use wait/notify with a lock rather than a blind poll. A sketch of the watcher-style alternative follows the excerpt below.
  • How long should we wait for Joe User to remember they'd better start the leader? Give them 5 minutes? 10? Will they beat the deadline? Do they even know we are waiting on them while the data-loss clock is ticking? Or is the reality that 99.999% of the time the leader either starts within seconds or isn't coming, and the user sits there until the timeout wondering why things are slow, after which things move on with the next-best leader anyway?
  • We know who the leader should be; it's in LIR. If that node, or an equally up-to-date node, is not live in ZK within a very short time, then 99.99% of the time it is not coming. Move on. We also know the next most up-to-date candidates after that.
  • ShardLeaderElectionContext.java:324
private boolean waitForEligibleBecomeLeaderAfterTimeout(ZkShardTerms zkShardTerms, String coreNodeName, int timeout) throws InterruptedException {
  long timeoutAt = System.nanoTime() + TimeUnit.NANOSECONDS.convert(timeout, TimeUnit.MILLISECONDS);
  while (!isClosed && !cc.isShutDown()) {
    if (System.nanoTime() > timeoutAt) {
      log.warn("After waiting for {}ms, no other potential leader was found, {} try to become leader anyway (core_term:{}, highest_term:{})",
          timeout, coreNodeName, zkShardTerms.getTerm(coreNodeName), zkShardTerms.getHighestTerm());
      return true;
    }
    if (replicasWithHigherTermParticipated(zkShardTerms, coreNodeName)) {
      log.info("Can't become leader, other replicas with higher term participated in leader election");
      return false;
    }
    Thread.sleep(500L);
  }
  return false;
}
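
Here is a sketch of the watcher-style alternative argued for above: block on a monitor that a ZK watch (or a ZkStateReader listener) pokes whenever the terms change, instead of sleeping 500ms in a loop. The Terms interface and listener hookup are hypothetical stand-ins, not Solr's actual API:

import java.util.concurrent.TimeUnit;

// Wait-for-a-better-candidate driven by notifications rather than a blind poll.
public class WaitWithWatcherSketch {

  interface Terms {
    boolean replicaWithHigherTermParticipated(String coreNodeName);
  }

  private final Object termsChanged = new Object();

  // A ZK watcher (or ZkStateReader listener) would call this whenever the terms node changes.
  void onTermsChanged() {
    synchronized (termsChanged) {
      termsChanged.notifyAll();
    }
  }

  boolean waitUntilEligibleOrTimeout(Terms terms, String coreNodeName, long timeoutMs)
      throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    synchronized (termsChanged) {
      while (!terms.replicaWithHigherTermParticipated(coreNodeName)) {
        long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
        if (remainingMs <= 0) {
          return true; // nobody better showed up in time; proceed as leader
        }
        termsChanged.wait(remainingMs); // wake on a real change, not a 500ms sleep
      }
    }
    return false; // a more up-to-date replica is participating; stand down
  }
}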

Sync up with replicas

  • We do this every time; on day one we had no choice. Now, I suppose you could consider it a super-cautious safety measure. I think I've kept doing it because I've made it ridiculously efficient, but if LIR is solid, we already know whether we are missing data. We know whether someone with more data is around. If they are around, we have already bailed on becoming leader. If they are not around, we technically still know from LIR whether we have the most up-to-date state and can be leader. We really only have to sync as a precaution at most, perhaps in the case where someone more up to date appears to be live but is not becoming leader. A sketch of that narrower policy follows the excerpt below.
  • ShardLeaderElectionContext.java:188
boolean success = false;
try {
  result = syncStrategy.sync(zkController, core, leaderProps, weAreReplacement);
  success = result.isSuccess();
} catch (Exception e) {
  SolrException.log(log, "Exception while trying to sync", e);
  result = PeerSync.PeerSyncResult.failure();
}
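
A sketch of the narrower policy described above: skip the sync when terms already prove we are up to date, and only fall back to the expensive PeerSync-style exchange in the edge case where somebody more up to date is live but not stepping up. The interfaces here are hypothetical, not Solr's real signatures:

// Only sync as a precaution, and only when terms say it could actually buy something.
public class PrecautionarySyncSketch {

  interface Terms {
    boolean isHighestTerm(String coreNodeName);              // we already have all acked updates
    boolean moreUpToDateReplicaIsLive(String coreNodeName);  // a better candidate exists and is live
  }

  interface Sync {
    boolean syncWithReplicas(); // the expensive PeerSync-style exchange
  }

  static boolean becomeLeader(Terms terms, Sync sync, String coreNodeName) {
    if (terms.isHighestTerm(coreNodeName)) {
      // Terms already prove we are up to date; no sync needed, just take leadership.
      return true;
    }
    if (terms.moreUpToDateReplicaIsLive(coreNodeName)) {
      // Only here does a precautionary sync buy anything: try to pull whatever the
      // better candidate has before reluctantly taking over.
      return sync.syncWithReplicas();
    }
    // Nobody better is around: we are the best remaining candidate by definition.
    return true;
  }
}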

Tell replicas to recover

  • This is the list of replicas that didn't succeed when we asked them to sync with the leader, so we tell them to enter recovery. It comes out of the sync step where the potential leader peer-syncs with each replica and then asks each replica to peer-sync back; if the peer-sync back failed, we'd better put that replica into recovery. A sketch of what the recovery requests amount to follows the excerpt below.
    A very cautious dance with implementation problems, the result of overly cautious hacking from when parts were missing. First, syncing up the shard is only even potentially useful, as a cautious move, when LIR indicates the leader cannot be found or has trouble becoming the leader and you are falling back to the next-best candidate. Even then it's a cautious move to retain every bit of data that might possibly be saved when that data may not exist on more than one replica. It shouldn't be strictly necessary, and a user should never have been given the warm and fuzzy feeling that an update which only ever hit one replica won't be lost. This whole process could be dramatically more stable and efficient, but if you're the practical type: the leader sync itself is an almost always unnecessary cautious move, a holdover from the days before LIR and from attempts to retain every scrap of data even if it was only ack'd by a single replica, in a premature world where a user had no way to know whether their updates reached more than one replica when they got a success. And if LIR actually works, the sync itself should not matter, because LIR tells us which nodes are the most up to date: the most up-to-date replica becomes leader and the other replicas recover against it. The sync is unnecessary, at best super-conservative bug protection for the case where an update hit a single replica and the expectation is that we never lose it if at all possible. The replica sync-back and this put-into-recovery business? Silly and unnecessary.
  • ShardLeaderElectionContext.java: 277
if (log.isInfoEnabled()) {
  log.info("I am the new leader: {} {}", ZkCoreNodeProps.getCoreUrl(leaderProps), shardId);
}

// we made it as leader - send any recovery requests we need to
syncStrategy.requestRecoveries();
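
For completeness, requestRecoveries amounts to iterating the replicas that failed the sync-back and firing a "recover from the leader" request at each. A sketch with hypothetical placeholder types, not the real SyncStrategy API:

import java.util.List;

// For every replica that failed the sync-back, ask it to recover from the new leader.
public class RequestRecoveriesSketch {

  interface Replica {
    String coreUrl();
  }

  interface RecoveryClient {
    // Conceptually a CoreAdmin-style request asking the core to recover from its leader.
    void requestRecovery(String replicaCoreUrl);
  }

  static void requestRecoveries(List<Replica> failedToSync, RecoveryClient client) {
    for (Replica replica : failedToSync) {
      // Each of these replicas could not prove it matches the new leader, so it must
      // replicate or peer-sync from the leader before serving again.
      client.requestRecovery(replica.coreUrl());
    }
  }
}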