While developing my first hashicorp/raft implementation I hit what felt like a stone wall. I could bootstrap one of the nodes, it would win the election, and the Raft observation for it becoming Leader was logged as well, but then nothing happened. I couldn't figure out why all RPCs timed out after that.
I ran the node locally and debugged it with a VS Code launch configuration (which uses Delve). Before bootstrapping there was a "runCandidate" goroutine, but after bootstrapping there was no "runLeader" goroutine. It turned out the leader loop never started! How could this be? Then my mind wandered to the observer that I had added:
ch := make(chan raft.Observation, 1)
r.RegisterObserver(raft.NewObserver(ch, true, func(o *raft.Observation) bool {
	// Types seen in o.Data:
	// *RequestVoteRequest
	// RaftState
	// PeerObservation
	// LeaderObservation
	switch v := o.Data.(type) {
	case raft.RaftState:
		zap.L().Info("raft observation", zap.String("state", v.String()))
	default:
		// Only marshal when the JSON is actually logged.
		data, _ := json.Marshal(o.Data)
		zap.L().Info("raft observation", zap.ByteString("json", data))
	}
	setHealth()
	// Returning true forwards the observation to ch.
	return true
}))
This jewel was added to diagnose the Raft process and to update the server's health check status: if the Raft state becomes Shutdown, the health check fails so that the pod gets terminated.
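However, the argument true means the observer is blocking: once the channel's one-slot buffer is full and nobody reads from it, the next observation blocks the Raft goroutine that emits it, and the node simply stalls! The filter function runs before the send, which would explain why the Leader observation still showed up in the log right before everything stopped.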
So the fix is to do one of the following:
- read from the channel,
- make the observer non-blocking (pass false instead of true), or
- return false from the filter function, so that nothing is sent to the channel.
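For the first option, a goroutine that drains the channel is enough:
go func() {
	// Discard observations so the blocking observer can always send.
	for {
		<-ch
	}
}()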