Consul: 0.7.2
You may have crashed your cluster so that all Consul servers have been offline at some point. You may be running on Kubernetes. The default 96 hours haven't passed, so no Consul servers have been reaped. Restarting everything doesn't work. You've read this issue five times over and nothing works. To make it harder still, you're running a StatefulSet on Kubernetes, so you need to `kubectl delete pods/consul-1` to make changed container arguments (`kubectl replace -f consul/consul.yml`) take effect. On top of that, if you `kubectl exec -it consul-1` and then `kill -9 5`, Kubernetes puts the pod into a crash loop with exponential backoff, eating into your time.
Sounds like a Friday pleasure, right?
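A quick way to watch that restart dance before committing to anything, assuming the pod names above and a busybox-style `ps` in the Consul image (both assumptions are mine):

```
# Find the actual consul child PID instead of trusting that it's 5:
kubectl exec consul-1 -- ps

# Kill it from outside the pod (this burns one of the pod's restarts):
kubectl exec consul-1 -- kill -9 <pid>

# Watch the restart counter climb and the CrashLoopBackOff kick in:
kubectl get pods -w
kubectl describe pod consul-1   # check "Last State" and the Events section
```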
The tools you have at your disposal are:
- `-bootstrap`
- `kubectl replace -f consul/consul.yml`
- `-bootstrap-expect=3`
- `consul force-leave <node>`
- `consul operator raft -remove-peer -address=<ip>:8300`
- `kubectl delete pods/consul-<number>`
- `kubectl logs consul-<number> -f`
- `consul members`
- `echo '["<ip1>:8300", "<ip2>:8300", "<ip3>:8300"]' >/consul/data/raft/peers.json`
It's a planning game with a time element: you need to get one of the machines to take leadership.
1. Add `-bootstrap` to the args of the container.
2. `kubectl replace -f consul/consul.yml` to update the spec that the pods will be rescheduled with.
3. `kubectl delete pods/consul-0 pods/consul-1 pods/consul-2` to make them all restart.
4. They'll complain that they can't connect to the old IPs.
5. Run `echo '["<ip1>:8300", "<ip2>:8300", "<ip3>:8300"]' >/consul/data/raft/peers.json` on one of the nodes (a way to drive steps 5-7 from outside the pod is sketched after this list).
6. Use up your first restart on that node, with `kill -9 5`, where 5 is the child PID of consul (under the docker process).
7. It'll come up and start leader election.
8. Do the same (5-7) for consul-1 and consul-2.
9. They should now all complain that they're running in bootstrap mode.
10. Some of them will try to contact old nodes. Use `consul operator raft -remove-peer -address=<ip>:8300` on those nodes to make them reconsider (`force-leave` doesn't work here, since it's a graceful leave and the old machine is gone).
11. Now the only complaints left are about the bootstrap flag. One machine is leader. Don't touch that machine.
12. Edit your consul.yml file, removing `-bootstrap`.
13. Delete the two non-leader pods: `kubectl delete pods/consul-<number>`.
14. Wait until they start again. Use `consul operator raft -remove-peer` to remove the old IPs.
15. Verify with `consul members` on the leader – this only lists serf members, not raft members (a raft-level check is sketched at the end of this post).
16. You should now have a three-node cluster back up, without removing any folders.
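For steps 5-7 you don't have to sit in an interactive shell inside the pod. Here's the same sequence driven from outside, a sketch that assumes a POSIX `sh` in the Consul image and the same hard-coded child PID of 5 (verify both in your own container first):

```
POD=consul-0
PEERS='["<ip1>:8300", "<ip2>:8300", "<ip3>:8300"]'   # fill in the real pod IPs

# Step 5: write the raft peer list.
kubectl exec "$POD" -- sh -c "echo '$PEERS' > /consul/data/raft/peers.json"

# Step 6: use up the restart (5 = the consul child PID in this setup).
kubectl exec "$POD" -- kill -9 5

# Step 7: watch it come back up and start an election.
kubectl logs "$POD" -f
```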
It's a planning game.
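One last check. `consul members` only shows the serf view; the same 0.7 `operator raft` command used above should also be able to list the raft peer set, which is the thing you actually rewrote via peers.json:

```
# Serf view (what step 15 checks), run against the leader pod:
kubectl exec consul-0 -- consul members

# Raft view: the peer set rewritten via peers.json:
kubectl exec consul-0 -- consul operator raft -list-peers
```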