Our company uses Kubernetes to deploy contentworkshop.learningequality.org. However, when we deployed our new code, the deployment stalled, and one of the pods was stuck with a NodeLost status. Uh oh!
I checked the status of the nodes by running kubectl get nodes:
NAME                                                  STATUS     AGE   VERSION
gke-contentworkshop-cent-default-pool-827dd3f8-4h8z   Ready      66d   v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-vs6s   NotReady   66d   v1.6.2
Indeed, it looks like one of our nodes has ended up in a weird state and is not reporting anything to the Kubernetes master.
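Before touching anything, it's worth digging a bit deeper: kubectl describe shows the node's conditions (Ready, OutOfDisk, MemoryPressure, ...) and its recent events, which usually hints at whether the kubelet simply stopped reporting or the VM itself died:

$ kubectl describe node gke-contentworkshop-cent-default-pool-827dd3f8-vs6s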
I first start by running kubectl cordon
on the node in question, which tells Kubernetes not to schedule anything new onto that node. (Note that cordon by itself only marks the node unschedulable; evicting the pods that are already there is the job of kubectl drain, sketched below.)
$ kubectl cordon gke-contentworkshop-cent-default-pool-827dd3f8-vs6s
node "gke-contentworkshop-cent-default-pool-827dd3f8-vs6s" cordoned
In the meantime, I also increase the number of nodes in our cluster through the Google Cloud interface. With that and the previous cordon
run, we now have this state:
$ kubectl get nodes
NAME                                                  STATUS                        AGE   VERSION
gke-contentworkshop-cent-default-pool-827dd3f8-4h8z   Ready                         66d   v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-jkw4   Ready                         1m    v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-vs6s   NotReady,SchedulingDisabled   66d   v1.6.2
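(As an aside, the resize can also be scripted instead of clicked through the console; a sketch, where the cluster name, node pool, and zone are assumptions inferred from the node names above:

$ gcloud container clusters resize contentworkshop-cent \
    --node-pool default-pool --size 3 --zone us-central1-a

Check gcloud container clusters list for the real names before running this.)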
However, weirdly enough, some of my pods are still in an unknown state:
$ kubectl get pods
NAME                                              READY     STATUS     RESTARTS   AGE
dd-agent-9gwhs                                    1/1       Running    0          1m
dd-agent-f0lfx                                    1/1       NodeLost   0          66d
dd-agent-xxcxp                                    1/1       Running    0          66d
develop-blue-6-app-deployment-3402908930-052h7    3/3       Running    0          10m
develop-blue-6-app-deployment-3402908930-6lqqh    3/3       Running    0          9m
develop-blue-6-app-deployment-3402908930-7l05h    0/3       Unknown    0          29m
develop-blue-6-app-deployment-3402908930-wf6gl    3/3       Running    0          10m
develop-blue-6-app-deployment-3402908930-xtqkl    0/3       Unknown    0          29m
develop-green-6-app-deployment-1838893913-bb9s2   2/3       Unknown    0          1d
...
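This is actually expected with a lost node: the master can no longer confirm what is running there, so it marks those pods Unknown (and the DaemonSet pod NodeLost) rather than deleting them. Had the node never come back, one option is to force-delete the stuck pods so their Deployments recreate them on healthy nodes; a sketch using one of the pod names above:

$ kubectl delete pod develop-blue-6-app-deployment-3402908930-7l05h \
    --grace-period=0 --force

This removes the pod record without waiting for the unreachable kubelet to confirm anything, so it should be used with care on stateful workloads.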
Eventually though, Google power-cycles the nonworking node, and since we already have some slack in the system, all pods are running again. Yay!
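One last bit of cleanup: the cordon persists across the power cycle, so once the node reports Ready again it has to be explicitly marked schedulable:

$ kubectl uncordon gke-contentworkshop-cent-default-pool-827dd3f8-vs6s

After that, the extra node added earlier can be resized away again if the slack is no longer needed.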