@aronasorman
Created July 14, 2017 18:20
My devlog on cordoning off non-working nodes in Kubernetes

Our company uses Kubernetes to deploy contentworkshop.learningequality.org. However, when we deployed our new code, the rollout stalled, and one of the pods had a NodeLost status. Uh oh!
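
As an aside, a quick way to confirm that a rollout is stuck is kubectl rollout status, which blocks until the rollout finishes or errors out. This is just a sketch; it assumes the deployment is named develop-blue-6-app-deployment, as the pod names further down suggest:

$ kubectl rollout status deployment/develop-blue-6-app-deployment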

I checked the status of the nodes by running kubectl get nodes:

NAME                                                  STATUS     AGE       VERSION
gke-contentworkshop-cent-default-pool-827dd3f8-4h8z   Ready      66d       v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-vs6s   NotReady   66d       v1.6.2
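
To dig into why a node is NotReady, kubectl describe shows its conditions and recent events. For example, using the node name from the output above:

$ kubectl describe node gke-contentworkshop-cent-default-pool-827dd3f8-vs6s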

Indeed, it looks like one of our nodes is in a weird state and is no longer reporting anything to the Kubernetes master.

I first run kubectl cordon on the node in question, which marks it unschedulable so Kubernetes won't place any new pods on it:

$ kubectl cordon gke-contentworkshop-cent-default-pool-827dd3f8-vs6s
node "gke-contentworkshop-cent-default-pool-827dd3f8-vs6s" cordoned

In the meantime, I also increase the number of nodes in our cluster through the Google Cloud console. With that and the earlier cordon, we now have this state:

$ kubectl get nodes
NAME                                                  STATUS                        AGE       VERSION
gke-contentworkshop-cent-default-pool-827dd3f8-4h8z   Ready                         66d       v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-jkw4   Ready                         1m        v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-vs6s   NotReady,SchedulingDisabled   66d       v1.6.2
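
For reference, the same resize can be done from the command line instead of the console. This is only a sketch: the cluster name and zone below are assumptions, not values taken from the output above:

$ gcloud container clusters resize contentworkshop-central --num-nodes=3 --zone=us-central1-a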

However, weirdly enough, some of my pods are still in an Unknown state:

$ kubectl get pods
NAME                                              READY     STATUS     RESTARTS   AGE
dd-agent-9gwhs                                    1/1       Running    0          1m
dd-agent-f0lfx                                    1/1       NodeLost   0          66d
dd-agent-xxcxp                                    1/1       Running    0          66d
develop-blue-6-app-deployment-3402908930-052h7    3/3       Running    0          10m
develop-blue-6-app-deployment-3402908930-6lqqh    3/3       Running    0          9m
develop-blue-6-app-deployment-3402908930-7l05h    0/3       Unknown    0          29m
develop-blue-6-app-deployment-3402908930-wf6gl    3/3       Running    0          10m
develop-blue-6-app-deployment-3402908930-xtqkl    0/3       Unknown    0          29m
develop-green-6-app-deployment-1838893913-bb9s2   2/3       Unknown    0          1d
...
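
If the node never came back on its own, one option would be to force-delete the Unknown pods so their controllers reschedule them onto healthy nodes. A sketch, using one of the stuck pod names from above:

$ kubectl delete pod develop-blue-6-app-deployment-3402908930-7l05h --grace-period=0 --force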

Eventually, though, Google power-cycles the non-working node, and since we now have some slack in the cluster, all pods are running again. Yay!
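
One last thing to remember: cordoning is sticky. Even after the node reports Ready again, it stays SchedulingDisabled until it is explicitly uncordoned:

$ kubectl uncordon gke-contentworkshop-cent-default-pool-827dd3f8-vs6s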
