Our company uses Kubernetes to deploy contentworkshop.learningequality.org. However, when we deployed our new code, the deployment stalled, and one of the pods was stuck with a NodeLost status. Uh oh!
I checked the status of the nodes by running kubectl get nodes:
NAME                                                  STATUS     AGE   VERSION
gke-contentworkshop-cent-default-pool-827dd3f8-4h8z   Ready      66d   v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-vs6s   NotReady   66d   v1.6.2
Indeed, it looks like one of our nodes has ended up in a weird state and is not reporting anything to the Kubernetes master.
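Before touching anything, it's worth digging a bit deeper: kubectl describe shows the node's conditions (Ready, OutOfDisk, MemoryPressure, ...) and its recent events, which usually hints at whether the kubelet simply stopped reporting or the VM itself died:

$ kubectl describe node gke-contentworkshop-cent-default-pool-827dd3f8-vs6s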
I first start by running kubectl cordon
on the node in question, which tells Kubernetes not to schedule anything new onto that node. (Note that cordon by itself only marks the node unschedulable; evicting the pods that are already there is the job of kubectl drain, sketched below.)
$ kubectl cordon gke-contentworkshop-cent-default-pool-827dd3f8-vs6s
node "gke-contentworkshop-cent-default-pool-827dd3f8-vs6s" cordoned
In the meantime, I also increase the number of nodes in our cluster through the Google Cloud interface. With that and the previous cordon
run, we now have this state:
$ kubectl get nodes
NAME                                                  STATUS                        AGE   VERSION
gke-contentworkshop-cent-default-pool-827dd3f8-4h8z   Ready                         66d   v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-jkw4   Ready                         1m    v1.6.2
gke-contentworkshop-cent-default-pool-827dd3f8-vs6s   NotReady,SchedulingDisabled   66d   v1.6.2
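(As an aside, the resize can also be scripted instead of clicked through the console; a sketch, where the cluster name, node pool, and zone are assumptions inferred from the node names above:

$ gcloud container clusters resize contentworkshop-cent \
    --node-pool default-pool --size 3 --zone us-central1-a

Check gcloud container clusters list for the real names before running this.)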
However, weirdly enough, some of my pods are still in an unknown state:
$ kubectl get pods
NAME                                              READY     STATUS     RESTARTS   AGE
dd-agent-9gwhs                                    1/1       Running    0          1m
dd-agent-f0lfx                                    1/1       NodeLost   0          66d
dd-agent-xxcxp                                    1/1       Running    0          66d
develop-blue-6-app-deployment-3402908930-052h7    3/3       Running    0          10m
develop-blue-6-app-deployment-3402908930-6lqqh    3/3       Running    0          9m
develop-blue-6-app-deployment-3402908930-7l05h    0/3       Unknown    0          29m
develop-blue-6-app-deployment-3402908930-wf6gl    3/3       Running    0          10m
develop-blue-6-app-deployment-3402908930-xtqkl    0/3       Unknown    0          29m
develop-green-6-app-deployment-1838893913-bb9s2   2/3       Unknown    0          1d
...
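This is actually expected with a lost node: the master can no longer confirm what is running there, so it marks those pods Unknown (and the DaemonSet pod NodeLost) rather than deleting them. Had the node never come back, one option is to force-delete the stuck pods so their Deployments recreate them on healthy nodes; a sketch using one of the pod names above:

$ kubectl delete pod develop-blue-6-app-deployment-3402908930-7l05h \
    --grace-period=0 --force

This removes the pod record without waiting for the unreachable kubelet to confirm anything, so it should be used with care on stateful workloads.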
Eventually though, Google power-cycles the nonworking node, and since we already have some slack in the system, all pods are running again. Yay!
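One last bit of cleanup: the cordon persists across the power cycle, so once the node reports Ready again it has to be explicitly marked schedulable:

$ kubectl uncordon gke-contentworkshop-cent-default-pool-827dd3f8-vs6s

After that, the extra node added earlier can be resized away again if the slack is no longer needed.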