Debugging issues in GKE

@ipedrazas · Created May 24, 2018 10:17

Pods have a high number of restarts:

NAME                                     READY     STATUS    RESTARTS   AGE
details-v1-6798fccf5f-t7zqc              2/2       Running   0          22h
productpage-v1-5f7b97679-gn2js           2/2       Running   190        22h
ratings-v1-5675c99f79-66c96              2/2       Running   0          22h
reviews-v1-586cb488f9-cjxxz              2/2       Running   190        17h
reviews-v2-67ccbd89c7-4jgcr              2/2       Running   242        1d
reviews-v3-6fd9fddb9f-mczvf              2/2       Running   190        17h
staging-nc-nutcracker-579b75498c-lg5p7   2/2       Running   364        1d
xx-homepage-7f97cb6cdf-l5c8s             2/2       Running   190        17h
xx-homepage-7f97cb6cdf-r7cnx             2/2       Running   242        1d
xx-homepage-7f97cb6cdf-wv4lj             2/2       Running   190        17h
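
A quick way to surface the worst offenders is to sort by restart count (standard kubectl; this looks at the first container in each pod):

-> %  kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'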

Events from one of those pods:

  Normal   SuccessfulMountVolume  1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  MountVolume.SetUp succeeded for volume "istio-envoy"
  Normal   SuccessfulMountVolume  1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  MountVolume.SetUp succeeded for volume "default-token-92zgx"
  Warning  NetworkNotReady        1m (x3 over 1m)    kubelet, gke-lab4-default-pool-0eb9f919-qdxm  network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
  Normal   SuccessfulMountVolume  1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  MountVolume.SetUp succeeded for volume "istio-certs"
  Normal   SandboxChanged         1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Pod sandbox changed, it will be killed and re-created.
  Normal   Started                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Started container
  Normal   Created                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Created container
  Normal   Pulled                 1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Container image "docker.io/istio/proxy_init:0.7.1" already present on machine
  Normal   Pulled                 1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Container image "alpine" already present on machine
  Normal   Created                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Created container
  Normal   Pulling                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  pulling image "ipedrazas/multicluster:v0.4"
  Normal   Started                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Started container
  Normal   Pulled                 1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Successfully pulled image "ipedrazas/multicluster:v0.4"
  Normal   Created                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Created container
  Normal   Started                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Started container
  Normal   Pulled                 1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Container image "docker.io/istio/proxy:0.7.1" already present on machine
  Normal   Created                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Created container
  Normal   Started                1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Started container

Two of these events are a bit odd:

  Warning  NetworkNotReady        1m (x3 over 1m)    kubelet, gke-lab4-default-pool-0eb9f919-qdxm  network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
  Normal   SandboxChanged         1m                 kubelet, gke-lab4-default-pool-0eb9f919-qdxm  Pod sandbox changed, it will be killed and re-created.
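
The NetworkNotReady warning mentions a missing PodCIDR. Whether the node currently has one assigned can be checked directly (a sketch; the node name is taken from the events above, and an empty result would match the warning):

-> %  kubectl get node gke-lab4-default-pool-0eb9f919-qdxm -o jsonpath='{.spec.podCIDR}'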

It seems related to the node not having enough resources:
-> %  pods -owide
NAME                                     READY     STATUS        RESTARTS   AGE       IP            NODE
details-v1-6798fccf5f-t7zqc              2/2       Running       0          23h       10.8.10.5     gke-lab4-default-pool-0eb9f919-j5hm
productpage-v1-5f7b97679-26pzf           2/2       Running       0          56s       10.8.11.7     gke-lab4-default-pool-0eb9f919-w60m
productpage-v1-5f7b97679-gn2js           0/2       Terminating   194        23h       10.8.9.237    gke-lab4-default-pool-0eb9f919-qdxm
ratings-v1-5675c99f79-66c96              2/2       Running       0          23h       10.8.11.4     gke-lab4-default-pool-0eb9f919-w60m
reviews-v1-586cb488f9-cjxxz              0/2       Error         194        18h       10.8.13.240   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v2-67ccbd89c7-4jgcr              2/2       Running       248        1d        10.8.7.103    gke-lab4-default-pool-0eb9f919-0qv0
reviews-v3-6fd9fddb9f-mczvf              0/2       Error         194        18h       10.8.13.243   gke-lab4-default-pool-0eb9f919-bn5z
staging-nc-nutcracker-579b75498c-lg5p7   2/2       Running       373        1d        10.8.7.96     gke-lab4-default-pool-0eb9f919-0qv0
xx-homepage-7f97cb6cdf-l5c8s             0/2       Error         194        18h       10.8.13.242   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-r7cnx             2/2       Running       248        1d        10.8.7.107    gke-lab4-default-pool-0eb9f919-0qv0
xx-homepage-7f97cb6cdf-wv4lj             2/2       Running       194        18h       10.8.9.235    gke-lab4-default-pool-0eb9f919-qdxm
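
To test the resource hypothesis, a node's conditions (MemoryPressure, DiskPressure) and its "Allocated resources" section can be inspected directly (standard kubectl, shown for one of the suspect nodes):

-> %  kubectl describe node gke-lab4-default-pool-0eb9f919-bn5z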

All the pods on the same node are killed at the same time, which points to a node issue:

reviews-v1-586cb488f9-cjxxz              0/2       Error         194        18h       10.8.13.240   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v3-6fd9fddb9f-mczvf              0/2       Error         194        18h       10.8.13.243   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-l5c8s             0/2       Error         194        18h       10.8.13.242   gke-lab4-default-pool-0eb9f919-bn5z
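
One way to move these pods off the suspect node without waiting for it to be recycled is to cordon and drain it (standard kubectl, shown as a sketch):

-> %  kubectl cordon gke-lab4-default-pool-0eb9f919-bn5z
-> %  kubectl drain gke-lab4-default-pool-0eb9f919-bn5z --ignore-daemonsets --delete-local-data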

Wondering what will happen when the machines are recycled (1h to go):

-> %  nodes -owide
NAME                                  STATUS    ROLES     AGE       VERSION         EXTERNAL-IP      OS-IMAGE                             KERNEL-VERSION   CONTAINER-RUNTIME
gke-lab4-default-pool-0eb9f919-0qv0   Ready     <none>    23h       v1.10.2-gke.1   35.187.15.126    Container-Optimized OS from Google   4.14.22+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-bn5z   Ready     <none>    21h       v1.10.2-gke.1   35.195.192.172   Container-Optimized OS from Google   4.14.22+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-j5hm   Ready     <none>    23h       v1.10.2-gke.1   35.205.184.102   Container-Optimized OS from Google   4.14.22+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-qdxm   Ready     <none>    23h       v1.10.2-gke.1   35.233.89.91     Container-Optimized OS from Google   4.14.22+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-w60m   Ready     <none>    23h       v1.10.2-gke.1   35.233.97.202    Container-Optimized OS from Google   4.14.22+         docker://17.3.2

  • Going to GCP and deleting the node has no effect; pods keep dying.
  • Last time we had this problem we fixed it by modifying the resources of the deployed pods. It's not clear why resources are fine until they suddenly aren't; it feels more like a combination of the resources available on the node and the resources used by the pods. Noisy neighbour?
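
If adjusting requests/limits turns out to be the fix again, it can be applied in place (a sketch; the deployment name is from the listings above, but the CPU/memory values are illustrative assumptions, not the ones used previously):

# values below are illustrative; use -c <container> to target a single container
-> %  kubectl set resources deployment xx-homepage --requests=cpu=100m,memory=128Mi --limits=cpu=250m,memory=256Mi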

Once the pod is scheduled on a node that has no issues, it does not crash anymore:

-> %  pods -owide
NAME                                     READY     STATUS    RESTARTS   AGE       IP           NODE
details-v1-6798fccf5f-t7zqc              2/2       Running   0          23h       10.8.10.5    gke-lab4-default-pool-0eb9f919-j5hm
productpage-v1-5f7b97679-26pzf           2/2       Running   0          49m       10.8.11.7    gke-lab4-default-pool-0eb9f919-w60m
ratings-v1-5675c99f79-66c96              2/2       Running   0          23h       10.8.11.4    gke-lab4-default-pool-0eb9f919-w60m
reviews-v1-586cb488f9-cjxxz              2/2       Running   204        19h       10.8.13.10   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v2-67ccbd89c7-4jgcr              2/2       Running   256        2d        10.8.7.157   gke-lab4-default-pool-0eb9f919-0qv0
reviews-v3-6fd9fddb9f-mczvf              2/2       Running   204        19h       10.8.13.7    gke-lab4-default-pool-0eb9f919-bn5z
staging-nc-nutcracker-579b75498c-lg5p7   2/2       Running   385        2d        10.8.7.150   gke-lab4-default-pool-0eb9f919-0qv0
xx-homepage-7f97cb6cdf-l5c8s             2/2       Running   204        19h       10.8.13.8    gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-r7cnx             2/2       Running   256        2d        10.8.7.156   gke-lab4-default-pool-0eb9f919-0qv0
xx-homepage-7f97cb6cdf-wv4lj             2/2       Running   202        19h       10.8.9.253   gke-lab4-default-pool-0eb9f919-qdxm

This is interesting:

Because the nodes keep the same names, pods seem to "remember" their node and come back on the same machine.
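
A plausible explanation (my reading, not confirmed): a pod's spec.nodeName is set once at scheduling time and is immutable, so when the instance group recreates a machine under the same name, the surviving pod objects are still bound to it. The binding can be checked per pod:

-> %  kubectl get pod reviews-v1-586cb488f9-cjxxz -o jsonpath='{.spec.nodeName}'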

-> %  pods -owide
NAME                                     READY     STATUS              RESTARTS   AGE       IP           NODE
details-v1-6798fccf5f-t7zqc              0/2       PodInitializing     0          23h       10.8.2.2     gke-lab4-default-pool-0eb9f919-j5hm
productpage-v1-5f7b97679-26pzf           0/2       PodInitializing     0          51m       10.8.0.4     gke-lab4-default-pool-0eb9f919-w60m
ratings-v1-5675c99f79-66c96              0/2       PodInitializing     0          23h       10.8.0.6     gke-lab4-default-pool-0eb9f919-w60m
reviews-v1-586cb488f9-cjxxz              2/2       Running             204        19h       10.8.13.10   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v2-67ccbd89c7-tn5bh              0/2       Init:0/2            0          1m        <none>       gke-lab4-default-pool-0eb9f919-qdxm
reviews-v3-6fd9fddb9f-mczvf              2/2       Running             204        19h       10.8.13.7    gke-lab4-default-pool-0eb9f919-bn5z
staging-nc-nutcracker-579b75498c-nzrcs   0/2       ContainerCreating   0          1m        <none>       gke-lab4-default-pool-0eb9f919-w60m
xx-homepage-7f97cb6cdf-l5c8s             2/2       Running             204        19h       10.8.13.8    gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-pdk5r             2/2       Running             0          1m        10.8.13.16   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-wv4lj             0/2       PodInitializing     202        19h       10.8.1.5     gke-lab4-default-pool-0eb9f919-qdxm

And here we are, another GKE bug... because of the node-name behaviour described above, the pods' AGE is not correct:

-> %  pods -owide
NAME                                     READY     STATUS    RESTARTS   AGE       IP           NODE
details-v1-6798fccf5f-t7zqc              2/2       Running   0          23h       10.8.2.2     gke-lab4-default-pool-0eb9f919-j5hm
productpage-v1-5f7b97679-26pzf           2/2       Running   0          53m       10.8.0.4     gke-lab4-default-pool-0eb9f919-w60m
ratings-v1-5675c99f79-66c96              2/2       Running   0          23h       10.8.0.6     gke-lab4-default-pool-0eb9f919-w60m
reviews-v1-586cb488f9-cjxxz              2/2       Running   204        19h       10.8.13.10   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v2-67ccbd89c7-tn5bh              2/2       Running   0          3m        10.8.1.6     gke-lab4-default-pool-0eb9f919-qdxm
reviews-v3-6fd9fddb9f-mczvf              2/2       Running   204        19h       10.8.13.7    gke-lab4-default-pool-0eb9f919-bn5z
staging-nc-nutcracker-579b75498c-nzrcs   2/2       Running   1          3m        10.8.0.7     gke-lab4-default-pool-0eb9f919-w60m
xx-homepage-7f97cb6cdf-l5c8s             2/2       Running   204        19h       10.8.13.8    gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-pdk5r             2/2       Running   0          3m        10.8.13.16   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-wv4lj             2/2       Running   0          19h       10.8.1.5     gke-lab4-default-pool-0eb9f919-qdxm
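
For context on the AGE column: kubectl derives it from the pod object's metadata.creationTimestamp, so a pod object that survived the node recreation keeps reporting its original age even though its containers restarted much more recently. The raw timestamp can be checked with:

-> %  kubectl get pod xx-homepage-7f97cb6cdf-wv4lj -o jsonpath='{.metadata.creationTimestamp}'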

Pods are crashing again:

-> %  pods -owide
NAME                                     READY     STATUS    RESTARTS   AGE       IP           NODE
details-v1-6798fccf5f-t7zqc              2/2       Running   0          1d        10.8.2.2     gke-lab4-default-pool-0eb9f919-j5hm
productpage-v1-5f7b97679-26pzf           2/2       Running   0          1h        10.8.0.4     gke-lab4-default-pool-0eb9f919-w60m
ratings-v1-5675c99f79-66c96              2/2       Running   0          1d        10.8.0.6     gke-lab4-default-pool-0eb9f919-w60m
reviews-v1-586cb488f9-cjxxz              2/2       Running   206        19h       10.8.13.35   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v2-67ccbd89c7-tn5bh              2/2       Running   2          14m       10.8.1.11    gke-lab4-default-pool-0eb9f919-qdxm
reviews-v3-6fd9fddb9f-mczvf              2/2       Running   206        19h       10.8.13.32   gke-lab4-default-pool-0eb9f919-bn5z
staging-nc-nutcracker-579b75498c-nzrcs   2/2       Running   1          14m       10.8.0.7     gke-lab4-default-pool-0eb9f919-w60m
xx-homepage-7f97cb6cdf-l5c8s             2/2       Running   206        19h       10.8.13.34   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-pdk5r             2/2       Running   3          14m       10.8.13.36   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-wv4lj             2/2       Running   2          19h       10.8.1.10    gke-lab4-default-pool-0eb9f919-qdxm

Downgraded the Kubernetes version. The machines now have different names:
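
The downgrade itself goes through GKE's node-pool upgrade mechanism, roughly as below (a sketch: the cluster name lab4 is inferred from the node names, and the zone is an assumption):

# cluster name inferred from node names; zone is an assumption
-> %  gcloud container clusters upgrade lab4 --node-pool default-pool --cluster-version 1.9.7-gke.1 --zone europe-west1-b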

-> %  nodes -owide
NAME                                  STATUS    ROLES     AGE       VERSION        EXTERNAL-IP      OS-IMAGE                             KERNEL-VERSION   CONTAINER-RUNTIME
gke-lab4-default-pool-0eb9f919-9mvs   Ready     <none>    7m        v1.9.7-gke.1   35.233.126.133   Container-Optimized OS from Google   4.4.111+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-bn5z   Ready     <none>    6m        v1.9.7-gke.1   35.233.9.136     Container-Optimized OS from Google   4.4.111+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-j5hm   Ready     <none>    5m        v1.9.7-gke.1   35.195.247.255   Container-Optimized OS from Google   4.4.111+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-qdxm   Ready     <none>    4m        v1.9.7-gke.1   35.195.192.172   Container-Optimized OS from Google   4.4.111+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-w60m   Ready     <none>    3m        v1.9.7-gke.1   35.187.15.126    Container-Optimized OS from Google   4.4.111+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-wpr6   Ready     <none>    8m        v1.9.7-gke.1   35.233.97.202    Container-Optimized OS from Google   4.4.111+         docker://17.3.2

And pods have the right age:

-> %  pods -owide
NAME                                     READY     STATUS    RESTARTS   AGE       IP         NODE
details-v1-6798fccf5f-xl6xr              2/2       Running   0          7m        10.8.4.8   gke-lab4-default-pool-0eb9f919-wpr6
productpage-v1-5f7b97679-hs6s5           2/2       Running   0          5m        10.8.6.4   gke-lab4-default-pool-0eb9f919-bn5z
ratings-v1-5675c99f79-q2s54              2/2       Running   0          5m        10.8.6.3   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v1-586cb488f9-jc6bq              2/2       Running   0          6m        10.8.5.4   gke-lab4-default-pool-0eb9f919-9mvs
reviews-v2-67ccbd89c7-t4fhl              2/2       Running   0          6m        10.8.5.5   gke-lab4-default-pool-0eb9f919-9mvs
reviews-v3-6fd9fddb9f-p98rk              2/2       Running   0          6m        10.8.7.3   gke-lab4-default-pool-0eb9f919-j5hm
staging-nc-nutcracker-579b75498c-jllnc   2/2       Running   1          5m        10.8.6.7   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-6cwnh             2/2       Running   0          6m        10.8.6.9   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-6tdk5             2/2       Running   1          8m        10.8.4.5   gke-lab4-default-pool-0eb9f919-wpr6
xx-homepage-7f97cb6cdf-8pcbj             2/2       Running   1          8m        10.8.4.3   gke-lab4-default-pool-0eb9f919-wpr6

  • Kubernetes version change: v1.10.2-gke.1 --> v1.9.7-gke.1
  • Kernel version change: 4.14.22+ --> 4.4.111+
