Pods have a high number of restarts
```
NAME                                     READY   STATUS    RESTARTS   AGE
details-v1-6798fccf5f-t7zqc              2/2     Running   0          22h
productpage-v1-5f7b97679-gn2js           2/2     Running   190        22h
ratings-v1-5675c99f79-66c96              2/2     Running   0          22h
reviews-v1-586cb488f9-cjxxz              2/2     Running   190        17h
reviews-v2-67ccbd89c7-4jgcr              2/2     Running   242        1d
reviews-v3-6fd9fddb9f-mczvf              2/2     Running   190        17h
staging-nc-nutcracker-579b75498c-lg5p7   2/2     Running   364        1d
xx-homepage-7f97cb6cdf-l5c8s             2/2     Running   190        17h
xx-homepage-7f97cb6cdf-r7cnx             2/2     Running   242        1d
xx-homepage-7f97cb6cdf-wv4lj             2/2     Running   190        17h
```
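In a bigger cluster, sorting by restart count makes the offenders easier to spot; a minimal sketch using kubectl's built-in sorter (the JSONPath assumes the first container in each pod is the one restarting):

```
# List pods ordered by the restart count of their first container (highest last)
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'
```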
Events from one of those pods (via `kubectl describe pod`):
```
Normal SuccessfulMountVolume 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm MountVolume.SetUp succeeded for volume "istio-envoy"
Normal SuccessfulMountVolume 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm MountVolume.SetUp succeeded for volume "default-token-92zgx"
Warning NetworkNotReady 1m (x3 over 1m) kubelet, gke-lab4-default-pool-0eb9f919-qdxm network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Normal SuccessfulMountVolume 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm MountVolume.SetUp succeeded for volume "istio-certs"
Normal SandboxChanged 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Pod sandbox changed, it will be killed and re-created.
Normal Started 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Started container
Normal Created 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Created container
Normal Pulled 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Container image "docker.io/istio/proxy_init:0.7.1" already present on machine
Normal Pulled 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Container image "alpine" already present on machine
Normal Created 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Created container
Normal Pulling 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm pulling image "ipedrazas/multicluster:v0.4"
Normal Started 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Started container
Normal Pulled 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Successfully pulled image "ipedrazas/multicluster:v0.4"
Normal Created 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Created container
Normal Started 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Started container
Normal Pulled 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Container image "docker.io/istio/proxy:0.7.1" already present on machine
Normal Created 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Created container
Normal Started 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Started container
```
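The events only show the sandbox being rebuilt; the actual exit reason of a restarted container lives in its previous logs. A sketch, assuming the app container in that pod is named `productpage` (the sidecar would be `istio-proxy`):

```
# Logs from the container instance that crashed before the current one
kubectl logs productpage-v1-5f7b97679-gn2js -c productpage --previous
```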
Two of these messages are a bit weird:
```
Warning NetworkNotReady 1m (x3 over 1m) kubelet, gke-lab4-default-pool-0eb9f919-qdxm network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: Kubenet does not have netConfig. This is most likely due to lack of PodCIDR]
Normal SandboxChanged 1m kubelet, gke-lab4-default-pool-0eb9f919-qdxm Pod sandbox changed, it will be killed and re-created.
```
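The warning claims kubenet has no PodCIDR for the node. Whether the allocation is actually missing can be read straight off the node object; a quick sketch:

```
# Print the PodCIDR assigned to the suspect node; empty output means no allocation
kubectl get node gke-lab4-default-pool-0eb9f919-qdxm -o jsonpath='{.spec.podCIDR}'
```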
It seems related to:
- the node not having enough resources

Checking where the pods are scheduled (`pods -owide` here is a shell alias for `kubectl get pods -o wide`):
```
-> % pods -owide
NAME                                     READY   STATUS        RESTARTS   AGE   IP            NODE
details-v1-6798fccf5f-t7zqc              2/2     Running       0          23h   10.8.10.5     gke-lab4-default-pool-0eb9f919-j5hm
productpage-v1-5f7b97679-26pzf           2/2     Running       0          56s   10.8.11.7     gke-lab4-default-pool-0eb9f919-w60m
productpage-v1-5f7b97679-gn2js           0/2     Terminating   194        23h   10.8.9.237    gke-lab4-default-pool-0eb9f919-qdxm
ratings-v1-5675c99f79-66c96              2/2     Running       0          23h   10.8.11.4     gke-lab4-default-pool-0eb9f919-w60m
reviews-v1-586cb488f9-cjxxz              0/2     Error         194        18h   10.8.13.240   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v2-67ccbd89c7-4jgcr              2/2     Running       248        1d    10.8.7.103    gke-lab4-default-pool-0eb9f919-0qv0
reviews-v3-6fd9fddb9f-mczvf              0/2     Error         194        18h   10.8.13.243   gke-lab4-default-pool-0eb9f919-bn5z
staging-nc-nutcracker-579b75498c-lg5p7   2/2     Running       373        1d    10.8.7.96     gke-lab4-default-pool-0eb9f919-0qv0
xx-homepage-7f97cb6cdf-l5c8s             0/2     Error         194        18h   10.8.13.242   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-r7cnx             2/2     Running       248        1d    10.8.7.107    gke-lab4-default-pool-0eb9f919-0qv0
xx-homepage-7f97cb6cdf-wv4lj             2/2     Running       194        18h   10.8.9.235    gke-lab4-default-pool-0eb9f919-qdxm
```
All the pods on the same node are killed at the same time, which points to a node issue:
```
reviews-v1-586cb488f9-cjxxz    0/2   Error   194   18h   10.8.13.240   gke-lab4-default-pool-0eb9f919-bn5z
reviews-v3-6fd9fddb9f-mczvf    0/2   Error   194   18h   10.8.13.243   gke-lab4-default-pool-0eb9f919-bn5z
xx-homepage-7f97cb6cdf-l5c8s   0/2   Error   194   18h   10.8.13.242   gke-lab4-default-pool-0eb9f919-bn5z
```
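If the node itself is the problem, its conditions and resource pressure should show it. A sketch of the usual checks (`kubectl top` needs a metrics pipeline such as Heapster, which GKE ships by default):

```
# Node conditions (MemoryPressure, DiskPressure, Ready) plus allocated resources
kubectl describe node gke-lab4-default-pool-0eb9f919-bn5z

# Live CPU/memory usage per node
kubectl top node
```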
Wonder what will happen when the machines are recycled (1h to go):
```
-> % nodes -owide
NAME                                  STATUS   ROLES    AGE   VERSION         EXTERNAL-IP      OS-IMAGE                             KERNEL-VERSION   CONTAINER-RUNTIME
gke-lab4-default-pool-0eb9f919-0qv0   Ready    <none>   23h   v1.10.2-gke.1   35.187.15.126    Container-Optimized OS from Google   4.14.22+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-bn5z   Ready    <none>   21h   v1.10.2-gke.1   35.195.192.172   Container-Optimized OS from Google   4.14.22+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-j5hm   Ready    <none>   23h   v1.10.2-gke.1   35.205.184.102   Container-Optimized OS from Google   4.14.22+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-qdxm   Ready    <none>   23h   v1.10.2-gke.1   35.233.89.91     Container-Optimized OS from Google   4.14.22+         docker://17.3.2
gke-lab4-default-pool-0eb9f919-w60m   Ready    <none>   23h   v1.10.2-gke.1   35.233.97.202    Container-Optimized OS from Google   4.14.22+         docker://17.3.2
```
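The fixed recycling schedule suggests the pool might be running on preemptible VMs (an assumption, not confirmed here); GKE labels those nodes, so it is easy to verify:

```
# Adds a column with the gke-preemptible label; "true" means the VM is reclaimed within 24h
kubectl get nodes -L cloud.google.com/gke-preemptible
```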
* Going to GCP and deleting the node has no effect; pods keep dying.
* Last time we had this problem we fixed it by adjusting the resource requests and limits of the deployed pods. It's not clear why resources are fine until suddenly they aren't; it feels more like a combination of the resources available on the node and the resources consumed by the pods. Noisy neighbour? One option is to pin requests and limits again, as sketched below.
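If tuning pod resources fixed it last time, the same change can be applied without touching the manifests; a minimal sketch with illustrative values only (the deployment name is inferred from the pod names):

```
# Set explicit requests/limits so the scheduler and kubelet can account for the pod;
# this triggers a rolling restart of the deployment (values are placeholders to tune)
kubectl set resources deployment xx-homepage \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=512Mi
```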