@dleske
Last active February 23, 2018 05:35
k8s: Container Linux breaks Flannel after update/reboot

Original message in kubernetes-novice Slack:

I am using Container Linux with Kubespray and I have found that after a short while new deployments all fail due to pod errors with "Pod sandbox changed, it will be killed and re-created." and, subsequently, "Failed create pod sandbox". This happens repeatedly. It appears that the issue is due to Container Linux automatically updating itself and rebooting the node. In the end, deployments only work on the masters, which have a "No reboots" update strategy, but in my understanding this only means the masters will not reboot automatically--the update is still downloaded, and if/when the masters are rebooted, they will also be in the same state. I have compared a node that hasn’t yet updated to an updated node and can’t find any differences. Has anybody else seen this?

Took it to #kubespray on the advice of a friendly user. Others have seen it, but they were messing about with routing and such. Further investigation spurred by the discussion showed that /run/flannel doesn't exist on failing nodes; it should be created at or before Flannel startup.
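
A quick way to compare nodes (Flannel normally writes its network lease into subnet.env under this directory, so its absence is a reasonable proxy for "Flannel never configured itself"):

$ ls -l /run/flannel/subnet.env   # present on a healthy node, missing on the failing ones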

The master nodes, which have a no-reboot policy, have /etc/kubernetes/cni-flannel.yml. All of the non-masters have updated themselves except for one where I manually disabled the reboot. None of the non-masters have this file.
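
For reference, automatic reboots on Container Linux are driven by locksmithd's reboot strategy; a rough sketch of how one might disable them on a node, assuming the stock /etc/coreos/update.conf mechanism (not necessarily how Kubespray configures it):

$ echo 'REBOOT_STRATEGY=off' | sudo tee -a /etc/coreos/update.conf
$ sudo systemctl restart locksmithd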

According to this 2015 blog post, Flannel depends on etcd. I did see earlier indications that information retrieval on node startup had issues; perhaps the configuration couldn't be retrieved, so Flannel never set itself up properly?

So, looking at the startup logs for Kubelet, I see the following errors:

$ journalctl -u kubelet -b | awk '$6 ~ /^E/' | cut -d' ' -f7-
20:17:52.114637     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:52.114655     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:52.114841     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:52.831760     793 kubelet.go:1275] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
20:17:52.982097     793 event.go:209] Unable to write event: 'Post https://localhost:6443/api/v1/namespaces/default/events: dial tcp 127.0.0.1:6443: getsockopt: connection refused' (may retry after sleeping)
20:17:53.115875     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:53.116379     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:53.117454     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:54.117113     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:54.117967     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:54.118999     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:54.251058     793 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
20:17:55.118247     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:55.118885     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:55.119922     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:55.251704     793 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
20:17:55.423091     793 kubelet_node_status.go:106] Unable to register node "CLUSTER-k8s-node-nf-2" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:56.119009     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:56.120042     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:56.121176     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:56.251918     793 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
20:17:56.719879     793 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "CLUSTER-k8s-node-nf-2" not found
20:17:57.055760     793 kubelet_node_status.go:106] Unable to register node "CLUSTER-k8s-node-nf-2" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:57.119898     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:57.120856     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:57.121584     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:58.121506     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:58.122340     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:58.123438     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:58.398601     793 kubelet_node_status.go:106] Unable to register node "CLUSTER-k8s-node-nf-2" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:59.122603     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:59.123712     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:59.124793     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:59.357489     793 event.go:209] Unable to write event: 'Post https://localhost:6443/api/v1/namespaces/default/events: dial tcp 127.0.0.1:6443: getsockopt: connection refused' (may retry after sleeping)
20:18:00.123313     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:00.124245     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:00.125363     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:00.144142     793 kubelet_node_status.go:106] Unable to register node "CLUSTER-k8s-node-nf-2" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:01.124116     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:01.125257     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:01.126193     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:02.124624     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:02.125708     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:02.126738     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:02.756473     793 kubelet_node_status.go:106] Unable to register node "CLUSTER-k8s-node-nf-2" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:03.125511     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:03.126448     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:03.127597     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:03.290398     793 kuberuntime_manager.go:860] PodSandboxStatus of sandbox "1aa458cf598b9ef5107d8b1b5cc98b5de15a52714c4d78c72ba6d5fe0f0fb230" for pod "nginx-proxy-CLUSTER-k8s-node-nf-2_kube-system(921661799412eb721a9cc4d4ad76b5eb)" error: rpc error: code = Unknown desc = Error: No such container: 1aa458cf598b9ef5107d8b1b5cc98b5de15a52714c4d78c72ba6d5fe0f0fb230
20:18:03.290619     793 generic.go:241] PLEG: Ignoring events for pod nginx-proxy-CLUSTER-k8s-node-nf-2/kube-system: rpc error: code = Unknown desc = Error: No such container: 1aa458cf598b9ef5107d8b1b5cc98b5de15a52714c4d78c72ba6d5fe0f0fb230
20:18:04.126381     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:04.127282     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:04.128007     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:05.127096     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:05.127906     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:05.128908     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:06.127783     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:06.128585     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:06.129904     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:06.720220     793 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "CLUSTER-k8s-node-nf-2" not found
20:18:07.104924     793 kubelet_node_status.go:106] Unable to register node "CLUSTER-k8s-node-nf-2" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:07.128713     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:07.129738     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:07.131176     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:08.129267     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:08.130261     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:08.131647     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:09.129970     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:09.130875     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:09.132026     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:09.358311     793 event.go:209] Unable to write event: 'Post https://localhost:6443/api/v1/namespaces/default/events: dial tcp 127.0.0.1:6443: getsockopt: connection refused' (may retry after sleeping)
20:18:10.131202     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:10.132025     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:10.133055     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:11.132296     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:11.133204     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:11.134245     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:12.132793     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:12.133839     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:12.134844     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:13.133849     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:13.134679     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:13.135757     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:14.134445     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:14.135833     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:14.136375     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:14.463039     793 kubelet_node_status.go:106] Unable to register node "CLUSTER-k8s-node-nf-2" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:15.135181     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:15.136306     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:15.137391     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:16.136087     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:16.137042     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:16.138126     793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:18:16.720724     793 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "CLUSTER-k8s-node-nf-2" not found
20:19:09.312976     793 pod_workers.go:186] Error syncing pod 9eebfe84-1367-11e8-9b9e-fa163e68e6b9 ("kube-flannel-tgncj_kube-system(9eebfe84-1367-11e8-9b9e-fa163e68e6b9)"), skipping: failed to "StartContainer" for "kube-flannel" with CrashLoopBackOff: "Back-off 10s restarting failed container=kube-flannel pod=kube-flannel-tgncj_kube-system(9eebfe84-1367-11e8-9b9e-fa163e68e6b9)"
[... ad nauseam ... ]

Comparing iptables on a node where it works to one where it doesn't, the ruleset on the broken node is practically blank. This leads me to believe the iptables state is not being restored after a reboot.
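
A rough way to do the comparison without eyeballing the full dumps is to count the rules on each node:

$ sudo iptables-save | grep -c '^-A'   # plenty of rules on a healthy node, next to none on a broken one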

So iptables is the problem

I have verified this by rebuilding a node (see later), verifying that it's fine with the new version of Container Linux, and saving its iptables configuration to a temporary file:

$ sudo iptables-save > iptables.save

Then rebooting, verifying that the iptables ruleset is blank and that Flannel is having trouble, and then restoring the firewall and restarting the kubelet:

$ sudo iptables-restore < iptables.save
$ sudo systemctl restart kubelet

Then I run journalctl -u kubelet -f and things seem to be working fine. I verify the Flannel pod is running okay by determining which pods are running on the node (kubectl describe node $NODE) and then looking at the recent events for the Flannel pod (kubectl describe pod $POD -n kube-system):

Events:
  Type     Reason                 Age              From                         Message
  ----     ------                 ----             ----                         -------
  Normal   SuccessfulMountVolume  8m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "run"
  Normal   SuccessfulMountVolume  8m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "cni"
  Normal   SuccessfulMountVolume  8m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "host-cni-bin"
  Normal   SuccessfulMountVolume  8m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "flannel-token-r8pw6"
  Normal   SuccessfulMountVolume  8m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "flannel-cfg"
  Normal   SandboxChanged         8m               kubelet, CLUSTER-k8s-node-nf-4  Pod sandbox changed, it will be killed and re-created.
  Normal   Pulled                 8m               kubelet, CLUSTER-k8s-node-nf-4  Container image "quay.io/coreos/flannel-cni:v0.3.0" already present on machine
  Normal   Created                8m               kubelet, CLUSTER-k8s-node-nf-4  Created container
  Normal   Started                8m               kubelet, CLUSTER-k8s-node-nf-4  Started container
  Warning  BackOff                7m               kubelet, CLUSTER-k8s-node-nf-4  Back-off restarting failed container
  Normal   Pulled                 7m (x2 over 8m)  kubelet, CLUSTER-k8s-node-nf-4  Container image "quay.io/coreos/flannel:v0.9.1" already present on machine
  Normal   Created                7m (x2 over 8m)  kubelet, CLUSTER-k8s-node-nf-4  Created container
  Normal   Started                7m (x2 over 8m)  kubelet, CLUSTER-k8s-node-nf-4  Started container
  Normal   SuccessfulMountVolume  7m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "cni"
  Normal   SuccessfulMountVolume  7m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "host-cni-bin"
  Normal   SuccessfulMountVolume  7m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "run"
  Normal   SuccessfulMountVolume  7m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "flannel-cfg"
  Normal   SuccessfulMountVolume  7m               kubelet, CLUSTER-k8s-node-nf-4  MountVolume.SetUp succeeded for volume "flannel-token-r8pw6"

Note how the pod hasn't gone into BackOff state (you may also note I'm in a bit of a rush at the moment and didn't give it much of a chance to do so; that will have to come later).
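
When I do come back to it, a simple way to keep an eye on the pod without re-running describe (plain kubectl, nothing specific to this setup):

$ kubectl get pods -n kube-system -o wide --watch | grep flannel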

So now I need to determine how to restore the iptables state on a reboot, starting with whether the state is getting saved in the first place... when should that happen? When the node is provisioned or, more typically (?), at system reboot?
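
A first step would be checking whether the save/restore units even exist and are enabled on a node (these are the same units enabled below):

$ systemctl is-enabled iptables-store.service iptables-restore.service
$ systemctl cat iptables-store.service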

And is this really the problem?

For now it appears the simple fix is to enable the iptables save and restore unit files in systemd:

$ sudo systemctl enable iptables-store.service
$ sudo systemctl enable iptables-restore.service

So what I want to know now is whether the problem is simply that Container Linux disabled these units by default at some point, and if so, why.

The next day

It turns out, according to this comment, that kube-proxy should be resetting this state. I need to follow up on this, but first I got frustrated by not being able to easily SSH into the various nodes, because I am no good at remembering the master's IP and the node IPs. There appears to be a way to generate an SSH configuration from the Ansible playbooks (roles/bastion-ssh-config), but ugh, I don't know Ansible well enough to know how to trigger the task.
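
In the meantime, a hand-rolled workaround is just to drop aliases into ~/.ssh/config; the host alias, IP and bastion name below are made up for illustration (core is the default user on Container Linux):

$ cat >> ~/.ssh/config <<'EOF'
Host node-nf-2
    HostName 10.0.0.12     # hypothetical node IP
    User core
    ProxyJump bastion      # hypothetical bastion host alias
EOF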

Ancillary notes

To rebuild a node

This doesn't fix the problem, but allows me to re-spray a node after an update.

  1. Reset the node to clear it off:
ansible-playbook --become -i contrib/terraform/openstack/hosts reset.yml --limit=$NODE
  2. Rebuild the cache:
ansible --become -i contrib/terraform/openstack/hosts all -m setup
  3. Re-spray the node:
ansible-playbook --become -i contrib/terraform/openstack/hosts cluster.yml --limit=$NODE

I had to rebuild the cache because I was getting a really hard-to-parse (for me) error about, I think, the API server not having a default address defined:

{"msg": "The field 'environment' has an invalid value, which includes an undefined variable. The error was: {u'no_proxy': u'{{ no_proxy }}', u'https_proxy': u\"{{ https_proxy| default ('') }}\", u'http_proxy': u\"{{ http_proxy| default ('') }}\"}: {%- if loadbalancer_apiserver is defined -%} {{ apiserver_loadbalancer_domain_name| default('') }}, {{ loadbalancer_apiserver.address | default('') }}, {%- endif -%} {%- for item in (groups['k8s-cluster'] + groups['etcd'] + groups['calico-rr']|default([]))|unique -%} {{ hostvars[item]['access_ip'] | default(hostvars[item]['ip'] | default(hostvars[item]['ansible_default_ipv4']['address'])) }}, {%-   if (item != hostvars[item]['ansible_hostname']) -%} {{ hostvars[item]['ansible_hostname'] }}, {{ hostvars[item]['ansible_hostname'] }}.{{ dns_domain }}, {%-   endif -%} {{ item }},{{ item }}.{{ dns_domain }}, {%- endfor -%} 127.0.0.1,localhost: 'dict object' has no attribute 'ansible_default_ipv4'\n\nThe error appears to have been in 'HOME/CLUSTER/roles/kubespray-defaults/tasks/main.yaml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Configure defaults\n  ^ here\n\nexception type: <class 'ansible.errors.AnsibleUndefinedVariable'>\nexception: {u'no_proxy': u'{{ no_proxy }}', u'https_proxy': u\"{{ https_proxy| default ('') }}\", u'http_proxy': u\"{{ http_proxy| default ('') }}\"}: {%- if loadbalancer_apiserver is defined -%} {{ apiserver_loadbalancer_domain_name| default('') }}, {{ loadbalancer_apiserver.address | default('') }}, {%- endif -%} {%- for item in (groups['k8s-cluster'] + groups['etcd'] + groups['calico-rr']|default([]))|unique -%} {{ hostvars[item]['access_ip'] | default(hostvars[item]['ip'] | default(hostvars[item]['ansible_default_ipv4']['address'])) }}, {%-   if (item != hostvars[item]['ansible_hostname']) -%} {{ hostvars[item]['ansible_hostname'] }}, {{ hostvars[item]['ansible_hostname'] }}.{{ dns_domain }}, {%-   endif -%} {{ item }},{{ item }}.{{ dns_domain }}, {%- endfor -%} 127.0.0.1,localhost: 'dict object' has no attribute 'ansible_default_ipv4'"}

dleske commented Feb 23, 2018

I have tried this out with multiple updates, including before and after a commit which made changes to the iptables sync configuration for kube-proxy, and I see the problem regardless. I am starting to think the issue may have to do with re-running aborted Ansible runs. On this last build I had some stale SSH host keys left over in my known_hosts file, which made Ansible unable to contact some hosts.
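
For the record, clearing the stale entries before re-running Ansible is just standard OpenSSH housekeeping (the argument is whichever host name or IP is stale):

$ ssh-keygen -R $NODE_IP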

I still don't understand where the iptables rules come from, but I was wrong in thinking iptables rules needed to be in place in order to talk to the API server. API server traffic is not forwarded via iptables -- it's proxied by nginx-proxy, not kube-proxy.
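
Given that, a quick sanity check on a broken node would be whether the nginx-proxy container is actually up and whether anything answers on localhost:6443 (just a sketch; I haven't captured output for this here):

$ docker ps | grep nginx-proxy
$ curl -k https://localhost:6443/healthz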

On nodes which are screwed up, I am currently seeing the following via kubectl logs -n kube-system $NODE:

E0223 01:02:59.372088       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp: lookup localhost on 10.0.0.3:53: no such host
E0223 01:02:59.872711       1 reflector.go:205] k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:85: Failed to list *core.Endpoints: Get https://localhost:6443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp: lookup localhost on 10.0.0.3:53: no such host

This is odd because on those nodes dig localhost @10.0.0.3 works fine.

So looking at Flannel logs to try and figure that out, I'm seeing another issue where the subnet manager can't be started:

E0223 01:10:14.460071       1 main.go:231] Failed to create SubnetManager: error retrieving pod spec for 'kube-system/kube-flannel-5t69n': Get https://10.233.0.1:443/api/v1/namespaces/kube-system/pods/kube-flannel-5t69n: dial tcp 10.233.0.1:443: i/o timeout

The subnet for 10.233.0.1 shouldn't be there.
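
Assuming 10.233.0.1 is the kubernetes service ClusterIP (the Kubespray default service range), a couple of quick checks for whether the node can reach it at all might be:

$ ip route get 10.233.0.1
$ curl -k --max-time 5 https://10.233.0.1:443/version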

I would like to set this aside for now; I feel like I'm going around in circles. I have the luxury of having enough space to build two clusters. I can use one for development and use the other to see if the older kubespray deployment can maintain running CL nodes, if I don't interrupt the Ansible run. There's no point continuing the investigation if I'm inadvertently breaking the deployments... although I've learned quite a bit about what I'm working with here. (Whee!)
