Original message in kubernetes-novice Slack:
I am using Container Linux with Kubespray and I have found that after a short while new deployments all fail due to pod errors with:
Pod sandbox changed, it will be killed and re-created.
and subsequently, Failed create pod sandbox.
This happens repeatedly. It appears that the issue is due to Container Linux automatically updating itself and rebooting the node. In the end, deployments only work on the masters, which have a No reboots update strategy -- but in my understanding this only means the masters will not reboot automatically; the update is still downloaded, and if/when the masters are rebooted, they will end up in the same state. I have compared a node that hasn't yet updated to an updated node and can't find any differences. Has anybody else seen this?
Took it to #kubespray on the advice of a friendly user. Others have seen it, but they were messing about with routing and such. Further investigation spurred by the discussion showed that /run/flannel doesn't exist on failing nodes. It should get created on or prior to Flannel startup.
The master nodes, which have a no-reboot policy, have /etc/kubernetes/cni-flannel.yml. All of the non-masters have updated themselves, except for one case where I manually disabled the reboot. None of the non-masters have this file.
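A quick, hypothetical loop to compare nodes -- the host names and the core user are placeholders for my setup:
$ for h in master-1 node-1 node-2; do
>   echo "== $h"; ssh core@$h 'ls /etc/kubernetes/cni-flannel.yml /run/flannel' 2>&1
> done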
According to this 2015 blog post, Flannel depends on etcd. I did see earlier indications that information retrieval on node startup had issues; perhaps the information Flannel needs couldn't be retrieved, so it never configured itself properly?
So, looking at the startup logs for Kubelet, I see the following errors:
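# field 6 of journalctl's output is the klog level+date (e.g. E0216), so this keeps only error lines and strips the journald preamble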
$ journalctl -u kubelet -b | awk '$6 ~ /^E/' | cut -d' ' -f7-
20:17:52.114637 793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:474: Failed to list *v1.Node: Get https://localhost:6443/api/v1/nodes?fieldSelector=metadata.name%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:52.114655 793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:465: Failed to list *v1.Service: Get https://localhost:6443/api/v1/services?limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:52.114841 793 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:47: Failed to list *v1.Pod: Get https://localhost:6443/api/v1/pods?fieldSelector=spec.nodeName%3DCLUSTER-k8s-node-nf-2&limit=500&resourceVersion=0: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:52.831760 793 kubelet.go:1275] Image garbage collection failed once. Stats initialization may not have completed yet: failed to get imageFs info: unable to find data for container /
20:17:52.982097 793 event.go:209] Unable to write event: 'Post https://localhost:6443/api/v1/namespaces/default/events: dial tcp 127.0.0.1:6443: getsockopt: connection refused' (may retry after sleeping)
[... the three "Failed to list" errors repeat every second from here on; the distinct errors mixed in among them: ...]
20:17:54.251058 793 container_manager_linux.go:583] [ContainerManager]: Fail to get rootfs information unable to find data for container /
20:17:55.423091 793 kubelet_node_status.go:106] Unable to register node "CLUSTER-k8s-node-nf-2" with API server: Post https://localhost:6443/api/v1/nodes: dial tcp 127.0.0.1:6443: getsockopt: connection refused
20:17:56.719879 793 eviction_manager.go:238] eviction manager: unexpected err: failed to get node info: node "CLUSTER-k8s-node-nf-2" not found
20:18:03.290398 793 kuberuntime_manager.go:860] PodSandboxStatus of sandbox "1aa458cf598b9ef5107d8b1b5cc98b5de15a52714c4d78c72ba6d5fe0f0fb230" for pod "nginx-proxy-CLUSTER-k8s-node-nf-2_kube-system(921661799412eb721a9cc4d4ad76b5eb)" error: rpc error: code = Unknown desc = Error: No such container: 1aa458cf598b9ef5107d8b1b5cc98b5de15a52714c4d78c72ba6d5fe0f0fb230
20:18:03.290619 793 generic.go:241] PLEG: Ignoring events for pod nginx-proxy-CLUSTER-k8s-node-nf-2/kube-system: rpc error: code = Unknown desc = Error: No such container: 1aa458cf598b9ef5107d8b1b5cc98b5de15a52714c4d78c72ba6d5fe0f0fb230
[... more of the same ...]
20:19:09.312976 793 pod_workers.go:186] Error syncing pod 9eebfe84-1367-11e8-9b9e-fa163e68e6b9 ("kube-flannel-tgncj_kube-system(9eebfe84-1367-11e8-9b9e-fa163e68e6b9)"), skipping: failed to "StartContainer" for "kube-flannel" with CrashLoopBackOff: "Back-off 10s restarting failed container=kube-flannel pod=kube-flannel-tgncj_kube-system(9eebfe84-1367-11e8-9b9e-fa163e68e6b9)"
[... ad nauseam ... ]
Comparing iptables on a node where things work to one where they don't: the tables on the broken node are practically blank. This leads me to believe iptables state is not being restored after a reboot.
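The comparison itself was nothing fancy -- roughly this (node names are placeholders):
# a healthy node has a pile of KUBE-* and flannel rules; the broken one has only the empty default chains
$ ssh core@good-node 'sudo iptables-save | wc -l'
$ ssh core@bad-node 'sudo iptables-save | wc -l'
$ ssh core@bad-node 'sudo iptables-save'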
I have verified this by rebuilding a node (see later), verifying that it's fine with the new version of Container Linux, saving its iptables configuration to a temporary file:
$ sudo iptables-save > iptables.save
Then rebooting, verifying that iptables is blank and that Flannel is having trouble, and then restoring the firewall and restarting the kubelet:
$ sudo iptables-restore < iptables.save
$ sudo systemctl restart kubelet
Then I journalctl -u kubelet -f and things seem to be working fine. I verify the Flannel pod is running okay by determining which pods are running on the node (kubectl describe node $NODE) and then looking at the recent events for the Flannel pod (kubectl describe pod $POD -n kube-system):
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 8m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "run"
Normal SuccessfulMountVolume 8m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "cni"
Normal SuccessfulMountVolume 8m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "host-cni-bin"
Normal SuccessfulMountVolume 8m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "flannel-token-r8pw6"
Normal SuccessfulMountVolume 8m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "flannel-cfg"
Normal SandboxChanged 8m kubelet, CLUSTER-k8s-node-nf-4 Pod sandbox changed, it will be killed and re-created.
Normal Pulled 8m kubelet, CLUSTER-k8s-node-nf-4 Container image "quay.io/coreos/flannel-cni:v0.3.0" already present on machine
Normal Created 8m kubelet, CLUSTER-k8s-node-nf-4 Created container
Normal Started 8m kubelet, CLUSTER-k8s-node-nf-4 Started container
Warning BackOff 7m kubelet, CLUSTER-k8s-node-nf-4 Back-off restarting failed container
Normal Pulled 7m (x2 over 8m) kubelet, CLUSTER-k8s-node-nf-4 Container image "quay.io/coreos/flannel:v0.9.1" already present on machine
Normal Created 7m (x2 over 8m) kubelet, CLUSTER-k8s-node-nf-4 Created container
Normal Started 7m (x2 over 8m) kubelet, CLUSTER-k8s-node-nf-4 Started container
Normal SuccessfulMountVolume 7m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "cni"
Normal SuccessfulMountVolume 7m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "host-cni-bin"
Normal SuccessfulMountVolume 7m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "run"
Normal SuccessfulMountVolume 7m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "flannel-cfg"
Normal SuccessfulMountVolume 7m kubelet, CLUSTER-k8s-node-nf-4 MountVolume.SetUp succeeded for volume "flannel-token-r8pw6"
Note how the pod hasn't gone into BackOff state (you may also note I'm in a bit of a rush at the moment and didn't give it much of a chance to do so; that will have to come later).
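When I do give it a chance, something like this should do (same pod as above):
$ kubectl get pod $POD -n kube-system -w                   # watch the status column for CrashLoopBackOff
$ kubectl describe pod $POD -n kube-system | tail -n 20    # or re-check the recent events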
So now I need to determine how to restore the iptables state on a reboot, starting with whether the state is getting saved in the first place... when should that happen? When the node is provisioned or, more typically (?), at system reboot?
And is this really the problem?
For now it appears the simple fix is to enable the iptables save and restore unit files in systemd:
$ sudo systemctl enable iptables-store.service
$ sudo systemctl enable iptables-restore.service
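To see what these units actually do, and to confirm a save happened before rebooting (the rules path below is what I recall the store unit using on my nodes; verify with systemctl cat rather than trusting my memory):
$ systemctl cat iptables-store.service iptables-restore.service
$ sudo cat /var/lib/iptables/rules-save    # should contain the KUBE-* chains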
So what I want to know now is whether the problem is simply that Container Linux disabled these units by default at some point -- and if so, why?
The next day
It turns out, according to this comment, that kube-proxy should be resetting this state. I need to follow up on that, but first, I got frustrated by not being able to easily SSH into the various nodes, because I am no good at remembering the master's IP and the node cluster IPs. There appears to be a way to generate an SSH configuration from the Ansible playbooks (roles/bastion-ssh-config), but ugh, I don't know Ansible well enough to know how to trigger the task.
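In the meantime, a hand-rolled config gets me most of the way there; a sketch, with every name and IP a placeholder:
$ cat >> ~/.ssh/config <<'EOF'
Host bastion
    HostName 203.0.113.10
    User core
Host CLUSTER-k8s-*
    User core
    ProxyJump bastion
EOF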
This doesn't fix the problem, but it lets me re-spray a node after an update (a combined sketch follows the list):
- Reset the node to clear it off:
ansible-playbook --become -i contrib/terraform/openstack/hosts reset.yml --limit=$NODE
- Rebuild the cache:
ansible --become -i contrib/terraform/openstack/hosts all -m setup
- Re-spray the node:
ansible-playbook --become -i contrib/terraform/openstack/hosts cluster.yml --limit=$NODE
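Bundled into a throwaway script, under the same assumptions (inventory path as above, node name as the only argument):
$ cat > respray.sh <<'EOF'
#!/usr/bin/env bash
# reset a node, refresh the Ansible fact cache, then redeploy it
set -euo pipefail
NODE="$1"
INV=contrib/terraform/openstack/hosts
ansible-playbook --become -i "$INV" reset.yml --limit="$NODE"
ansible --become -i "$INV" all -m setup
ansible-playbook --become -i "$INV" cluster.yml --limit="$NODE"
EOF
$ bash respray.sh CLUSTER-k8s-node-nf-2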
I had to rebuild the cache because I was getting a really hard-to-parse (for me) error about, I think, the API server not having a default address defined. Reading the template in the error below, the no_proxy value falls back to each host's ansible_default_ipv4 fact, so a single host with an empty fact cache breaks the whole render:
{"msg": "The field 'environment' has an invalid value, which includes an undefined variable. The error was: {u'no_proxy': u'{{ no_proxy }}', u'https_proxy': u\"{{ https_proxy| default ('') }}\", u'http_proxy': u\"{{ http_proxy| default ('') }}\"}: {%- if loadbalancer_apiserver is defined -%} {{ apiserver_loadbalancer_domain_name| default('') }}, {{ loadbalancer_apiserver.address | default('') }}, {%- endif -%} {%- for item in (groups['k8s-cluster'] + groups['etcd'] + groups['calico-rr']|default([]))|unique -%} {{ hostvars[item]['access_ip'] | default(hostvars[item]['ip'] | default(hostvars[item]['ansible_default_ipv4']['address'])) }}, {%- if (item != hostvars[item]['ansible_hostname']) -%} {{ hostvars[item]['ansible_hostname'] }}, {{ hostvars[item]['ansible_hostname'] }}.{{ dns_domain }}, {%- endif -%} {{ item }},{{ item }}.{{ dns_domain }}, {%- endfor -%} 127.0.0.1,localhost: 'dict object' has no attribute 'ansible_default_ipv4'\n\nThe error appears to have been in 'HOME/CLUSTER/roles/kubespray-defaults/tasks/main.yaml': line 2, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n---\n- name: Configure defaults\n ^ here\n\nexception type: <class 'ansible.errors.AnsibleUndefinedVariable'>\nexception: {u'no_proxy': u'{{ no_proxy }}', u'https_proxy': u\"{{ https_proxy| default ('') }}\", u'http_proxy': u\"{{ http_proxy| default ('') }}\"}: {%- if loadbalancer_apiserver is defined -%} {{ apiserver_loadbalancer_domain_name| default('') }}, {{ loadbalancer_apiserver.address | default('') }}, {%- endif -%} {%- for item in (groups['k8s-cluster'] + groups['etcd'] + groups['calico-rr']|default([]))|unique -%} {{ hostvars[item]['access_ip'] | default(hostvars[item]['ip'] | default(hostvars[item]['ansible_default_ipv4']['address'])) }}, {%- if (item != hostvars[item]['ansible_hostname']) -%} {{ hostvars[item]['ansible_hostname'] }}, {{ hostvars[item]['ansible_hostname'] }}.{{ dns_domain }}, {%- endif -%} {{ item }},{{ item }}.{{ dns_domain }}, {%- endfor -%} 127.0.0.1,localhost: 'dict object' has no attribute 'ansible_default_ipv4'"}
I have tried this out with multiple updates, including before and after a commit which changed the iptables sync configuration for kube-proxy, and I see the problem regardless. I am starting to think the issue may have to do with re-running aborted Ansible runs: on this last build I had some stale SSH keys left over in my known_hosts file, which made Ansible unable to contact some hosts. I still don't understand where the iptables rules come from, but I was wrong in thinking iptables rules needed to be in place in order to talk to the API server. It is not proxied through forwarded iptables rules -- it's proxied by nginx-proxy, not kube-proxy.
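A quick way to convince yourself of that on a node (nginx-proxy runs as a static pod on the workers, fronting localhost:6443):
$ docker ps | grep nginx-proxy
$ curl -k https://localhost:6443/healthz    # even an auth error here proves the local proxy path is up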
On nodes which are screwed up, I am currently seeing DNS errors via kubectl logs -n kube-system $NODE. This is odd because on those nodes dig localhost @10.0.0.3 works fine. So, looking at the Flannel logs to try and figure that out, I'm seeing another issue where the subnet manager can't be started.
The subnet for 10.233.0.1 shouldn't be there.
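Two things worth capturing from a broken node before I step away (pod name taken from the kubelet log above):
$ cat /run/flannel/subnet.env    # whatever flannel last wrote for this node, if anything
$ kubectl -n kube-system logs kube-flannel-tgncj -c kube-flannel | tail -n 20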
I would like to set this aside for now; I feel like I'm going around in circles. I have the luxury of enough space to build two clusters, so I can use one for development and the other to see whether the older Kubespray deployment can keep Container Linux nodes running if I don't interrupt the Ansible run. There's no point continuing the investigation if I'm inadvertently breaking the deployments... although I've learned quite a bit about what I'm working with here. (Whee!)