These are the test results from upgrading a cluster with the following feature flag enabled:
export KOPS_FEATURE_FLAGS=+DrainAndValidateRollingUpdate
Create a cluster
kops create cluster --zones us-east-1c --name rolling-update.aws.k8spro.com --yes
Validate cluster
kops validate cluster
Install the guestbook application
kubectl create -f guestbook-go
Edit the cluster and bump the Kubernetes version to 1.7.5.
kops edit cluster
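`kops edit cluster` opens the cluster spec in your editor; the upgrade is a single-field change. A sketch of the relevant fragment, using the versions from this walkthrough (all other fields omitted):

```yaml
# Cluster spec fragment (other fields omitted).
spec:
  # Change this from the running version (1.7.2 here) to the target version:
  kubernetesVersion: 1.7.5
```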
Update the cluster
kops update cluster --yes
Run the rolling update. I would recommend a 3-4 minute interval in AWS, but I am using 2m here so that validation will fail and the retry behavior can be observed.
kops rolling-update cluster --node-interval 2m --master-interval 2m --yes
The master is drained, deleted, and replaced.
$ kubectl get no
NAME                                          STATUS                     AGE       VERSION
ip-172-20-44-123.us-east-2.compute.internal   Ready                      2m        v1.7.5
ip-172-20-46-231.us-east-2.compute.internal   Ready                      31m       v1.7.2
ip-172-20-62-12.us-east-2.compute.internal    Ready,SchedulingDisabled   30m       v1.7.2
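Mid-rollout, a quick way to spot nodes still on the old version is to filter the VERSION column. A sketch using `awk` over the sample output above; against a live cluster you would pipe `kubectl get no` in instead of the captured variable:

```shell
# Sample `kubectl get no` output from above, captured in a variable so the
# filter can be demonstrated without a live cluster.
nodes='NAME STATUS AGE VERSION
ip-172-20-44-123.us-east-2.compute.internal Ready 2m v1.7.5
ip-172-20-46-231.us-east-2.compute.internal Ready 31m v1.7.2
ip-172-20-62-12.us-east-2.compute.internal Ready,SchedulingDisabled 30m v1.7.2'

# Print the name of every node whose VERSION column is not the target v1.7.5.
echo "$nodes" | awk 'NR > 1 && $4 != "v1.7.5" { print $1 }'
```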
The first node is drained, its pods are moved, and the node is deleted:
admin@ip-172-20-44-123:~$ kubectl get no
NAME                                          STATUS    AGE       VERSION
ip-172-20-44-123.us-east-2.compute.internal   Ready     4m        v1.7.5
ip-172-20-46-231.us-east-2.compute.internal   Ready     32m       v1.7.2
admin@ip-172-20-44-123:~$ kubectl get po
NAME                 READY     STATUS    RESTARTS   AGE
guestbook-1xdcj      1/1       Running   0          2m
guestbook-jctpl      1/1       Running   0          31m
guestbook-sx6s8      1/1       Running   0          2m
redis-master-zwjrx   1/1       Running   0          2m
redis-slave-72r3w    1/1       Running   0          2m
redis-slave-l4b0p    1/1       Running   0          31m
The cluster does not validate, because the 2m interval is not long enough. A new node starts:
$ kubectl get no
NAME                                          STATUS    AGE       VERSION
ip-172-20-44-123.us-east-2.compute.internal   Ready     6m        v1.7.5
ip-172-20-46-198.us-east-2.compute.internal   Ready     26s       v1.7.5
ip-172-20-46-231.us-east-2.compute.internal   Ready     35m       v1.7.2
The next node is drained:
$ kubectl get no
NAME                                          STATUS                     AGE       VERSION
ip-172-20-44-123.us-east-2.compute.internal   Ready                      7m        v1.7.5
ip-172-20-46-198.us-east-2.compute.internal   Ready                      1m        v1.7.5
ip-172-20-46-231.us-east-2.compute.internal   Ready,SchedulingDisabled   36m       v1.7.2
The pods are shifted to the remaining Ready node:
$ kubectl get po
NAME                 READY     STATUS    RESTARTS   AGE
guestbook-j62d5      1/1       Running   0          2m
guestbook-jj2wd      1/1       Running   0          2m
guestbook-zc25v      1/1       Running   0          2m
redis-master-s22kc   1/1       Running   0          2m
redis-slave-1zcpf    1/1       Running   0          2m
redis-slave-lv308    1/1       Running   0          2m
The node is deleted:
$ kubectl get no
NAME                                          STATUS    AGE       VERSION
ip-172-20-44-123.us-east-2.compute.internal   Ready     10m       v1.7.5
ip-172-20-46-198.us-east-2.compute.internal   Ready     3m        v1.7.5
Then the ASG brings up a replacement:
$ kubectl get no
NAME                                          STATUS     AGE       VERSION
ip-172-20-44-123.us-east-2.compute.internal   Ready      10m       v1.7.5
ip-172-20-45-95.us-east-2.compute.internal    NotReady   3s        v1.7.5
ip-172-20-46-198.us-east-2.compute.internal   Ready      3m        v1.7.5
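The brief NotReady window above is normal while the replacement instance boots. A sketch of filtering the STATUS column for anything other than plain Ready, again over the sample output (pipe `kubectl get no` in for a live check):

```shell
# Sample `kubectl get no` output from above, while the replacement boots.
nodes='NAME STATUS AGE VERSION
ip-172-20-44-123.us-east-2.compute.internal Ready 10m v1.7.5
ip-172-20-45-95.us-east-2.compute.internal NotReady 3s v1.7.5
ip-172-20-46-198.us-east-2.compute.internal Ready 3m v1.7.5'

# Print any node whose STATUS column is not exactly "Ready"; this also
# catches cordoned nodes, which report Ready,SchedulingDisabled.
echo "$nodes" | awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
```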
The log of the rolling update follows:
$ kops rolling-update cluster --node-interval 2m --master-interval 2m --yes
Using cluster from kubectl context: test.aws.k8spro.com
NAME                STATUS        NEEDUPDATE   READY   MIN   MAX   NODES
master-us-east-2c   NeedsUpdate   1            0       1     1     1
nodes               NeedsUpdate   2            0       2     2     2
I0904 19:20:37.785140 53065 instancegroups.go:269] Draining the node: "ip-172-20-58-62.us-east-2.compute.internal".
node "ip-172-20-58-62.us-east-2.compute.internal" cordoned
node "ip-172-20-58-62.us-east-2.compute.internal" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: etcd-server-events-ip-172-20-58-62.us-east-2.compute.internal, etcd-server-ip-172-20-58-62.us-east-2.compute.internal, kube-apiserver-ip-172-20-58-62.us-east-2.compute.internal, kube-controller-manager-ip-172-20-58-62.us-east-2.compute.internal, kube-proxy-ip-172-20-58-62.us-east-2.compute.internal, kube-scheduler-ip-172-20-58-62.us-east-2.compute.internal
pod "dns-controller-2912642664-q45mn" evicted
node "ip-172-20-58-62.us-east-2.compute.internal" drained
I0904 19:22:08.782256 53065 instancegroups.go:350] Stopping instance "i-03fb027a438f3a09e", node "ip-172-20-58-62.us-east-2.compute.internal", in AWS ASG "master-us-east-2c.masters.test.aws.k8spro.com".
I0904 19:24:09.255278 53065 instancegroups.go:298] Validating the cluster.
I0904 19:24:18.188988 53065 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "test.aws.k8spro.com": Get https://api.test.aws.k8spro.com/api/v1/nodes: dial tcp 18.220.252.50:443: getsockopt: operation timed out.
I0904 19:25:27.106561 53065 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "test.aws.k8spro.com": Get https://api.test.aws.k8spro.com/api/v1/nodes: dial tcp 18.220.252.50:443: getsockopt: operation timed out.
I0904 19:26:36.093120 53065 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "test.aws.k8spro.com": Get https://api.test.aws.k8spro.com/api/v1/nodes: dial tcp 18.220.252.50:443: getsockopt: operation timed out.
I0904 19:27:45.096896 53065 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "test.aws.k8spro.com": Get https://api.test.aws.k8spro.com/api/v1/nodes: dial tcp 18.220.252.50:443: getsockopt: operation timed out.
I0904 19:28:54.065467 53065 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "test.aws.k8spro.com": Get https://api.test.aws.k8spro.com/api/v1/nodes: dial tcp 18.220.252.50:443: getsockopt: operation timed out.
I0904 19:29:55.292780 53065 instancegroups.go:325] Cluster validated.
I0904 19:29:55.917807 53065 instancegroups.go:269] Draining the node: "ip-172-20-62-12.us-east-2.compute.internal".
node "ip-172-20-62-12.us-east-2.compute.internal" cordoned
node "ip-172-20-62-12.us-east-2.compute.internal" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: kube-proxy-ip-172-20-62-12.us-east-2.compute.internal
pod "redis-master-r15mv" evicted
pod "guestbook-g8q69" evicted
pod "guestbook-5l548" evicted
pod "redis-slave-2rqzx" evicted
node "ip-172-20-62-12.us-east-2.compute.internal" drained
I0904 19:31:27.184734 53065 instancegroups.go:350] Stopping instance "i-005e4eefb28fae80c", node "ip-172-20-62-12.us-east-2.compute.internal", in AWS ASG "nodes.test.aws.k8spro.com".
I0904 19:33:27.693300 53065 instancegroups.go:298] Validating the cluster.
I0904 19:33:28.406197 53065 instancegroups.go:322] Cluster did not validate, and waiting longer: your nodes are NOT ready test.aws.k8spro.com.
I0904 19:34:29.024948 53065 instancegroups.go:322] Cluster did not validate, and waiting longer: your nodes are NOT ready test.aws.k8spro.com.
I0904 19:35:29.844784 53065 instancegroups.go:325] Cluster validated.
I0904 19:35:29.844810 53065 instancegroups.go:269] Draining the node: "ip-172-20-46-231.us-east-2.compute.internal".
node "ip-172-20-46-231.us-east-2.compute.internal" cordoned
node "ip-172-20-46-231.us-east-2.compute.internal" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: kube-proxy-ip-172-20-46-231.us-east-2.compute.internal
pod "guestbook-jctpl" evicted
pod "redis-master-zwjrx" evicted
pod "kube-dns-autoscaler-1818915203-cw16z" evicted
pod "redis-slave-72r3w" evicted
pod "redis-slave-l4b0p" evicted
pod "guestbook-1xdcj" evicted
pod "guestbook-sx6s8" evicted
pod "kube-dns-479524115-qwhn3" evicted
pod "kube-dns-479524115-jfxkk" evicted
node "ip-172-20-46-231.us-east-2.compute.internal" drained
I0904 19:37:02.325394 53065 instancegroups.go:350] Stopping instance "i-0c691fe3bee4fc76f", node "ip-172-20-46-231.us-east-2.compute.internal", in AWS ASG "nodes.test.aws.k8spro.com".
I0904 19:39:02.805836 53065 instancegroups.go:298] Validating the cluster.
I0904 19:39:03.270910 53065 instancegroups.go:325] Cluster validated.
I0904 19:39:03.270952 53065 rollingupdate.go:174] Rolling update completed!
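After "Rolling update completed!", every node should report the target version. A sketch of a final check that counts distinct values in the VERSION column; the captured output below is a hypothetical post-upgrade state (node names reused from above, ages illustrative), and against a live cluster you would pipe `kubectl get no` in instead:

```shell
# Hypothetical `kubectl get no` output after the rolling update finishes.
nodes='NAME STATUS AGE VERSION
ip-172-20-44-123.us-east-2.compute.internal Ready 20m v1.7.5
ip-172-20-45-95.us-east-2.compute.internal Ready 9m v1.7.5
ip-172-20-46-198.us-east-2.compute.internal Ready 13m v1.7.5'

# List the distinct versions in the cluster; a successful upgrade
# leaves exactly one, the target v1.7.5.
echo "$nodes" | awk 'NR > 1 { print $4 }' | sort -u
```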