Attempt to recreate the problem @justinsb is having with a 2s interval
Create cluster
kops create cluster --zones us-east-1c --name rolling-update.aws.k8spro.com --yes
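(This assumes a kops state store is already configured, for example via an environment variable with a hypothetical bucket name:)
export KOPS_STATE_STORE=s3://example-kops-state-store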
Validate the cluster
$ kops validate cluster
Using cluster from kubectl context: rolling-update.aws.k8spro.com
Validating cluster rolling-update.aws.k8spro.com
INSTANCE GROUPS
NAME               ROLE    MACHINETYPE  MIN  MAX  SUBNETS
master-us-east-1c  Master  m3.medium    1    1    us-east-1c
nodes              Node    t2.medium    2    2    us-east-1c
NODE STATUS
NAME                           ROLE    READY
ip-172-20-36-149.ec2.internal  node    True
ip-172-20-51-88.ec2.internal   node    True
ip-172-20-52-25.ec2.internal   master  True
Your cluster rolling-update.aws.k8spro.com is ready
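(Equivalently, node readiness can be checked directly with kubectl:)
$ kubectl get nodes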
Upgrade the cluster
$ kops upgrade cluster --channel alpha --yes
Using cluster from kubectl context: rolling-update.aws.k8spro.com
ITEM     PROPERTY           OLD     NEW
Cluster  Channel            stable  alpha
Cluster  KubernetesVersion  1.7.2   1.7.4
Updates applied to configuration.
You can now apply these changes, using `kops update cluster rolling-update.aws.k8spro.com`
Update the cluster
kops update cluster rolling-update.aws.k8spro.com --yes
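(To preview the pending changes first, the same command can be run without --yes, mirroring the rolling-update dry run below:)
kops update cluster rolling-update.aws.k8spro.com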
See how the rolling update will work
$ kops rolling-update cluster --node-interval 2s --master-interval 2s
Using cluster from kubectl context: rolling-update.aws.k8spro.com
NAME               STATUS       NEEDUPDATE  READY  MIN  MAX  NODES
master-us-east-1c  NeedsUpdate  1           0      1    1    1
nodes              NeedsUpdate  2           0      2    2    2
Must specify --yes to rolling-update.
Roll the cluster
$ kops rolling-update cluster --node-interval 2s --master-interval 2s --yes
Pods are drained and evicted from the master
$ kubectl -n kube-system get po
NAME                                                  READY  STATUS   RESTARTS  AGE
dns-controller-2912642664-4w8d4                       0/1    Pending  0         1m
etcd-server-events-ip-172-20-52-25.ec2.internal       1/1    Running  0         6m
etcd-server-ip-172-20-52-25.ec2.internal              1/1    Running  0         5m
kube-apiserver-ip-172-20-52-25.ec2.internal           1/1    Running  1         7m
kube-controller-manager-ip-172-20-52-25.ec2.internal  1/1    Running  1         5m
kube-dns-479524115-5mj9m                              3/3    Running  0         4m
kube-dns-479524115-ktc5t                              3/3    Running  0         6m
kube-dns-autoscaler-1818915203-rx2l6                  1/1    Running  0         6m
kube-proxy-ip-172-20-36-149.ec2.internal              1/1    Running  0         4m
kube-proxy-ip-172-20-51-88.ec2.internal               1/1    Running  0         5m
kube-proxy-ip-172-20-52-25.ec2.internal               1/1    Running  0         7m
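(To follow the drain and the master replacement in real time, the same listing can be watched:)
$ kubectl -n kube-system get po -w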
The new master starts, but the rolling update times out as it should:
$ kops rolling-update cluster --node-interval 2s --master-interval 2s --yes
Using cluster from kubectl context: rolling-update.aws.k8spro.com
NAME               STATUS       NEEDUPDATE  READY  MIN  MAX  NODES
master-us-east-1c  NeedsUpdate  1           0      1    1    1
nodes              NeedsUpdate  2           0      2    2    2
I0904 19:56:58.331619 56145 instancegroups.go:269] Draining the node: "ip-172-20-52-25.ec2.internal".
node "ip-172-20-52-25.ec2.internal" cordoned
node "ip-172-20-52-25.ec2.internal" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: etcd-server-events-ip-172-20-52-25.ec2.internal, etcd-server-ip-172-20-52-25.ec2.internal, kube-apiserver-ip-172-20-52-25.ec2.internal, kube-controller-manager-ip-172-20-52-25.ec2.internal, kube-proxy-ip-172-20-52-25.ec2.internal, kube-scheduler-ip-172-20-52-25.ec2.internal
pod "dns-controller-2912642664-vqs6z" evicted
node "ip-172-20-52-25.ec2.internal" drained
I0904 19:58:30.653936 56145 instancegroups.go:350] Stopping instance "i-0cdd9ee4181767dc9", node "ip-172-20-52-25.ec2.internal", in AWS ASG "master-us-east-1c.masters.rolling-update.aws.k8spro.com".
I0904 19:58:33.124892 56145 instancegroups.go:298] Validating the cluster.
I0904 19:58:33.586672 56145 instancegroups.go:325] Cluster validated.
I0904 19:58:34.422472 56145 instancegroups.go:269] Draining the node: "ip-172-20-36-149.ec2.internal".
node "ip-172-20-36-149.ec2.internal" cordoned
node "ip-172-20-36-149.ec2.internal" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: kube-proxy-ip-172-20-36-149.ec2.internal
pod "kube-dns-autoscaler-1818915203-rx2l6" evicted
pod "kube-dns-479524115-5mj9m" evicted
node "ip-172-20-36-149.ec2.internal" drained
I0904 20:00:05.379275 56145 instancegroups.go:350] Stopping instance "i-0513281d4d9babedc", node "ip-172-20-36-149.ec2.internal", in AWS ASG "nodes.rolling-update.aws.k8spro.com".
I0904 20:00:07.848531 56145 instancegroups.go:298] Validating the cluster.
I0904 20:00:17.859751 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
I0904 20:00:28.460135 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
I0904 20:00:39.008876 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
I0904 20:00:49.580757 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
I0904 20:01:00.203131 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
I0904 20:01:10.806363 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
I0904 20:01:21.839388 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
I0904 20:01:32.442137 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
I0904 20:01:43.099803 56145 instancegroups.go:322] Cluster did not validate, and waiting longer: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out.
error validating cluster after removing a node: cluster validation failed: cannot get nodes for "rolling-update.aws.k8spro.com": Get https://api.rolling-update.aws.k8spro.com/api/v1/nodes: dial tcp 34.203.223.161:443: getsockopt: operation timed out
As mentioned in the CLI help:
--validate-retries int The number of times that a node will be validated. Between validation kops sleeps the master-interval/2 or node-interval/2 duration. (default 8)
And we actually performed nine validations, not eight (presumably the initial validation plus the eight retries). This is the expected behavior.
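For reference, with a 2s interval kops should sleep 2s/2 = 1s between validations; the ~10-11s gaps between the "did not validate" lines above presumably come from the ~10s dial timeout on each failed API call plus that 1s sleep. If a wider validation window is wanted while keeping short intervals, the retry count quoted above can be raised; a sketch, not a tested fix:
$ kops rolling-update cluster --node-interval 2s --master-interval 2s --validate-retries 20 --yes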