@neolit123
Last active July 5, 2020 21:13
testing-ha-kinder

testing concurrent join failures with kinder

this is a short guide for testing concurrent HA cluster join with kinder. currently it requires patching both kubeadm and kinder.

you need:

  • go 1.12+
  • docker 18.06 or 18.09 (known to work)
  • clones of kubernetes/kubeadm and kubernetes/kubernetes
  • a host machine with a good chunk of RAM and 2 CPU cores
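a quick way to sanity-check the prerequisites on a Linux host (the version hints in the comments mirror the list above):

```shell
# quick prerequisite check (assumes a Linux host)
go version || echo "go not found: install go 1.12+"
docker --version || echo "docker not found: 18.06/18.09 are known to work"
echo "cpus: $(nproc)"   # want at least 2
```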

PR to add retries to the kubeadm etcd client's MemberAdd call: neolit123/kubernetes#2

this patch fixes errors when the etcd cluster attempts to grow concurrently. you can skip applying it for now if you wish to see the actual errors.

cd kubernetes
# apply patch
wget https://github.com/neolit123/kubernetes/pull/2.diff
git apply 2.diff
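before applying a patch, `git apply --stat` and `git apply --check` let you preview what it touches; the snippet below demonstrates this on a throwaway repo (the repo, file, and diff names are made up for illustration):

```shell
# demonstrate previewing a patch on a scratch repo
repo=$(mktemp -d)
cd "$repo" && git init -q .
printf 'a\n' > demo.txt
git add demo.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm init
printf 'b\n' >> demo.txt
git diff > demo.diff               # produce a patch like the PR .diff files
git checkout -q -- demo.txt        # restore, so the patch is unapplied
git apply --stat demo.diff         # show which files/lines it touches
git apply --check demo.diff        # dry run: errors out if it would not apply
git apply demo.diff                # actually apply
```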

PR to add a --parallel flag to `kinder do`: neolit123/kubeadm#1

this allows concurrent operations like kubeadm join on multiple nodes at once.

cd kubeadm
# apply patch
wget https://github.com/neolit123/kubeadm/pull/1.diff
git apply 1.diff

to build kinder:

# build kinder (it lives in the kinder/ subdirectory of the kubeadm repo)
cd kubeadm/kinder
GO111MODULE=on go build
# resulting binary is `kinder`
# symlink or add to PATH
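one way to do the symlink step (using $HOME/bin is an assumption; any directory on your PATH works):

```shell
# put the freshly built binary on PATH
mkdir -p "$HOME/bin"
ln -sf "$(pwd)/kinder" "$HOME/bin/kinder"
export PATH="$HOME/bin:$PATH"
```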

each time you make kubernetes changes (e.g. patching kubeadm) you need to build a new kind(er) node-image:

cd kubernetes
kinder build node-image --image kinder/node:latest

to create a kinder cluster (node provisioning):

kinder create cluster --image kinder/node:latest --control-plane-nodes 3 --worker-nodes 3
# adjust the number of CP and W nodes if needed

note that if you start getting api-server pod crashes, the host might be out of RAM. 16GB works for a 3 CP + 3 W setup in my testing, but i'm getting errors if i try 4 CP. it's probably a good idea to monitor the memory usage.
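a tiny helper (the function name is made up for this guide) to sample available memory; run it in a loop in a second terminal while the cluster comes up — `docker stats --no-stream` also shows per-node-container usage, since each kinder node is a docker container:

```shell
# print a timestamped memory sample (Linux: reads /proc/meminfo)
mem_sample() {
  awk -v t="$(date +%T)" \
    '/MemAvailable/ {printf "%s  %.1f GiB available\n", t, $2/1048576}' \
    /proc/meminfo
}
mem_sample
```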

to init the primary control plane and join the rest of the nodes:

# init primary CP
kinder do kubeadm-init

# set the KUBECONFIG to the kinder cluster
# technically you can do this once as long as the cluster re-uses the same name (by default "kind")
export KUBECONFIG=$(kinder get kubeconfig-path)

# join the worker and CP nodes.
kinder do kubeadm-join --parallel

in my testing i tried firing the join process right after init (without waiting) to stress test the components. it holds up pretty well.

note that --parallel dumps the output of all joining nodes at once, so their stdout will be interleaved. in a separate tab i keep kubectl get po --all-namespaces running to watch for failing pods.

without the kubeadm patch a CP node join can flake with:

error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: etcdserver: unhealthy cluster

with or without the kubeadm patch a worker node can flake with:

error execution phase kubelet-start: error uploading crisocket: error patching node "kind-worker3" through apiserver: etcdserver: request timed out

to destroy the cluster:

kinder delete cluster

you can also try to reset the nodes instead of re-creating the cluster:

kinder do kubeadm-reset

more docs on kinder: https://github.com/kubernetes/kubeadm/blob/master/kinder/doc/test-HA.md

testing join failures with kind

there is a separate issue (the one we call "the configmap/load-balancer" issue) where a serial CP node join in kind can fail: kubernetes-sigs/kind#588

a couple of differences between kind and kinder:

  • kind uses nginx as the LB and containerd as the container runtime on the nodes.
  • kinder uses haproxy as the LB and docker as the container runtime on the nodes.

write this config to a file called config.yaml:

# a cluster with 3 control-planes and 3 workers
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
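if you want to vary the node counts without hand-editing the YAML, a small shell helper (the function name is made up for this guide; the kind/apiVersion/role values come from the config above) can generate it:

```shell
# generate a kind v1alpha3 config with $1 control-planes and $2 workers
gen_kind_config() {
  cp_nodes="$1"; workers="$2"
  echo "kind: Cluster"
  echo "apiVersion: kind.sigs.k8s.io/v1alpha3"
  echo "nodes:"
  i=0; while [ "$i" -lt "$cp_nodes" ]; do echo "- role: control-plane"; i=$((i+1)); done
  i=0; while [ "$i" -lt "$workers" ];  do echo "- role: worker";        i=$((i+1)); done
}
gen_kind_config 3 3 > config.yaml
```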

the same system specs as the above kinder testing scenarios are required.

clone kubernetes-sigs/kind and kubernetes/kubernetes.

build kind and a node image:

cd kind
GO111MODULE=on go build
# install the kind binary to PATH or symlink
cd kubernetes
kind build node-image --kube-root=$(pwd)

create a cluster:

kind create cluster --config=<path-to-config.yaml> --image kindest/node:latest

this can flake with the following error:

I0604 19:15:10.310249     760 round_trippers.go:438] GET https://172.17.0.2:6443/api/v1/namespaces/kube-system/configmaps/kubeadm-config 401 Unauthorized in 1233 milliseconds
error execution phase preflight: unable to fetch the kubeadm-config ConfigMap: failed to get config map: Unauthorized 

to delete the cluster:

kind delete cluster