```yaml
services:
  kubelet:
    extra_args:
      node-status-update-frequency: 4s
  kube-api:
    extra_args:
      default-not-ready-toleration-seconds: 30
      default-unreachable-toleration-seconds: 30
  kube-controller:
    extra_args:
      node-monitor-period: 2s
      node-monitor-grace-period: 16s
      pod-eviction-timeout: 30s
```
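Putting the numbers together, these settings imply a rough worst-case failover timeline: the controller marks a dead node NotReady after `node-monitor-grace-period`, and pods are evicted once the default toleration expires. A small sketch of that arithmetic (an approximation, not an exact guarantee):

```python
# Rough worst-case failover timeline implied by the settings above.
# Values are taken from the config; the sum approximates how long a
# pod stays bound to a dead node before eviction begins.
node_status_update_frequency = 4   # kubelet posts node status every 4s
node_monitor_grace_period = 16     # controller marks node NotReady after 16s
not_ready_toleration = 30          # pods tolerate NotReady for 30s

# Approximate time from node death until pod eviction starts:
worst_case = node_monitor_grace_period + not_ready_toleration
print(worst_case)  # ~46 seconds, versus ~340s with Kubernetes defaults
```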
This is indeed for RKE1. The RKE2 configuration can be found at https://docs.rke2.io/reference/server_config, and via Rancher at https://ranchermanager.docs.rancher.com/reference-guides/cluster-configuration/rancher-server-configuration/rke2-cluster-configuration
Haha, yeah, I guessed it wouldn't work. I was hoping for more of a "just add this to the kubelet arguments"?
Also, RKE1 had a nice way to increase max pods to something like 500 per node. From those links I can't work out how to increase the pod limit, and the default is far too conservative: I run out of pods well before server resources at a measly 110 per node.
Here is a Rancher RKE2 example
```yaml
spec:
  rkeConfig:
    machineGlobalConfig:
      kube-apiserver-arg:
        - '--default-not-ready-toleration-seconds=30'
        - '--default-unreachable-toleration-seconds=30'
      kube-controller-manager-arg:
        - '--node-monitor-period=2s'
        - '--node-monitor-grace-period=16s'
        - '--pod-eviction-timeout=30s'
    machineSelectorConfig:
      - config:
          kubelet-arg:
            - '--node-status-update-frequency=4s'
            - '--max-pods=200'
```
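For a standalone RKE2 install (not provisioned through Rancher), the same flags would presumably go into the server's `/etc/rancher/rke2/config.yaml` instead. This is an untested sketch based on the server config reference linked above:

```yaml
# Hypothetical /etc/rancher/rke2/config.yaml equivalent on RKE2 server
# nodes; based on https://docs.rke2.io/reference/server_config
kube-apiserver-arg:
  - "default-not-ready-toleration-seconds=30"
  - "default-unreachable-toleration-seconds=30"
kube-controller-manager-arg:
  - "node-monitor-period=2s"
  - "node-monitor-grace-period=16s"
kubelet-arg:
  - "node-status-update-frequency=4s"
  - "max-pods=200"
```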
Thanks @superseb
Hello,
I am wondering how I can apply this to my RKE2 cluster? When I go to the cluster in Rancher I can't see an Edit YAML button. Any help is appreciated.
@patan32 You probably want to check rancher/rancher#43918; depending on which versions you are using, it could be old/new intended behavior or a new bug.
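If the Edit YAML button stays unavailable in the UI, one workaround is editing the provisioning cluster object directly with kubectl against the Rancher management cluster. A sketch (the cluster name `my-cluster` is an assumption; substitute your own):

```shell
# Rancher v2 provisioning clusters live in the fleet-default namespace
# of the local (management) cluster; edit the spec.rkeConfig there.
kubectl -n fleet-default edit clusters.provisioning.cattle.io my-cluster
```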
Tried this on my Rancher RKE2-based cluster and cannot recommend it: it crashed my master nodes, or at least would not apply the settings. The master nodes got stuck on "waiting for kube-controller", and the failed nodes logged:
```
journalctl -xeu rke2-server.service
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Tunnel server egress proxy mode: agent"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting managed etcd node metadata controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting k3s.cattle.io/v1, Kind=Addon controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Creating deploy event broadcaster"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting /v1, Kind=Node controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Cluster dns configmap already exists"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Labels and annotations have been set successfully on node: rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting /v1, Kind=Secret controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 16): map[listener.cattle.io/cn-10.11.55.170:10.11.55.170 listener.cattle.io/cn->
Feb 06 16:22:36 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:36+01:00" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout->
Feb 06 16:25:52 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:25:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Feb 06 16:28:52 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:28:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Feb 06 16:32:12 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:32:12 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
```
EDIT: found the problem: `pod-eviction-timeout` was deprecated in 1.25 (kubernetes/website#39681).
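Given that, a version of the earlier example with the deprecated flag dropped would presumably look like the following. This is a hedged sketch: on recent Kubernetes, eviction timing is driven by the taint-based tolerations (`default-not-ready-toleration-seconds` / `default-unreachable-toleration-seconds`) rather than `pod-eviction-timeout`, so removing the flag should keep the intended ~30s eviction behavior.

```yaml
# Same Rancher RKE2 example, minus the deprecated --pod-eviction-timeout.
spec:
  rkeConfig:
    machineGlobalConfig:
      kube-apiserver-arg:
        - '--default-not-ready-toleration-seconds=30'
        - '--default-unreachable-toleration-seconds=30'
      kube-controller-manager-arg:
        - '--node-monitor-period=2s'
        - '--node-monitor-grace-period=16s'
    machineSelectorConfig:
      - config:
          kubelet-arg:
            - '--node-status-update-frequency=4s'
            - '--max-pods=200'
```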