@superseb
Last active February 7, 2024 09:08
Rancher 2.x custom cluster YAML quicker node failure detection (k8s 1.13)
services:
  kubelet:
    extra_args:
      node-status-update-frequency: 4s
  kube-api:
    extra_args:
      default-not-ready-toleration-seconds: 30
      default-unreachable-toleration-seconds: 30
  kube-controller:
    extra_args:
      node-monitor-period: 2s
      node-monitor-grace-period: 16s
      pod-eviction-timeout: 30s
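
As a rough back-of-the-envelope timeline (my own summary, not part of the original gist, and assuming taint-based evictions so that the toleration seconds govern when pods are removed):

    node-monitor-grace-period            16s   node marked NotReady/unreachable
  + default-*-toleration-seconds         30s   not-ready/unreachable tolerations expire
  ------------------------------------------------
  ~ 46s until pods start being evicted, versus roughly 40s + 300s with the upstream defaults.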
@voarsh2

voarsh2 commented Jun 4, 2021

Very helpful.

@Whitespirit0

Interesting article to understand the flow of events: Kubernetes Tip: How To Make Kubernetes React Faster When Nodes Fail?

@Florianisme

We have created an RKE2 Cluster from Rancher. Where do we paste this config in?
Is it under Cluster Management -> Edit as YAML in Rancher? I haven't seen this cluster.yml structure anywhere in our cluster

@voarsh2

voarsh2 commented Feb 8, 2023

We have created an RKE2 Cluster from Rancher. Where do we paste this config in? Is it under Cluster Management -> Edit as YAML in Rancher? I haven't seen this cluster.yml structure anywhere in our cluster

I believe this is for RKE1

@superseb
Author

superseb commented Feb 9, 2023

This is indeed for RKE1, RKE2 configuration can be found here https://docs.rke2.io/reference/server_config and via Rancher https://ranchermanager.docs.rancher.com/reference-guides/cluster-configuration/rancher-server-configuration/rke2-cluster-configuration
@voarsh2

voarsh2 commented Feb 18, 2023

This is indeed for RKE1, RKE2 configuration can be found here https://docs.rke2.io/reference/server_config and via Rancher https://ranchermanager.docs.rancher.com/reference-guides/cluster-configuration/rancher-server-configuration/rke2-cluster-configuration

Haha, yeah, I guessed it wouldn't work. I was hoping for more of a "just add this to kubelet arguments"?

Also, RKE1 had a nice way to increase max pods to something like 500 per node. From those links I can't work out how to increase the pod limit - the default is far too conservative. I run out of pods way before server resources... only got a measly 110 per node.
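
For a standalone RKE2 server (outside Rancher), the equivalent settings would presumably go into /etc/rancher/rke2/config.yaml per the server config reference linked above; a minimal sketch, not verified here, could look like:

  kube-apiserver-arg:
    - default-not-ready-toleration-seconds=30
    - default-unreachable-toleration-seconds=30
  kube-controller-manager-arg:
    - node-monitor-period=2s
    - node-monitor-grace-period=16s
  kubelet-arg:
    - node-status-update-frequency=4s
    - max-pods=200

After editing the file, the rke2-server service would need a restart for the arguments to take effect.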

@superseb
Author

Here is a Rancher RKE2 example

spec:
  rkeConfig:
    machineGlobalConfig:
      kube-apiserver-arg:
        - '--default-not-ready-toleration-seconds=30'
        - '--default-unreachable-toleration-seconds=30'
      kube-controller-manager-arg:
        - '--node-monitor-period=2s'
        - '--node-monitor-grace-period=16s'
        - '--pod-eviction-timeout=30s'
    machineSelectorConfig:
      - config:
          kubelet-arg:
            - '--node-status-update-frequency=4s'
            - '--max-pods=200'
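
One way to verify the arguments actually landed on a server node (an assumption on my part, using the default RKE2 paths) is to grep the static pod manifests and the kubelet process:

  grep -- '--default-not-ready-toleration-seconds' /var/lib/rancher/rke2/agent/pod-manifests/kube-apiserver.yaml
  grep -- '--node-monitor-grace-period' /var/lib/rancher/rke2/agent/pod-manifests/kube-controller-manager.yaml
  ps aux | grep 'kubelet .*--node-status-update-frequency'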

@voarsh2

voarsh2 commented Aug 3, 2023

spec:
  rkeConfig:
    machineGlobalConfig:
      kube-apiserver-arg:
        - '--default-not-ready-toleration-seconds=30'
        - '--default-unreachable-toleration-seconds=30'
      kube-controller-manager-arg:
        - '--node-monitor-period=2s'
        - '--node-monitor-grace-period=16s'
        - '--pod-eviction-timeout=30s'
    machineSelectorConfig:
      - config:
          kubelet-arg:
            - '--node-status-update-frequency=4s'
            - '--max-pods=200'

Thanks @superseb

@patan32

patan32 commented Jan 10, 2024

Here is a Rancher RKE2 example

spec:
  rkeConfig:
    machineGlobalConfig:
      kube-apiserver-arg:
        - '--default-not-ready-toleration-seconds=30'
        - '--default-unreachable-toleration-seconds=30'
      kube-controller-manager-arg:
        - '--node-monitor-period=2s'
        - '--node-monitor-grace-period=16s'
        - '--pod-eviction-timeout=30s'
    machineSelectorConfig:
      - config:
          kubelet-arg:
            - '--node-status-update-frequency=4s'
            - '--max-pods=200'

Hello,

I am wondering how I can apply this to my RKE2 cluster. When I go to the cluster in Rancher I can't see an Edit YAML button. Any help is appreciated.

[screenshot attached]

@superseb
Author

@patan32 You probably want to check rancher/rancher#43918; depending on what versions you are using, it could be the old or new chosen behavior, or a new bug.

@Zappelphilipp

Zappelphilipp commented Feb 6, 2024

Here is a Rancher RKE2 example

spec:
  rkeConfig:
    machineGlobalConfig:
      kube-apiserver-arg:
        - '--default-not-ready-toleration-seconds=30'
        - '--default-unreachable-toleration-seconds=30'
      kube-controller-manager-arg:
        - '--node-monitor-period=2s'
        - '--node-monitor-grace-period=16s'
        - '--pod-eviction-timeout=30s'
    machineSelectorConfig:
      - config:
          kubelet-arg:
            - '--node-status-update-frequency=4s'
            - '--max-pods=200'

Tried this on my Rancher RKE2-based cluster - cannot recommend - it crashed my master nodes, or at least they did not want to apply the settings. The master nodes were stuck on "waiting for kube-controller". The failed nodes told me:

journalctl -xeu rke2-server.service
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Reconciling ETCDSnapshotFile resources"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Tunnel server egress proxy mode: agent"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting managed etcd node metadata controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting k3s.cattle.io/v1, Kind=Addon controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Creating deploy event broadcaster"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting /v1, Kind=Node controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Cluster dns configmap already exists"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Labels and annotations have been set successfully on node: rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Starting /v1, Kind=Secret controller"
Feb 06 16:22:33 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:33+01:00" level=info msg="Updating TLS secret for kube-system/rke2-serving (count: 16): map[listener.cattle.io/cn-10.11.55.170:10.11.55.170 listener.cattle.io/cn->
Feb 06 16:22:36 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: time="2024-02-06T16:22:36+01:00" level=info msg="Running kube-proxy --cluster-cidr=10.42.0.0/16 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout->
Feb 06 16:25:52 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:25:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Feb 06 16:28:52 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:28:52 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".
Feb 06 16:32:12 rocky-v10-pool3-rocky-prod-v1-feb2ae08-6h8wb rke2[905139]: 2024/02/06 16:32:12 ERROR: [transport] Client received GoAway with error code ENHANCE_YOUR_CALM and debug data equal to ASCII "too_many_pings".

EDIT: found the problem: pod-eviction-timeout was deprecated in 1.25 (kubernetes/website#39681).
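
Based on that finding, a variant without the deprecated flag (a sketch only, not tested, relying on the default-*-toleration-seconds settings to control eviction timing on newer Kubernetes) would be:

  spec:
    rkeConfig:
      machineGlobalConfig:
        kube-apiserver-arg:
          - '--default-not-ready-toleration-seconds=30'
          - '--default-unreachable-toleration-seconds=30'
        kube-controller-manager-arg:
          - '--node-monitor-period=2s'
          - '--node-monitor-grace-period=16s'
      machineSelectorConfig:
        - config:
            kubelet-arg:
              - '--node-status-update-frequency=4s'
              - '--max-pods=200'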
