How to Set Kubernetes Resource Requests and Limits - A Saga to Improve Cluster Stability and Efficiency
It all started on September 1st, right after our cluster upgrade from 1.11 to 1.12. Almost the next day, we began to see kubelet alerts reported by Datadog. On some days we would get a few (3 to 5) of them; on other days we would get more than 10. The alert monitor is based on the Datadog check kubernetes.kubelet.check, and it is triggered whenever the kubelet process is down on a node.
We know kubelet plays an important role in Kubernetes scheduling. When it is not running properly on a node, that node is effectively removed from the functional cluster, and the more nodes that have a problematic kubelet, the more the cluster degrades. Now, imagine waking up to