- Detect permanent node problems and set Node Conditions using the Node Problem Detector.
- Configure Draino to cordon and drain nodes when they exhibit the Node Problem Detector's KernelDeadlock condition, or a variant of KernelDeadlock we call VolumeTaskHung.
- Let the Cluster Autoscaler scale down underutilised nodes, including the nodes Draino has drained.
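
The selection step in the list above can be sketched as follows. This is a minimal illustration, not Draino's actual implementation: the `Node` and `NodeCondition` classes and the `should_drain` and `nodes_to_drain` helpers are hypothetical names introduced here, and the sketch assumes a node qualifies when any of its conditions has one of the configured types with status `"True"`, following the Kubernetes convention for condition status values.

```python
from dataclasses import dataclass, field

# Condition types named in the list above. VolumeTaskHung is the custom
# variant of KernelDeadlock; both are assumed to be set by the
# Node Problem Detector.
DRAIN_CONDITIONS = frozenset({"KernelDeadlock", "VolumeTaskHung"})


@dataclass
class NodeCondition:
    """Hypothetical stand-in for a Kubernetes node condition."""
    type: str
    status: str  # "True", "False", or "Unknown"


@dataclass
class Node:
    """Hypothetical stand-in for a Kubernetes node."""
    name: str
    conditions: list = field(default_factory=list)


def should_drain(node, drain_conditions=DRAIN_CONDITIONS):
    """Return True if the node has any configured condition with status "True"."""
    return any(
        c.type in drain_conditions and c.status == "True"
        for c in node.conditions
    )


def nodes_to_drain(nodes, drain_conditions=DRAIN_CONDITIONS):
    """Return the names of nodes that would be cordoned and drained."""
    return [n.name for n in nodes if should_drain(n, drain_conditions)]


if __name__ == "__main__":
    nodes = [
        Node("node-a", [NodeCondition("Ready", "True")]),
        Node("node-b", [NodeCondition("KernelDeadlock", "True")]),
        Node("node-c", [NodeCondition("VolumeTaskHung", "False")]),
    ]
    print(nodes_to_drain(nodes))  # ['node-b']
```

Only `node-b` is selected: `node-a` has no matching condition type, and `node-c` has a matching type whose status is not `"True"`.
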
Note: Draino will log nothing and export no metrics until it actually drains a node.
Once the Descheduler supports descheduling pods based on taints, Draino could be replaced by the Descheduler running in combination with the scheduler's TaintNodesByCondition functionality.
See kubernetes-sigs/descheduler#131