When a GPU fails without producing a critical XID error in the kernel log (examples of critical XIDs: 48, 74, 79, 94, 95), the NVIDIA device plugin does not detect the failure. The node continues advertising nvidia.com/gpu: 1, no taint or cordon is applied, and the scheduler keeps sending GPU pods to the broken node. Those pods start, fail with CUDA errors (invalid device ordinal), and exit, but the node remains eligible for more GPU workloads. This document reproduces the issue end-to-end on an OpenShift 4.21 cluster on AWS using a g4dn.xlarge instance (NVIDIA Tesla T4).
The failure was simulated by removing the GPU from the PCI bus:
echo 1 > /sys/bus/pci/devices/0000:00:1e.0/remove
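
To confirm that the simulated failure took effect, the two sides of the problem can be checked separately: whether the device is still present on the PCI bus, and what the node still advertises. The following is a minimal sketch, not part of the original reproduction. The helper names pci_device_present and node_gpu_allocatable and the optional sysfs-root argument are hypothetical, introduced only for illustration, and the second helper assumes an oc session that is already logged in to the cluster.

```shell
#!/bin/sh
# PCI address of the GPU, taken from the remove command above.
GPU_BDF="0000:00:1e.0"

# Returns 0 if the PCI device is still listed under sysfs, non-zero otherwise.
# The second argument (sysfs root) is a hypothetical parameter that defaults
# to /sys; it exists only so the check can be tried against a scratch directory.
pci_device_present() {
    bdf="$1"
    sysfs_root="${2:-/sys}"
    [ -d "${sysfs_root}/bus/pci/devices/${bdf}" ]
}

# Prints the nvidia.com/gpu count the node still reports as allocatable.
# Assumption: `oc` is authenticated; the caller supplies the node name.
node_gpu_allocatable() {
    oc get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}
```

Run on the node itself (for example from a debug shell), pci_device_present "$GPU_BDF" should fail once the device has been removed. Called from a workstation with the node name as its argument, node_gpu_allocatable would be expected to still print 1 if the behaviour described above holds.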