This is a short write-up explaining how to verify the fix for Issue 200 reported against the NVIDIA device plugin: NVIDIA/k8s-device-plugin#200
This issue occurs when the NVIDIA device plugin is configured to restrict GPU access to privileged containers only (rather than allowing unprivileged containers to access GPUs they did not request). A detailed write-up on this aspect is described here.
Issue #200 is specifically observed on IaaS clouds where VMs can be stopped and later restarted: pods that had GPUs assigned can fail, because in a cloud environment different physical GPUs may be attached to the VM after a restart. The root cause was that the device plugin only supported enumerating GPUs to containers by UUID; UUIDs are unique per physical GPU, so they can change when a VM is restarted and new GPUs are attached. The fix added a new option, `deviceIDStrategy`, that lets the plugin identify devices by index instead of UUID, so device IDs remain stable across VM restarts.
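As a sketch of how this option is applied in practice, the fragment below sets the strategy to `index` via the plugin's DaemonSet arguments. The container name and image tag here are assumptions for illustration; check the plugin release notes for the first version that ships the `--device-id-strategy` flag.

```yaml
# Fragment of the nvidia-device-plugin DaemonSet spec (names/tags assumed).
# --device-id-strategy=index makes the plugin advertise GPUs by index
# rather than UUID, so IDs survive a VM stop/restart cycle.
containers:
  - name: nvidia-device-plugin-ctr
    image: nvcr.io/nvidia/k8s-device-plugin:v0.7.0   # assumed tag
    args:
      - "--device-id-strategy=index"
```

If the plugin is deployed via its Helm chart, the equivalent is typically a chart value (e.g. `--set deviceIDStrategy=index`); the chart value name is an assumption based on the flag name and should be confirmed against the chart's documentation.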