To diagnose a node with a bad GPU (ip-10-1-69-242 in this example) on ParallelCluster, do the following:
- Run the NVIDIA reset command, where 0 is the device index (as shown by nvidia-smi) of the GPU you want to reset:
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
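
The step above can be sketched as a small script that first lists the GPU indexes on the node and then issues the reset. This is only an illustrative sketch: the names NODE, GPU_INDEX, DRY_RUN, run, and gpu_reset_steps are hypothetical and not part of ParallelCluster or Slurm. It defaults to a dry run that prints the commands instead of executing them, and it assumes nothing is still using the GPU when the reset is actually run.

```shell
#!/bin/sh
set -eu

# Hypothetical settings; override them from the environment as needed.
NODE="${NODE:-ip-10-1-69-242}"   # node with the suspect GPU
GPU_INDEX="${GPU_INDEX:-0}"      # device index as reported by nvidia-smi
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

gpu_reset_steps() {
  # List index, name, and UUID of each GPU on the node to confirm the index.
  run srun -w "$NODE" nvidia-smi --query-gpu=index,name,uuid --format=csv
  # Reset the selected GPU (same command as in the step above).
  run srun -w "$NODE" sudo nvidia-smi --gpu-reset -i "$GPU_INDEX"
}

gpu_reset_steps
```

Once the printed commands look right, setting DRY_RUN=0 runs them for real, for example: DRY_RUN=0 GPU_INDEX=0 sh ./gpu_reset.sh (the script name is a placeholder).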