Diagnose GPU Failures on ParallelCluster

To diagnose a node with a bad GPU (for example, ip-10-1-69-242) on ParallelCluster, do the following (a consolidated sketch of the procedure follows the list):

  1. Run the NVIDIA GPU reset command, where 0 is the device index (as shown by nvidia-smi) of the GPU you want to reset:
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
  2. If that doesn't succeed, generate a bug report:
srun -w ip-10-1-69-242 nvidia-bug-report.sh
  3. Grab the instance id:
srun -w ip-10-1-69-242 cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " "
  4. Grab the output of nvidia-bug-report.sh (to attach to a support case), then replace the instance by terminating it, where <instance-id> is the instance id from the previous step:
aws ec2 terminate-instances \
    --instance-ids <instance-id>
  5. ParallelCluster will re-launch the instance and you'll see a new instance come up in the EC2 console.
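
The steps above can be wrapped into a single script. The sketch below is a minimal, hedged example: it assumes it runs from the ParallelCluster head node with srun and a configured AWS CLI available, and the node name and GPU index defaults are just the examples used above.

```bash
#!/bin/bash
# Minimal sketch of the diagnosis/replacement flow described above.
# Assumptions: run on the ParallelCluster head node, srun and the AWS CLI
# are available, and the defaults below are adjusted to your cluster.
set -euo pipefail

NODE="${1:-ip-10-1-69-242}"   # compute node with the bad GPU
GPU_INDEX="${2:-0}"           # device index reported by nvidia-smi

# 1. Try resetting the GPU first.
if srun -w "$NODE" sudo nvidia-smi --gpu-reset -i "$GPU_INDEX"; then
    echo "GPU $GPU_INDEX on $NODE reset successfully."
    exit 0
fi

# 2. Reset failed: collect a bug report (nvidia-bug-report.log.gz is
#    written to the working directory on the compute node).
srun -w "$NODE" nvidia-bug-report.sh

# 3. Look up the EC2 instance id backing the node.
INSTANCE_ID=$(srun -w "$NODE" cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")
echo "Terminating $INSTANCE_ID ($NODE)..."

# 4. Terminate the instance; ParallelCluster launches a replacement.
aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"
```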
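Besides the EC2 console, you can also confirm the replacement from the head node with the Slurm scheduler that ParallelCluster uses; for example, watching node states until the replacement registers and returns to the idle state:

```bash
watch -n 30 'sinfo -N -l'
```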