
@sean-smith
sean-smith / bad-gpu-pc.md
Created May 2, 2024 16:23
Diagnose GPU Failures

Diagnose GPU Failures on ParallelCluster

To diagnose a node with a bad GPU (for example, ip-10-1-69-242) on ParallelCluster, do the following:

  1. Run the NVIDIA GPU reset command, where 0 is the device index (as shown by nvidia-smi) of the GPU you want to reset:
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
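
Because a GPU reset is disruptive, it can help to preview the exact command before running it. The following is a minimal sketch, not part of the original gist: the function name `reset_gpu` and the `DRY_RUN` variable are hypothetical, and the only cluster command it uses is the `srun ... nvidia-smi --gpu-reset -i` line shown above.

```bash
#!/bin/bash
# Hypothetical helper (an assumption, not from the original gist): builds the
# reset command shown above and only prints it unless DRY_RUN=0 is set.
reset_gpu() {
  local node="$1" index="$2"
  # Reject anything that is not a non-negative integer device index.
  if ! [[ "$index" =~ ^[0-9]+$ ]]; then
    echo "error: GPU index must be a non-negative integer, got '$index'" >&2
    return 1
  fi
  local cmd=(srun -w "$node" sudo nvidia-smi --gpu-reset -i "$index")
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    echo "${cmd[*]}"   # print only; nothing is executed
  else
    "${cmd[@]}"        # actually run the reset on the node
  fi
}
```

For example, `reset_gpu ip-10-1-69-242 0` prints the command, and `DRY_RUN=0 reset_gpu ip-10-1-69-242 0` runs it.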
@sean-smith
sean-smith / resize-ebs.md
Created April 25, 2024 19:42
Resize EBS Volume

Ran out of EBS space on an EC2 instance?

  1. Make sure the instance's IAM role has the arn:aws:iam::aws:policy/AmazonEC2FullAccess policy attached.

  2. Create a script called resize.sh with the following contents:

#!/bin/bash

# Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.
SIZE=${1:-20}
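
The `${1:-20}` expansion above falls back to 20 when the first argument is unset or empty. As a sketch of how that default could be paired with a sanity check, the hypothetical function `parse_size` below (an assumption, not part of the original script) applies the same default and rejects non-numeric input.

```bash
#!/bin/bash
# Hypothetical helper (not in the original resize.sh): mirrors SIZE=${1:-20}
# and validates that the result is a positive integer number of GiB.
parse_size() {
  local size="${1:-20}"   # default to 20 GiB when no argument is given
  if ! [[ "$size" =~ ^[1-9][0-9]*$ ]]; then
    echo "error: size must be a positive integer (GiB), got '$size'" >&2
    return 1
  fi
  echo "$size"
}
```

Inside a script this could be used as `SIZE="$(parse_size "$1")" || exit 1`.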
@sean-smith
sean-smith / install_nccl.md
Last active April 22, 2024 23:20
Install NCCL

Install NCCL on a Cluster

To install NCCL on the cluster, we need to install it on every node in the /opt/nccl directory. To do this, we'll create a script and then run it on all nodes using the srun command.

  1. Create a script ./install-nccl.sh and make it executable with chmod +x install-nccl.sh:
#!/bin/bash

# install nccl

Install AWS OFI NCCL

  1. Change into the shared directory:
cd /fsx
  2. Create a script install-nccl-aws-ofi.sh to install AWS OFI NCCL:
@sean-smith
sean-smith / torch_distributed.py
Created March 4, 2024 21:03
This is a fork of Meta's torch_distributed.py that works on SageMaker HyperPod
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#
import os
import sys
#!/bin/bash
# run as root, then validate with:
# chronyc sources -v
# chronyc tracking
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html#configure-time-sync
apt install -y chrony
sed -i '/\# See http:\/\/www.pool.ntp.org\/join.html for more information./a server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4\npool time.aws.com iburst' /etc/chrony/chrony.conf
systemctl enable --now chrony
/etc/init.d/chrony restart
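
The `sed` line above only inserts the server entry if the expected comment line exists in chrony.conf, so it can silently do nothing. A minimal sketch of a check for that case follows; the function name `has_time_sync_server` is hypothetical, and the default path is the one used in the script above.

```bash
#!/bin/bash
# Hypothetical helper (an assumption, not from the original script): returns 0
# if the given chrony config contains a "server 169.254.169.123" line.
has_time_sync_server() {
  local conf="${1:-/etc/chrony/chrony.conf}"
  grep -Eq '^[[:space:]]*server[[:space:]]+169\.254\.169\.123([[:space:]]|$)' "$conf"
}
```

For example, `has_time_sync_server || echo "169.254.169.123 is not configured"` could be run before validating with `chronyc sources -v`.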

Activate virtualenvs with python

  1. Install virtualenvwrapper, my favorite way of creating virtualenvs:
sudo apt-get install virtualenvwrapper
  2. Install it on the compute nodes as well, where 4 is the number of compute nodes:
@sean-smith
sean-smith / python-3-10.md
Last active January 30, 2024 17:50
Python 3.10 on HyperPod

Ubuntu 20.04

  1. Create a script install-python.sh with the following content:
#!/bin/bash

sudo apt update 
sudo apt upgrade -y
sudo apt install software-properties-common -y 

Switch Enroot to NVMe

  1. Create a file enroot_nvme.sh:
#!/bin/bash

# Change the /etc/enroot/enroot.conf file to use local NVMe storage:
#ENROOT_RUNTIME_PATH        /tmp/enroot/user-$(id -u) -> /opt/dlami/nvme/enroot/user-$(id -u)
#ENROOT_CONFIG_PATH         ${HOME}/enroot
#ENROOT_CACHE_PATH          /opt/enroot
@sean-smith
sean-smith / instance-id-slurm.md
Last active March 13, 2024 18:29
Get instance ID to hostname mapping from a Slurm job.

Slurm Get Instance ID to Hostname

Update: you only need the following:

mpirun -N 1 -n 2 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'
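
The command above prints one `hostname: instance-id` line per node. If that output is saved, a single node's instance ID can be looked up with a small filter. This is a sketch under that assumption; the function name `instance_id_for` and the file name in the usage note are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper (not part of the original gist): reads
# "hostname: instance-id" lines on stdin and prints the instance ID for the
# requested hostname. Exits non-zero if the hostname is not present.
instance_id_for() {
  local host="$1"
  awk -F': ' -v h="$host" '$1 == h { print $2; found = 1 } END { exit !found }'
}
```

For example, after redirecting the mpirun output to a file such as mapping.txt, `instance_id_for ip-10-1-69-242 < mapping.txt` prints that node's instance ID.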
  1. Create a file get-instance-id.sh: