sean-smith / slurm-mps-prolog.md
Last active August 27, 2024 05:40
Start CUDA MPS Server on each node

👾 Slurm CUDA MPS Prolog

The following Slurm Prolog starts the CUDA MPS server on each compute node before the job is started.

cat << 'EOF' > /opt/slurm/etc/prolog.sh
#!/bin/sh

# start the CUDA MPS daemon
nvidia-cuda-mps-control -d
EOF
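
The snippet is cut off here; to finish the setup you'd typically make the script executable and point Slurm at it. A minimal sketch, assuming slurm.conf lives at /opt/slurm/etc/slurm.conf:

chmod +x /opt/slurm/etc/prolog.sh
echo "Prolog=/opt/slurm/etc/prolog.sh" >> /opt/slurm/etc/slurm.conf
scontrol reconfigure   # push the new Prolog setting out to slurmd on the compute nodes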

Mount FSx for NetApp ONTAP with AWS ParallelCluster

FSx for NetApp ONTAP is a multi-protocol filesystem: it mounts on Windows via SMB, on Linux via NFS, and on macOS as well. This allows cluster users to bridge their Windows and Linux machines with the same filesystem, potentially running both Windows and Linux machines in a post-processing workflow.

Pros

  • Multi-Protocol
  • Hybrid support
  • Multi-AZ (for High Availability)
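
For example, an ONTAP volume can be NFS-mounted on a Linux compute node roughly like this (a sketch; the SVM DNS name and the /vol1 junction path are placeholders, not values from this gist):

sudo mkdir -p /fsx
sudo mount -t nfs svm-0123456789abcdef0.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com:/vol1 /fsx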

Set up the Mountpoint CSI driver

First, set up Mountpoint following the instructions in the docs.

Steps to set up NVMe with Mountpoint:

Next we'll tell S3 Mountpoint to cache on the 28 TB of local NVMe storage available on each P5 instance.

  1. Mount the NVMe disks as a single volume; this needs to be done on each P5 instance (see the sketch below):
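
A sketch of that step, assuming the eight local NVMe devices enumerate as /dev/nvme1n1 through /dev/nvme8n1 and the cache should live at /scratch (device names vary, so check lsblk first):

# stripe the instance-store NVMe devices into one RAID 0 array
sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=8 \
    /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 \
    /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1 /dev/nvme8n1
sudo mkfs.ext4 /dev/md0
sudo mkdir -p /scratch
sudo mount /dev/md0 /scratch

With the array mounted, Mountpoint can be told to cache there via its --cache flag (my-bucket is a placeholder):

mount-s3 --cache /scratch my-bucket /mnt/s3
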
sean-smith / bad-gpu-pc.md
Created May 2, 2024 16:23
Diagnose GPU Failures

Diagnose GPU Failures on ParallelCluster

To diagnose a node with a bad GPU, ip-10-1-69-242, on ParallelCluster, do the following:

  1. Run the NVIDIA reset command, where 0 is the device index (as shown by nvidia-smi) of the GPU you want to reset:
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
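
If the reset doesn't help, a common follow-up (not part of the original gist) is to check that node's kernel log for NVIDIA Xid errors:

srun -w ip-10-1-69-242 bash -c 'sudo dmesg -T | grep -i xid'
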
sean-smith / resize-ebs.md
Created April 25, 2024 19:42
Resize EBS Volume

Run out of EBS space on an EC2 instance?

  1. Make sure the instance has the arn:aws:iam::aws:policy/AmazonEC2FullAccess policy attached.

  2. Create a script called resize.sh with the following contents:

#!/bin/bash

# Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.
SIZE=${1:-20}
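
The gist is cut off here; a minimal sketch of how such a script might continue, assuming IMDSv2, a default AWS CLI region, and an ext4 root partition on /dev/nvme0n1p1:

# look up this instance and its root EBS volume via instance metadata
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-id)
VOLUME_ID=$(aws ec2 describe-instances --instance-ids $INSTANCE_ID \
    --query "Reservations[0].Instances[0].BlockDeviceMappings[0].Ebs.VolumeId" --output text)

# grow the volume, then the partition, then the filesystem
aws ec2 modify-volume --volume-id $VOLUME_ID --size $SIZE
sleep 30   # give the volume modification time to take effect
sudo growpart /dev/nvme0n1 1
sudo resize2fs /dev/nvme0n1p1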

Install AWS OFI NCCL

  1. Change into the shared directory:
cd /fsx
  2. Create a script install-nccl-aws-ofi.sh to install AWS OFI NCCL (a sketch of its contents follows below):
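
The script body is cut off in this snippet; a sketch of a typical build, assuming EFA's libfabric under /opt/amazon/efa and CUDA under /usr/local/cuda:

#!/bin/bash
# build the AWS OFI NCCL plugin from source into /fsx/aws-ofi-nccl
git clone https://github.com/aws/aws-ofi-nccl.git
cd aws-ofi-nccl
./autogen.sh
./configure --prefix=/fsx/aws-ofi-nccl \
            --with-libfabric=/opt/amazon/efa \
            --with-cuda=/usr/local/cuda
make -j $(nproc)
make install
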
sean-smith / install_nccl.md
Last active April 22, 2024 23:20
Install NCCL

Install NCCL on a Cluster

To install NCCL on the cluster, we'll install it on every node under the /opt/nccl directory. To do this, we'll create a script and then run it on all nodes using the srun command.

  1. Create a script ./install-nccl.sh and make it executable with chmod +x install-nccl.sh (a sketch of the full script follows below):
#!/bin/bash

# install nccl
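
The rest of the script is cut off in this snippet; a sketch of a typical NCCL source build into /opt/nccl (the CUDA path is an assumption):

git clone https://github.com/NVIDIA/nccl.git /tmp/nccl
cd /tmp/nccl
make -j src.build CUDA_HOME=/usr/local/cuda
sudo mkdir -p /opt/nccl
sudo cp -r build/include build/lib /opt/nccl/

It can then be run on all nodes with something like srun -N 2 --ntasks-per-node=1 ./install-nccl.sh (node count is an example).
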
sean-smith / instance-id-slurm.md
Last active March 13, 2024 18:29
Get instance ID to hostname mapping from a Slurm job.

Slurm Get Instance ID to Hostname

Update: you only need the following:

mpirun -N 1 -n 2 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'
  1. Create a file get-instance-id.sh:
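
The file contents are cut off in this snippet; a minimal sketch that queries instance metadata (IMDSv1 for brevity):

#!/bin/bash
# print "hostname: instance-id" for the node this task runs on
echo "$(hostname): $(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

Then run it across nodes, e.g. srun -N 2 --ntasks-per-node=1 ./get-instance-id.sh (node count is an example).
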
sean-smith / torch_distributed.py
Created March 4, 2024 21:03
This is a fork of Meta's torch_distributed.py that works on SageMaker HyperPod
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#
import os
import sys
#!/bin/bash
# run as root, then validate with:
# chronyc sources -v
# chronyc tracking
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html#configure-time-sync
apt install -y chrony
sed -i '/\# See http:\/\/www.pool.ntp.org\/join.html for more information./a server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4\npool time.aws.com iburst' /etc/chrony/chrony.conf
systemctl enable --now chrony
/etc/init.d/chrony restart