Set Up Mountpoint CSI Driver

First, set up Mountpoint by following the instructions in the docs.

Steps to set up NVMe with Mountpoint:

Next, we'll tell Mountpoint for Amazon S3 to cache on the 28 TB of local NVMe storage available on each P5 instance.

  1. Mount the NVMe disks as a single mount. This needs to be done on each P5 instance:
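The exact commands for this step are not shown here, so the following is only a minimal sketch of one common approach: striping the disks into a software RAID 0 array with mdadm and mounting it. The helper name `plan_nvme_raid0`, the array device `/dev/md0`, the mount point `/mnt/nvme`, and the example device names are all assumptions for illustration. The function only prints the commands so they can be reviewed before being run on a node.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints (does not run) the commands that would
# stripe several NVMe devices into one RAID 0 array and mount it.
# Usage: plan_nvme_raid0 <md-device> <mount-point> <nvme-device> <nvme-device> [...]
plan_nvme_raid0() {
    if [ "$#" -lt 4 ]; then
        echo "usage: plan_nvme_raid0 <md-device> <mount-point> <dev> <dev> [...]" >&2
        return 1
    fi
    md_device=$1
    mount_point=$2
    shift 2
    # "$#" is now the number of member devices, "$*" the device list.
    echo "sudo mdadm --create $md_device --level=0 --raid-devices=$# $*"
    echo "sudo mkfs.ext4 $md_device"
    echo "sudo mkdir -p $mount_point"
    echo "sudo mount $md_device $mount_point"
}

# Example with two assumed device names; check lsblk for the real ones.
plan_nvme_raid0 /dev/md0 /mnt/nvme /dev/nvme1n1 /dev/nvme2n1
```

Creating the array and filesystem destroys any data on the member disks, which is why the sketch stops at printing the plan.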
Diagnose GPU Failures

Diagnose GPU Failures on ParallelCluster

To diagnose a node with a bad GPU (for example, ip-10-1-69-242) on ParallelCluster, do the following:

  1. Run the NVIDIA reset command, where 0 is the device index, as shown by nvidia-smi, of the GPU you want to reset:
srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
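When the same reset has to be issued for different nodes or device indices, a small wrapper can help avoid typos. The sketch below is illustrative only: `gpu_reset_cmd` is a hypothetical helper name, and it prints the command rather than running it, so the output can be checked before it is pasted into a shell on the head node.

```shell
#!/bin/sh
# Hypothetical helper: build the srun + nvidia-smi reset command for one GPU.
# Usage: gpu_reset_cmd <node-name> <gpu-index>
gpu_reset_cmd() {
    node=$1
    gpu_index=$2
    if [ -z "$node" ]; then
        echo "node name is required" >&2
        return 1
    fi
    # The index must be a non-negative integer, as listed by nvidia-smi.
    case $gpu_index in
        ''|*[!0-9]*)
            echo "GPU index must be a non-negative integer: '$gpu_index'" >&2
            return 1
            ;;
    esac
    echo "srun -w $node sudo nvidia-smi --gpu-reset -i $gpu_index"
}

# Prints the same command as the step above.
gpu_reset_cmd ip-10-1-69-242 0
```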
Resize EBS Volume

Ran out of EBS space on an EC2 instance?

  1. Make sure the instance role has the arn:aws:iam::aws:policy/AmazonEC2FullAccess policy attached.

  2. Create a script with the following contents:


# Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.
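The comment above describes an optional size argument with a 20 GiB default. A minimal sketch of that argument handling follows; the function name `desired_size_gib` is hypothetical, and the rest of the resize script is not reproduced here.

```shell
#!/bin/sh
# Hypothetical helper: echo the requested volume size in GiB.
# Falls back to 20 when no argument (or an empty one) is given.
desired_size_gib() {
    size=${1:-20}
    # Reject anything that is not a whole number.
    case $size in
        *[!0-9]*)
            echo "size must be a whole number of GiB: '$size'" >&2
            return 1
            ;;
    esac
    echo "$size"
}

# With no argument this prints the default.
desired_size_gib
```

Inside a script, the same function would typically be called as `SIZE=$(desired_size_gib "$1")`.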
Install NCCL

Install NCCL on a Cluster

NCCL must be installed on every node of the cluster, in the /opt/nccl directory. To do this, we'll create a script and then run it on all nodes with the srun command.

  1. Create an install script and make it executable with chmod +x:

# install nccl


  1. Change into the shared directory
cd /fsx
  1. Create a script to install AWS OFI NCCL:
This is a fork of Meta's that works on SageMaker HyperPod
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
import sys
# run as root, then validate with:
# chronyc sources -v
# chronyc tracking
# see
apt install -y chrony
sed -i '/\# See http:\/\/\/join.html for more information./a server prefer iburst minpoll 4 maxpoll 4\npool iburst' /etc/chrony/chrony.conf
systemctl enable --now chrony
/etc/init.d/chrony restart

Activate virtualenvs with python

  1. Install Virtualenvwrapper - this is my favorite way of creating virtualenvs
sudo apt-get install virtualenvwrapper
  1. Install it on the compute nodes as well, where 4 is the number of compute nodes:
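The command for this step is not shown above. As a sketch, assuming a Slurm cluster where `srun -N <count>` launches one task on each of `<count>` nodes, the install from the previous step could be repeated across the compute nodes. The helper name `srun_on_nodes_cmd` is hypothetical, and it only prints the command.

```shell
#!/bin/sh
# Hypothetical helper: build an srun command that runs the given command
# once on each of <node-count> nodes.
# Usage: srun_on_nodes_cmd <node-count> <command> [args...]
srun_on_nodes_cmd() {
    node_count=$1
    # The node count must be a positive integer.
    case $node_count in
        ''|*[!0-9]*|0)
            echo "node count must be a positive integer: '$node_count'" >&2
            return 1
            ;;
    esac
    shift
    if [ "$#" -eq 0 ]; then
        echo "a command to run is required" >&2
        return 1
    fi
    echo "srun -N $node_count $*"
}

# 4 is the number of compute nodes, as in the step above.
srun_on_nodes_cmd 4 sudo apt-get install -y virtualenvwrapper
```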
Python 3.10 on HyperPod

Ubuntu 20.04

  1. Create a script with the following content:

sudo apt update
sudo apt upgrade -y
sudo apt install software-properties-common -y

Switch Enroot to NVME

  1. Create a file

# Change the /etc/enroot/enroot.conf file to use local nvme storage:
#ENROOT_RUNTIME_PATH        /tmp/enroot/user-$(id -u) -> /opt/dlami/nvme/enroot/user-$(id -u)
#ENROOT_CONFIG_PATH         ${HOME}/enroot
#ENROOT_CACHE_PATH          /opt/enroot
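The comments above describe pointing ENROOT_RUNTIME_PATH at the local NVMe mount. One way to make that edit is sketched below. The function name `set_enroot_runtime_path` is hypothetical, and the sketch assumes the setting appears on its own line in the config file, either commented out or active.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the ENROOT_RUNTIME_PATH line of an enroot
# config file, whether or not the line is currently commented out.
# Usage: set_enroot_runtime_path <conf-file> <new-path>
set_enroot_runtime_path() {
    conf_file=$1
    new_path=$2
    tmp_file=$(mktemp) || return 1
    # '|' is the sed delimiter because the path contains '/'.
    sed -E "s|^#?ENROOT_RUNTIME_PATH[[:space:]].*|ENROOT_RUNTIME_PATH ${new_path}|" \
        "$conf_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }
    # Copy the contents back so the original file's owner and mode are kept.
    cat "$tmp_file" > "$conf_file"
    rm -f "$tmp_file"
}

# Example (run as root). Single quotes keep $(id -u) literal so that enroot,
# not this shell, expands it:
#   set_enroot_runtime_path /etc/enroot/enroot.conf '/opt/dlami/nvme/enroot/user-$(id -u)'
```

Trying the function on a copy of the config file first makes it easy to diff the result before touching /etc/enroot/enroot.conf.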