@sean-smith
sean-smith / resize-ebs.md
Created April 25, 2024 19:42
Resize EBS Volume

Ran out of EBS space on an EC2 instance?

  1. Make sure the instance has the arn:aws:iam::aws:policy/AmazonEC2FullAccess managed policy attached (for example, through its IAM role).

  2. Create a script called resize.sh with the following contents:

#!/bin/bash

# Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.
SIZE=${1:-20}
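
The `${1:-20}` expansion uses the first argument when it is set and non-empty, and falls back to 20 otherwise. As a minimal sketch (the `parse_size` helper is hypothetical and not part of the original script), the same default can be combined with a check that the value is a positive integer:

```shell
#!/bin/bash

# Hypothetical helper: print the requested size in GiB, defaulting to 20.
# Returns non-zero if the argument contains a non-digit or starts with 0.
parse_size() {
  local size=${1:-20}
  case $size in
    *[!0-9]*|0*)
      echo "size must be a positive integer (GiB), got: $size" >&2
      return 1
      ;;
  esac
  echo "$size"
}
```

For example, `SIZE=$(parse_size "$1") || exit 1` would keep the default behavior while stopping early on input such as `abc`.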
sean-smith / install_nccl.md
Last active April 22, 2024 23:20
Install NCCL

Install NCCL on a Cluster

NCCL must be installed on every node of the cluster, in the /opt/nccl directory. To do this, we'll create a script and then run it on all nodes with the srun command.

  1. Create a script ./install-nccl.sh and make it executable (chmod +x):
#!/bin/bash

# install nccl

Install AWS OFI NCCL

  1. Change into the shared directory
cd /fsx
  1. Create a script install-nccl-aws-ofi.sh to install AWS OFI NCCL:
sean-smith / torch_distributed.py
Created March 4, 2024 21:03
This is a fork of Meta's torch_distributed.py that works on SageMaker HyperPod
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#
import os
import sys
#!/bin/bash
# run as root, then validate with:
# chronyc sources -v
# chronyc tracking
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html#configure-time-sync
apt install -y chrony
sed -i '/\# See http:\/\/www.pool.ntp.org\/join.html for more information./a server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4\npool time.aws.com iburst' /etc/chrony/chrony.conf
systemctl enable --now chrony
/etc/init.d/chrony restart
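
Because the `sed` command above appends the two lines every time it matches the marker comment, running the script twice leaves duplicate entries. A minimal sketch of an idempotent variant follows; the `add_aws_time_source` helper is hypothetical, and unlike the original it appends at the end of the file rather than after the marker comment:

```shell
#!/bin/bash

# Hypothetical helper: append the Amazon Time Sync Service entries to a chrony
# config file only if the link-local server is not already listed.
add_aws_time_source() {
  local conf=${1:-/etc/chrony/chrony.conf}
  if grep -q '^server 169\.254\.169\.123 ' "$conf"; then
    return 0  # already configured, nothing to do
  fi
  printf '%s\n' \
    'server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4' \
    'pool time.aws.com iburst' >> "$conf"
}
```

Called with no argument it targets /etc/chrony/chrony.conf, so it would still need to run as root and be followed by a chrony restart.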

Activate virtualenvs with python

  1. Install virtualenvwrapper, my favorite way of creating virtualenvs:
sudo apt-get install virtualenvwrapper
  1. Install it on the compute nodes as well, where 4 is the number of compute nodes:
sean-smith / python-3-10.md
Last active January 30, 2024 17:50
Python 3.10 on HyperPod

Ubuntu 20.04

  1. Create a script install-python.sh with the following content:
#!/bin/bash

sudo apt update 
sudo apt upgrade -y
sudo apt install software-properties-common -y 

Switch Enroot to NVMe

  1. Create a file enroot_nvme.sh:
#!/bin/bash

# Change the /etc/enroot/enroot.conf file to use local nvme storage:
#ENROOT_RUNTIME_PATH        /tmp/enroot/user-$(id -u) -> /opt/dlami/nvme/enroot/user-$(id -u)
#ENROOT_CONFIG_PATH         ${HOME}/enroot
#ENROOT_CACHE_PATH          /opt/enroot
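
The comments above describe repointing `ENROOT_RUNTIME_PATH` at the local NVMe mount. As a rough sketch of that one change only (the `set_enroot_runtime_path` helper is hypothetical, GNU `sed` is assumed, and this is not the author's script), the line can be rewritten in place whether or not it is commented out:

```shell
#!/bin/bash

# Hypothetical helper: set ENROOT_RUNTIME_PATH in the given Enroot config file.
# The single quotes keep $(id -u) literal so it is written to the file as-is.
set_enroot_runtime_path() {
  local conf=${1:-/etc/enroot/enroot.conf}
  local target='/opt/dlami/nvme/enroot/user-$(id -u)'
  # Match the setting with or without a leading '#', then replace the whole line.
  sed -i "s|^#\{0,1\}ENROOT_RUNTIME_PATH.*|ENROOT_RUNTIME_PATH        ${target}|" "$conf"
}
```

The sketch does not create the target directory or touch the other settings, so those steps would still have to be handled separately.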
sean-smith / instance-id-slurm.md
Last active March 13, 2024 18:29
Get instance ID to hostname mapping from a Slurm job.

Slurm Get Instance ID to Hostname

Update: you only need the following:

mpirun -N 1 -n 2 bash -c 'echo $(hostname): $(cat /sys/devices/virtual/dmi/id/board_asset_tag | tr -d " ")'
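
The one-liner above reads the instance ID from the DMI asset tag and strips spaces from it. The same per-node step can be sketched as a small function; `instance_id_line` is a hypothetical name, and the optional file argument is an assumption added only so the function can be pointed at another file:

```shell
#!/bin/bash

# Hypothetical helper: print "<hostname>: <instance id>" for the local node.
# Defaults to the DMI asset tag path read by the mpirun command above.
instance_id_line() {
  local tag_file=${1:-/sys/devices/virtual/dmi/id/board_asset_tag}
  echo "$(hostname): $(tr -d ' ' < "$tag_file")"
}
```
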
  1. Create a file get-instance-id.sh:
sean-smith / enroot_pyxis.sh
Last active January 3, 2024 00:10
Installs Enroot and Pyxis (+optional hooks) on ParallelCluster
#!/bin/bash
# Copyright 2021 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License"). You may not use this file except in compliance
# with the License. A copy of the License is located at
#
# http://aws.amazon.com/apache2.0/
#
# or in the "LICENSE.txt" file accompanying this file. This file is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES
# OR CONDITIONS OF ANY KIND, express or implied. See the License for the specific language governing permissions and