Sean Smith sean-smith

## raspberry-pi-image.md

      
Last active: December 21, 2024 03:19

Description: Create an image for Raspberry Pi on macOS

Copy Raspberry ISO on macOS


Based on https://opensource.com/article/21/7/custom-raspberry-pi-image


First, image the Raspberry Pi and make any changes you need over SSH. When you're done, install Cockpit:

sudo apt install cockpit

  
## g6e-p5.md

      
Last active: October 17, 2024 00:27

Cross EC2/HyperPod cluster


This guide shows how to launch g6e instances in EC2 and connect to a HyperPod cluster over SSH. The instances live in the same AZ and mount the same filesystems as the cluster. We assume you already have a HyperPod cluster that you created by following the workshop content.


Launch the g6e instances in the same VPC as your filesystem, in the public subnet created by the initial CloudFormation template, so you can SSH directly into each host using its public IP or hostname.

Make sure you're in your local environment:
exit # exit the cluster to local environment
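
As a rough illustration of the launch step described above, the sketch below only prints an `aws ec2 run-instances` command so you can review it before running it yourself. The helper name `build_run_instances_cmd`, the default instance size `g6e.xlarge`, and every ID passed to it are assumptions for illustration; they do not come from the workshop templates.

```shell
#!/bin/bash
# Sketch: print (do not run) an `aws ec2 run-instances` command for a g6e instance.
# All IDs used here are hypothetical placeholders; substitute your own values.

build_run_instances_cmd() {
  local subnet_id="$1"                    # public subnet from the CloudFormation stack
  local sg_id="$2"                        # security group that allows SSH
  local key_name="$3"                     # name of an existing EC2 key pair
  local ami_id="$4"                       # AMI to boot
  local instance_type="${5:-g6e.xlarge}"  # assumed default size

  printf 'aws ec2 run-instances --image-id %s --instance-type %s --subnet-id %s --security-group-ids %s --key-name %s --associate-public-ip-address --count 1\n' \
    "$ami_id" "$instance_type" "$subnet_id" "$sg_id" "$key_name"
}

# Example: review the printed command, then run it from your local environment.
build_run_instances_cmd subnet-EXAMPLE sg-EXAMPLE my-key ami-EXAMPLE
```

Passing the subnet explicitly is what keeps the instance in the same VPC and AZ as the cluster's filesystem, since a subnet belongs to exactly one of each.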

  
## s3-mountpoint-eks.md

      
Last active: May 7, 2024 23:55

Set up the Mountpoint CSI driver.

First, set up Mountpoint by following the instructions in the docs.

Steps to set up NVMe with Mountpoint:

Next, we'll tell Mountpoint for Amazon S3 to cache on the 28TB of local NVMe storage available on each P5 instance.

Mount the NVMe disks as a single mount. This needs to be done on each P5 instance:


## bad-gpu-pc.md

      
Created: May 2, 2024 16:23

Diagnose GPU Failures on ParallelCluster

To diagnose a node with a bad GPU (ip-10-1-69-242 in this example) on ParallelCluster, do the following:

Run the NVIDIA reset command, where 0 is the device index, as shown by nvidia-smi, of the GPU you want to reset:

srun -w ip-10-1-69-242 sudo nvidia-smi --gpu-reset -i 0
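
If you reset GPUs on different nodes often, a small wrapper can guard against typos in the node name or device index. This is a minimal sketch: `gpu_reset_cmd` is a hypothetical helper name, and it only prints the command from above rather than running it.

```shell
#!/bin/bash
# Sketch: build the reset command for one GPU on one node, validating inputs first.
# `gpu_reset_cmd` is a hypothetical helper, not part of Slurm or the NVIDIA tooling.

gpu_reset_cmd() {
  local node="$1" index="$2"
  if [[ -z "$node" ]]; then
    echo "error: node name required" >&2
    return 1
  fi
  # The device index must be a non-negative integer, as listed by nvidia-smi.
  if ! [[ "$index" =~ ^[0-9]+$ ]]; then
    echo "error: GPU index must be a non-negative integer" >&2
    return 1
  fi
  echo "srun -w $node sudo nvidia-smi --gpu-reset -i $index"
}

# Prints the command for GPU 0 on the example node; run the output once it looks right.
gpu_reset_cmd ip-10-1-69-242 0
```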

  
## resize-ebs.md

      
Created: April 25, 2024 19:42

Description: Resize EBS Volume

Run out of EBS space on an EC2 instance?


Make sure the instance's IAM role has the arn:aws:iam::aws:policy/AmazonEC2FullAccess policy attached.


Create a script called resize.sh with the following contents:


#!/bin/bash

# Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.
SIZE=${1:-20}
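
The `${1:-20}` expansion above falls back to 20 when the first argument is missing or empty. The sketch below shows that behavior in isolation and adds a basic sanity check; `resolve_size` is a hypothetical helper for illustration and is not part of the original script.

```shell
#!/bin/bash
# Sketch: resolve the target size from the first argument, defaulting to 20 GiB.
# `resolve_size` is a hypothetical helper; the original script assigns SIZE directly.

resolve_size() {
  local size="${1:-20}"   # falls back to 20 when $1 is unset or empty
  # Reject anything that is not a positive whole number of GiB.
  if ! [[ "$size" =~ ^[1-9][0-9]*$ ]]; then
    echo "error: size must be a positive integer (GiB), got '$size'" >&2
    return 1
  fi
  echo "$size"
}

resolve_size        # no argument: uses the default
resolve_size 50     # explicit size
```

Validating up front avoids passing a malformed size to later steps of the resize.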

  
## install_nccl.md

      
Last active: April 22, 2024 23:20

Install NCCL on a Cluster

To install NCCL on the cluster, we need to install it into the /opt/nccl directory on every node. To do this, we'll create a script and then run it on all nodes using the srun command.

Create a script ./install-nccl.sh and make it executable with chmod +x install-nccl.sh:

#!/bin/bash

# install nccl
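
Once the script exists, running it once per node with srun might look like the sketch below. The helper `install_on_all_nodes_cmds` and the node count of 2 are assumptions for illustration. It also assumes the script lives on a filesystem shared by all nodes, since srun resolves the path on each node.

```shell
#!/bin/bash
# Sketch: print the commands that make a script executable and run it once per node.
# The node count is an assumption; set it to the number of compute nodes in your cluster.

install_on_all_nodes_cmds() {
  local script="$1" num_nodes="$2"
  if ! [[ "$num_nodes" =~ ^[1-9][0-9]*$ ]]; then
    echo "error: node count must be a positive integer" >&2
    return 1
  fi
  echo "chmod +x $script"
  # -N sets the number of nodes; one task per node runs the script exactly once on each.
  echo "srun -N $num_nodes --ntasks-per-node=1 $script"
}

# Prints the two commands for a hypothetical 2-node cluster.
install_on_all_nodes_cmds ./install-nccl.sh 2
```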

  
## install-aws-ofi-nccl.md

      
Last active: April 22, 2024 23:22

Install AWS OFI NCCL


Change into the shared directory:

cd /fsx

Create a script install-nccl-aws-ofi.sh to install AWS OFI NCCL:


## torch_distributed.py
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#


import os
import sys

## set-amazon-ntp.sh
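#!/bin/bash
# Point chrony at the Amazon Time Sync Service (Debian/Ubuntu).
# Run as root, then validate with:
#   chronyc sources -v
#   chronyc tracking
# See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html#configure-time-sync

# Install chrony.
apt install -y chrony
# Add the link-local Amazon Time Sync server and the public AWS pool right after the
# pool.ntp.org comment in chrony.conf. This is a no-op if that comment line is absent.
sed -i '/\# See http:\/\/www.pool.ntp.org\/join.html for more information./a server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4\npool time.aws.com iburst' /etc/chrony/chrony.conf
# Start chrony now and on every boot, then restart it so the new config is loaded.
systemctl enable --now chrony
systemctl restart chrony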
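
# Note: running the sed command above more than once adds the server line again each time.
# A minimal sketch of an idempotent alternative follows. `ensure_amazon_ntp` is a
# hypothetical helper, and unlike the sed command it appends to the end of the file
# instead of inserting after the pool.ntp.org comment.

```shell
#!/bin/bash
# Sketch: add the Amazon Time Sync server line to a chrony config only if it is missing.
# `ensure_amazon_ntp` is a hypothetical helper; pass the config path as the first argument.

ensure_amazon_ntp() {
  local conf="$1"
  local line='server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4'
  # -F fixed string, -x whole-line match, -q quiet: succeed only on an exact existing line.
  if grep -Fxq "$line" "$conf"; then
    return 0
  fi
  printf '%s\n' "$line" >> "$conf"
}

# Demo against a temporary file rather than /etc/chrony/chrony.conf.
demo_conf="$(mktemp)"
ensure_amazon_ntp "$demo_conf"
ensure_amazon_ntp "$demo_conf"   # second call finds the line and changes nothing
grep -c '169.254.169.123' "$demo_conf"
rm -f "$demo_conf"
```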

## virtualenv.md

      
Last active: February 8, 2024 02:00

Activate virtualenvs with python


Install virtualenvwrapper, my favorite way of creating virtualenvs:

sudo apt-get install virtualenvwrapper

Install it on the compute nodes as well, where 4 is the number of compute nodes: