The following Slurm Prolog starts the CUDA MPS server on each compute node before the job is started.
cat << EOF > /opt/slurm/etc/prolog.sh
#!/bin/sh
# start mps
nvidia-cuda-mps-control -d
FSx for NetApp ONTAP is a multi-protocol filesystem. It mounts on Windows over SMB, on Linux over NFS, and on macOS. This lets cluster users share one filesystem between their Windows and Linux machines, for example to run a post-processing workflow that spans both.
Pros
First, set up Mountpoint by following the instructions in the docs.
Next, we'll tell Mountpoint for Amazon S3 to cache on the 28TB of local NVMe storage available on each P5 instance.
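As a rough illustration of the caching step, the sketch below builds a mount-s3 command line that uses the --cache option to point the local cache at a directory on NVMe. The bucket name, mount point, and cache path are all hypothetical placeholders, and the /opt/dlami/nvme location is an assumption about where the instance store is mounted; check your own instance before using it. The script only prints the command so it can be reviewed first.

```shell
#!/bin/bash
# Hypothetical values: replace with your own bucket, mount point, and NVMe path.
BUCKET="my-training-data"               # hypothetical bucket name
MOUNT_DIR="/mnt/s3"                     # hypothetical mount point
CACHE_DIR="/opt/dlami/nvme/s3-cache"    # assumed location of the local NVMe storage

# Print the mount-s3 invocation instead of running it, so it can be checked first.
# --cache tells Mountpoint to keep object data it has read in the given directory.
mount_cmd() {
  printf 'mount-s3 --cache %s %s %s\n' "$CACHE_DIR" "$BUCKET" "$MOUNT_DIR"
}

mount_cmd
```

Once the printed command looks right, create the cache and mount directories with mkdir -p and run the command on the instance.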
Make sure the instance has arn:aws:iam::aws:policy/AmazonEC2FullAccess permissions.
Create a script called resize.sh with the following contents:
#!/bin/bash
# Specify the desired volume size in GiB as a command line argument. If not specified, default to 20 GiB.
SIZE=${1:-20}
cd /fsx
install-nccl-aws-ofi.sh to install AWS OFI NCCL:
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
#
import os
import sys
#!/bin/bash
# run as root, then validate with:
# chronyc sources -v
# chronyc tracking
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/set-time.html#configure-time-sync
apt install -y chrony
sed -i '/\# See http:\/\/www.pool.ntp.org\/join.html for more information./a server 169.254.169.123 prefer iburst minpoll 4 maxpoll 4\npool time.aws.com iburst' /etc/chrony/chrony.conf
systemctl enable --now chrony
/etc/init.d/chrony restart