@sean-smith
Last active October 20, 2022 14:11

Dynamic Filesystems with AWS ParallelCluster

You can dynamically create a filesystem per job. This is useful for jobs that need a fast filesystem when you don't want to pay to keep that filesystem running 24/7.

To avoid wasting compute time while the filesystem is created (roughly 8-15 minutes), we've separated this into three jobs:

  1. Create the filesystem. This only needs a single EC2 instance and can run on the head node. It takes 8-15 minutes.
  2. Run the job. This mounts the filesystem first, then executes the job.
  3. Delete the filesystem.

Jobs mount the filesystem under:

/fsx/$PROJECT_NAME

This allows mounting multiple filesystems on the same cluster, one for each job or project.
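
Because the project name ends up inside a filesystem path, it can help to validate it before using it. The following is a minimal sketch, not part of the original scripts: the function names valid_project_name and project_mount_point are hypothetical, and the /fsx root simply mirrors the path shown above.

```bash
#!/bin/bash
# Hypothetical helpers for deriving a per-project mount point.
FSX_ROOT="/fsx"

# Accept only names that are safe as a single path component:
# non-empty, not starting with a dot, and limited to letters,
# digits, underscore, dot, and hyphen.
valid_project_name() {
  case "$1" in
    ""|.*|*[!A-Za-z0-9_.-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Print the mount point for a project, or fail on an unsafe name.
project_mount_point() {
  valid_project_name "$1" || return 1
  printf '%s/%s\n' "$FSX_ROOT" "$1"
}
```

For example, project_mount_point test prints /fsx/test, while a name such as ../etc is rejected.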

Setup

0. Create a Cluster

First, create a cluster that has the arn:aws:iam::aws:policy/AmazonFSxFullAccess IAM policy attached.

To do so, add the IAM policy under HeadNode > Advanced options > IAM Policies:

ParallelClusterManager

[Screenshot: fsx_policy]

CLI

  Iam:
    AdditionalIamPolicies:
      - Policy: arn:aws:iam::aws:policy/AmazonFSxFullAccess

You'll need to do the same for the ComputeNodes section, since the submit and delete jobs call the FSx API from the compute nodes they run on.

1. create-filesystem.sbatch script

First, create a script that provisions the filesystem and waits for it to become available:

#!/bin/bash
#SBATCH -n 1
#SBATCH --time=00:30:00 # fail if filesystem takes more than 30 mins to create

PROJECT_NAME=$1

# get subnet and region from instance metadata
# (no session token is sent, so this assumes IMDSv1 is enabled; uses the first interface)
INTERFACE=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/ | head -n 1)
INTERFACE=${INTERFACE%/} # drop the trailing slash from the listing, if present
SUBNET_ID=$(curl --silent http://169.254.169.254/latest/meta-data/network/interfaces/macs/${INTERFACE}/subnet-id)
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION=${AZ::-1} # strip the zone letter, e.g. us-east-1a -> us-east-1

# create filesystem
filesystem_id=$(aws fsx --region "$REGION" create-file-system --file-system-type LUSTRE --storage-capacity 1200 --subnet-ids "$SUBNET_ID" --lustre-configuration DeploymentType=SCRATCH_2 --query "FileSystem.FileSystemId" --output text)

# wait for it to become available
status=$(aws fsx --region "$REGION" describe-file-systems --file-system-ids "$filesystem_id" --query "FileSystems[0].Lifecycle" --output text)
while [ "$status" != "AVAILABLE" ]
do
  # stop early instead of polling until the time limit if creation failed
  if [ "$status" = "FAILED" ]; then
    echo "$filesystem_id failed to create" >&2
    exit 1
  fi
  echo "$filesystem_id is $status..."
  sleep 2
  status=$(aws fsx --region "$REGION" describe-file-systems --file-system-ids "$filesystem_id" --query "FileSystems[0].Lifecycle" --output text)
done

# record the filesystem id in a per-project config file
mkdir -p /opt/parallelcluster
echo "filesystem_id=${filesystem_id}" > "/opt/parallelcluster/$PROJECT_NAME"
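
The wait loop can also be written as a reusable function with a bounded number of polls and explicit handling of failure states. This is only a sketch under stated assumptions: wait_for_available and fsx_lifecycle are hypothetical names, the defaults (90 polls, 20 seconds apart) are chosen to match the 30 minute job time limit, and the list of failure states is a subset of the lifecycle values FSx reports.

```bash
#!/bin/bash
# Sketch: poll a status command until the filesystem is usable.
# $1: name of a command or function that prints the lifecycle state
# $2: maximum number of polls (default 90)
# $3: seconds between polls (default 20)
# Returns 0 on AVAILABLE, 1 on a failure state, 2 on timeout.
wait_for_available() {
  local get_status=$1
  local max_attempts=${2:-90}
  local delay=${3:-20}
  local attempt=0
  local status

  while [ "$attempt" -lt "$max_attempts" ]; do
    status=$("$get_status")
    case "$status" in
      AVAILABLE) return 0 ;;
      FAILED|DELETING|MISCONFIGURED)
        echo "filesystem entered state $status" >&2
        return 1 ;;
    esac
    echo "filesystem is ${status}, waiting..."
    sleep "$delay"
    attempt=$((attempt + 1))
  done

  echo "timed out waiting for filesystem" >&2
  return 2
}

# Example status hook wrapping the describe-file-systems call used in
# the script. Expects REGION and filesystem_id to be set by the caller.
fsx_lifecycle() {
  aws fsx --region "$REGION" describe-file-systems \
    --file-system-ids "$filesystem_id" \
    --query "FileSystems[0].Lifecycle" --output text
}
```

Inside the create script this could be called as wait_for_available fsx_lifecycle || exit 1, so that a failed or stuck creation fails the job and the dependent jobs do not start.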

2. submit.sbatch script

Next, create a Slurm submission script that mounts the filesystem and then runs your job:

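#!/bin/bash

PROJECT_NAME=$1
source "/opt/parallelcluster/$PROJECT_NAME" # sets filesystem_id

# get region
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION=${AZ::-1}

# get filesystem information
filesystem_dns=$(aws fsx --region "$REGION" describe-file-systems --file-system-ids "$filesystem_id" --query "FileSystems[0].DNSName" --output text)
filesystem_mountname=$(aws fsx --region "$REGION" describe-file-systems --file-system-ids "$filesystem_id" --query "FileSystems[0].LustreConfiguration.MountName" --output text)

# create mount dir (sudo, since /fsx sits at the filesystem root)
sudo mkdir -p "/fsx/$PROJECT_NAME"

# mount filesystem
sudo mount -t lustre -o noatime,flock "$filesystem_dns@tcp:/$filesystem_mountname" "/fsx/$PROJECT_NAME"

# run your job here, reading and writing under /fsx/$PROJECT_NAME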
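
If the job may be rerun on a node where the filesystem is already mounted, the mount step can be made idempotent. The sketch below is illustrative only: lustre_mount_source and mount_project_fs are hypothetical names, and the guard against "None" assumes the AWS CLI's text output, which prints None for a null field.

```bash
#!/bin/bash
# Build the Lustre mount source "<dns>@tcp:/<mountname>".
# Fails if either value is empty or the CLI returned "None".
lustre_mount_source() {
  local dns=$1
  local mountname=$2
  case "$dns" in ""|None) return 1 ;; esac
  case "$mountname" in ""|None) return 1 ;; esac
  printf '%s@tcp:/%s\n' "$dns" "$mountname"
}

# Mount the filesystem unless the target is already a mount point.
# $1: mount source from lustre_mount_source, $2: target directory
mount_project_fs() {
  local source=$1
  local target=$2
  if mountpoint -q "$target"; then
    echo "$target is already mounted"
    return 0
  fi
  sudo mkdir -p "$target"
  sudo mount -t lustre -o noatime,flock "$source" "$target"
}
```

A call might look like mount_project_fs "$(lustre_mount_source "$filesystem_dns" "$filesystem_mountname")" "/fsx/$PROJECT_NAME".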

3. delete-filesystem.sbatch script

# quote EOF so the variables below are written literally rather than expanded now
cat > delete-filesystem.sbatch << 'EOF'
#!/bin/bash
#SBATCH -n 1

PROJECT_NAME=$1
source "/opt/parallelcluster/$PROJECT_NAME" # sets filesystem_id

# get region
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION=${AZ::-1}

# delete filesystem (this command takes the singular --file-system-id option)
aws fsx --region "$REGION" delete-file-system --file-system-id "$filesystem_id"

# remove project config
rm "/opt/parallelcluster/$PROJECT_NAME"
EOF
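
One possible refinement is to remove the project config only after the delete call succeeds, so that a failed deletion leaves the filesystem id on disk for a retry. This is a sketch, not the gist's method: delete_project_fs is a hypothetical name, and it reads the id with sed rather than sourcing the file. Note that aws fsx delete-file-system takes the singular --file-system-id option.

```bash
#!/bin/bash
# Delete the filesystem recorded in a project config file.
# $1: path to the config file containing "filesystem_id=<id>"
# $2: AWS region
# The config file is removed only if the delete call succeeds.
delete_project_fs() {
  local config=$1
  local region=$2
  local filesystem_id

  if [ ! -f "$config" ]; then
    echo "no project config at $config" >&2
    return 1
  fi

  # Extract the id without executing the file's contents.
  filesystem_id=$(sed -n 's/^filesystem_id=//p' "$config")
  if [ -z "$filesystem_id" ]; then
    echo "no filesystem_id found in $config" >&2
    return 1
  fi

  aws fsx --region "$region" delete-file-system \
    --file-system-id "$filesystem_id" || return 1

  rm -f "$config"
}
```

In the delete job this could be invoked as delete_project_fs "/opt/parallelcluster/$PROJECT_NAME" "$REGION".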

Submit

$ PROJECT_NAME=test
$ sbatch create-filesystem.sbatch $PROJECT_NAME
Submitted batch job 1
$ sbatch -p od -d afterok:1 submit.sbatch $PROJECT_NAME
Submitted batch job 2
$ sbatch -p od -d afterany:2 delete-filesystem.sbatch $PROJECT_NAME
Submitted batch job 3
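
The job ids above will differ on a real cluster, so it can be convenient to capture them instead of typing them by hand. The sketch below assumes the three script names used in this gist and the od partition from the example; submit_pipeline is a hypothetical name. It relies on sbatch --parsable, which prints only the job id (optionally followed by ;cluster), and uses afterany for the delete step so cleanup is scheduled once the job has terminated, whatever its exit state.

```bash
#!/bin/bash
# Submit the create, run, and delete jobs as a dependency chain.
# $1: project name, $2: partition for the run and delete jobs (default "od")
submit_pipeline() {
  local project=$1
  local partition=${2:-od}
  local create_id run_id delete_id

  # Create the filesystem first.
  create_id=$(sbatch --parsable create-filesystem.sbatch "$project") || return 1
  create_id=${create_id%%;*} # strip an optional ";cluster" suffix

  # Run the job only if the filesystem was created successfully.
  run_id=$(sbatch --parsable -p "$partition" -d "afterok:${create_id}" \
    submit.sbatch "$project") || return 1
  run_id=${run_id%%;*}

  # Delete the filesystem once the job has terminated, in any state.
  delete_id=$(sbatch --parsable -p "$partition" -d "afterany:${run_id}" \
    delete-filesystem.sbatch "$project") || return 1
  delete_id=${delete_id%%;*}

  echo "create=$create_id run=$run_id delete=$delete_id"
}
```

Calling submit_pipeline test from the directory containing the three scripts submits the whole chain in one step.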