Dynamic Filesystems with AWS ParallelCluster

You can dynamically create a filesystem per job. This is useful for jobs that need a fast filesystem when you don't want to pay to keep that filesystem running 24/7, and it gives each job its own isolated filesystem.

To avoid having the job itself wait on filesystem creation (~15 minutes), we've separated this into three separate jobs (see the sketch after the list):

  1. Create the filesystem. This only needs a single EC2 instance and can run on the head node; it takes 8-15 minutes.
  2. Run the job. This first mounts the filesystem, then executes the actual work.
  3. Delete the filesystem.
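
A minimal sketch of chaining the three jobs with Slurm dependencies (the script names create-fsx.sh, job.sh, and delete-fsx.sh are placeholders, not the exact scripts from this gist):

#!/bin/bash
# Sketch: chain the three steps with Slurm job dependencies.
# Step 1: create the filesystem (lightweight; only needs a single instance).
create_id=$(sbatch --parsable -N 1 create-fsx.sh)

# Step 2: run the real job only after the filesystem exists; job.sh mounts it first.
job_id=$(sbatch --parsable --dependency=afterok:${create_id} job.sh)

# Step 3: tear the filesystem down once the job finishes, whether it succeeded or not.
sbatch --dependency=afterany:${job_id} delete-fsx.sh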

👾 Slurm CUDA MPS Prolog

The following Slurm Prolog starts the CUDA MPS server on each compute node before the job is started.

cat << EOF > /opt/slurm/etc/prolog.sh
#!/bin/sh

# start the CUDA MPS daemon on this node
nvidia-cuda-mps-control -d
EOF
chmod +x /opt/slurm/etc/prolog.sh
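
To have Slurm actually run this prolog, it also needs to be registered in slurm.conf. A minimal sketch, assuming the default AWS ParallelCluster Slurm install path:

# Point Slurm at the prolog script (path assumed from the default ParallelCluster install)
echo "Prolog=/opt/slurm/etc/prolog.sh" >> /opt/slurm/etc/slurm.conf

# Tell the controller to pick up the new setting
scontrol reconfigure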

🚀 Wifi


So naturally the first thing I wanted to do when we got fiber internet was to rename the wifi network to something sexier than "CenturyLink0483". I decided on 🚀.

To do so I navigated to the router setup page at 192.168.0.1, cringing at all the '90s tech it employs.

Then I added 🚀 and tried to update.


AWS ParallelCluster Cost Explorer Tags

In a previous gist we discussed using Cost Explorer to see how much a cluster costs at the instance-type and cluster level.

This gist describes how to get cost at the Project and User level.

Setup

1. Create a Policy
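
The policy step is truncated above. Once the user and project tags are activated as cost allocation tags, you can pull per-tag costs with the Cost Explorer API. A rough sketch, where the tag key parallelcluster:project is an assumed name and should be replaced with whatever key you actually use:

# Monthly cost grouped by an (assumed) project tag key
aws ce get-cost-and-usage \
    --time-period Start=2022-04-01,End=2022-05-01 \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --group-by Type=TAG,Key=parallelcluster:project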

Slurm Failover from Spot to On-Demand

In AWS ParallelCluster you can set up a cluster with two queues, one for Spot pricing and one for On-Demand. When a job fails due to a Spot reclamation, you can automatically requeue that job on the On-Demand queue.

To set that up, first create a cluster with a Spot and an On-Demand queue:

- Name: od
    ComputeResources:
      - Name: c6i-od-c6i32xlarge
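
One way to implement the failover (a sketch, not necessarily the mechanism used in the rest of this gist; the spot partition name is an assumption, od comes from the config above):

#!/bin/bash
# Sketch: submit to the Spot queue and fall back to On-Demand if the node is reclaimed.
jobid=$(sbatch --parsable -p spot job.sh)

# Wait for the job to leave the queue.
while squeue -j "$jobid" -h 2>/dev/null | grep -q .; do
    sleep 60
done

# NODE_FAIL is what Slurm records when the Spot node disappears under the job.
state=$(sacct -j "$jobid" -X -n -o State | head -n 1 | xargs)
if [ "$state" = "NODE_FAIL" ]; then
    sbatch -p od job.sh
fi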

How to disable hpc6a.48xlarge cores

Due to the EPYC architecture, it makes more sense to disable specific cores rather than let the scheduler choose which cores to run on. Each Zen 3 core is attached to a compute complex made up of 4 cores plus L2 and L3 cache; by disabling 1, 2, or 3 cores in the same compute complex, we increase the memory bandwidth available to the remaining cores.


To do this, you can run the attached disable-cores.sh script on each instance:
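
The attached script isn't reproduced here, but as a rough sketch of the mechanism it relies on, individual cores can be taken offline through sysfs (which cores to pick depends on the compute-complex layout; the stride below is purely illustrative):

#!/bin/bash
# Sketch only: take every other core (1, 3, 5, ...) offline via sysfs.
# hpc6a.48xlarge exposes 96 physical cores, numbered 0-95.
for cpu in $(seq 1 2 95); do
    echo 0 | sudo tee /sys/devices/system/cpu/cpu${cpu}/online > /dev/null
done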


Save StarCCM+ State in AWS ParallelCluster

Spot termination gives a 2-minute warning before the instance is reclaimed. This window allows you to gracefully save data so you can resume the job later.

In the following I describe how this can be done with StarCCM+ in AWS ParallelCluster 3.X:

Setup

  1. Create a post-install script spot.sh like so:
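
The contents of spot.sh aren't shown above; the general pattern (a sketch, not necessarily the author's script) is to poll the instance metadata service for a Spot interruption notice and trigger the StarCCM+ save when one appears. Here save_state stands in for whatever save/abort logic you use:

#!/bin/bash
# Sketch: poll IMDS every 5 seconds for a Spot interruption notice.
# The endpoint returns 404 until a termination notice has been issued.
while true; do
    code=$(curl -s -o /dev/null -w '%{http_code}' \
        http://169.254.169.254/latest/meta-data/spot/instance-action)
    if [ "$code" = "200" ]; then
        save_state   # hypothetical placeholder for saving the StarCCM+ simulation
        break
    fi
    sleep 5
done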

AWS ParallelCluster Stop Head Node

Here's a simple crontab entry that stops the Head Node if it has been idle for 10 minutes.

*/10 * * * * if [ ! $(squeue | wc -l) -ge 2 ]; then aws ec2 stop-instances --instance-ids $(curl -s http://169.254.169.254/latest/meta-data/instance-id) --region $(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document|grep region|awk -F\" '{print $4}'); fi

Since that line of code is awful and long, let me break it down:
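
Expanded with comments, the same logic looks like this:

#!/bin/bash
# Run from cron every 10 minutes.
# squeue prints a header line plus one line per job, so fewer than 2 lines
# means there are no running or pending jobs.
if [ "$(squeue | wc -l)" -lt 2 ]; then
    # Fetch this instance's ID and region from the instance metadata service...
    instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    region=$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document \
        | grep region | awk -F\" '{print $4}')
    # ...and stop the head node.
    aws ec2 stop-instances --instance-ids "$instance_id" --region "$region"
fi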

Mount FSx Lustre on AWS Batch

This guide describes how to mount an FSx for Lustre filesystem on AWS Batch. I give an example CloudFormation stack to create the AWS Batch resources.

I loosely follow this guide.

For the parameters, it's important that the Subnet, Security Group, FSx ID, and FSx Mount Name follow the guidelines below:

Parameter Description
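
The parameter table is truncated above. As a rough illustration of the mount step itself (the filesystem DNS name and mount name below are placeholders for your own values):

# Assumes the Lustre client is already installed (e.g. via amazon-linux-extras on Amazon Linux 2)
sudo mkdir -p /fsx
sudo mount -t lustre -o noatime,flock \
    fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com@tcp:/mountname /fsx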

Enable SMS MFA

To enable Multi-Factor Authentication (MFA) with Pcluster Manager, there are two setup steps that need to be completed.

  1. Set up an Origination Number
  2. Add a sandbox number

1. Setup an Origination Number

  1. Navigate to Pinpoint Phone Numbers Console > Click Request Phone Number