All credit to @tpbrown for this solution.
Usage:
ssh clustername
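The credited snippet isn't reproduced here, but the general idea is an entry in ~/.ssh/config for each cluster so the cluster name resolves to its head node. A minimal sketch, with placeholder host name, user, and key path (this is illustrative only, not necessarily the credited solution):

# ~/.ssh/config -- placeholder values, adjust per cluster
Host clustername
    HostName <head-node-public-ip-or-dns>
    User ec2-user
    IdentityFile ~/.ssh/your-cluster-key.pem

With an entry like this in place, ssh clustername works from any terminal.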
If your cluster tries 10 times to launch instances and fails, it'll automatically go into PROTECTED mode. This disables instance provisioning until the compute fleet is restarted.
You'll see inact as the status of the queue when the cluster is in PROTECTED mode:
[ec2-user@ip-10-0-0-98 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
default* inact infinite 2 idle~ spot-dy-compute-[1-100]
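To recover, fix whatever was causing the launch failures and then cycle the compute fleet. With the ParallelCluster 3.x CLI this looks roughly like the following (cluster name is a placeholder; older 2.x releases used pcluster stop/start instead):

# stop the compute fleet, then start it again to clear PROTECTED mode
pcluster update-compute-fleet --cluster-name mycluster --status STOP_REQUESTED
# wait until the fleet reports STOPPED (pcluster describe-compute-fleet), then:
pcluster update-compute-fleet --cluster-name mycluster --status START_REQUESTED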
#!/bin/bash
# Usage: bash remove-bucket.sh bucket1
for bucket in $(aws s3 ls | grep $1 | awk '{ print $3}'); do
  echo "Deleting ${bucket}..."
  aws s3 rm --recursive s3://${bucket};
  aws s3 rb --force s3://${bucket};
done
All-or-nothing scaling is useful when you need to run MPI jobs that can't start until all N instances have joined the cluster.
Slurm launches instances in a best-effort fashion: if you request 10 instances but only 9 can be obtained, it provisions those 9 and keeps trying to get the last one. This incurs cost for jobs that need all 10 instances before they can start.
For example, if you submit a job like:
sbatch -N 10 ...
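With best-effort launches, Slurm might bring up 9 of the 10 instances and leave them idle (and billing) while it retries the last one. One way to avoid this is to enable ParallelCluster's all-or-nothing instance launches. A hedged sketch is below; the config path and option name come from the aws-parallelcluster-node Slurm resume plugin and should be verified against your ParallelCluster version:

# run on the head node -- path and option name are assumptions, verify for your version
sudo bash -c 'echo "all_or_nothing_batch = True" >> /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf'
# subsequent scale-ups should then launch either all requested instances or none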
You can dynamically create a filesystem per job. This is useful for jobs that need a fast filesystem but where you don't want to pay to keep that filesystem running 24/7.
To accomplish this without wasting time waiting for the filesystem to be created (~15 mins), we've separated the work into three separate jobs:
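For example, a minimal sketch of such a three-job chain using Slurm job dependencies (script names are hypothetical; the create/delete scripts would call whatever filesystem API you use, e.g. FSx for Lustre):

#!/bin/bash
# 1. create the filesystem, 2. run the compute job once it exists, 3. tear it down afterwards
create_id=$(sbatch --parsable create-filesystem.sbatch)
job_id=$(sbatch --parsable --dependency=afterok:${create_id} compute-job.sbatch)
# afterany ensures the filesystem is deleted even if the compute job fails
sbatch --dependency=afterany:${job_id} delete-filesystem.sbatch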
The following Slurm Prolog starts the CUDA MPS server on each compute node before the job is started.
cat << EOF > /opt/slurm/etc/prolog.sh
#!/bin/sh
# start mps
nvidia-cuda-mps-control -d