Sean Smith (sean-smith)

@sean-smith
sean-smith / remove-bucket.sh
Last active February 14, 2023 20:01
BE CAREFUL. This removes all buckets whose names match the string you specify.
#!/bin/bash
# Usage: bash remove-bucket.sh bucket1
for bucket in $(aws s3 ls | grep "$1" | awk '{ print $3 }'); do
    echo "Deleting ${bucket}..."
    aws s3 rm --recursive "s3://${bucket}"
    aws s3 rb --force "s3://${bucket}"
done
@sean-smith
sean-smith / lstc-stuck-licenses.md
Created December 21, 2022 02:34
Remove stuck licenses from LSTC License Manager

LSTC Remove Stuck Licenses

  1. Check status of licenses with lstc_qrun:
$ ./lstc_qrun
Defaulting to server 1 specified by LSTC_LICENSE_SERVER variable


                     Running Programs
@sean-smith
sean-smith / all-or-nothing.md
Last active June 21, 2022 23:54
Launch instances with AWS ParallelCluster All-or-nothing scaling

Enable All-or-Nothing Scaling with AWS ParallelCluster

All-or-nothing scaling is useful when you need to run MPI jobs that can't start until all N instances have joined the cluster.

Slurm launches instances in a best-effort fashion: if you request 10 instances but only 9 are available, it provisions those 9 and keeps trying to get the last one. Meanwhile you pay for 9 idle instances even though the job needs all 10 before it can start.

For example, if you submit a job like:

sbatch -N 10 ...
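
One way to enable all-or-nothing launches is the all_or_nothing_batch option read by ParallelCluster's Slurm resume plugin. The sketch below is an assumption based on ParallelCluster 3.x; the file path and option name may differ between versions:

# Run on the head node
sudo bash -c 'echo "all_or_nothing_batch = True" >> /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf'

With this set, a resume request either launches all of the requested instances or none of them.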
@sean-smith
sean-smith / intel-mpi.md
Created June 7, 2022 14:46
Install older Intel MPI versions

Intel MPI Versions

IntelMPI 2018.2

Download and install

wget http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/12748/l_mpi_2018.2.199.tgz
tar -xzf l_mpi_2018.2.199.tgz
cd l_mpi_2018.2.199/
sudo ./install.sh
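
Once the installer finishes, the MPI environment can be loaded and checked. A minimal sketch assuming the default /opt/intel install prefix (the exact path depends on the choices made in install.sh):

source /opt/intel/impi/2018.2.199/intel64/bin/mpivars.sh
mpirun --version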
@sean-smith
sean-smith / slurm-accounting-aws-parallelcluster.md
Last active May 25, 2022 02:13
Setup Slurm Accounting with AWS ParallelCluster

Slurm Accounting with AWS ParallelCluster

In this tutorial we will work through setting up Slurm Accounting. This enables many features within Slurm, including job resource tracking, and it is a necessary building block for Slurm federation.

Step 1 - Setup External Accounting Database
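
One way to create the external database is an Amazon RDS MariaDB instance. The sketch below is an assumption (identifier, sizing, and credentials are placeholders, and the instance still needs a security group that lets the head node reach port 3306):

aws rds create-db-instance \
    --db-instance-identifier slurm-accounting-db \
    --db-instance-class db.t3.micro \
    --engine mariadb \
    --master-username slurm \
    --master-user-password <your-password> \
    --allocated-storage 20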

sean-smith / aws-parallelcluster-dynamic-fsxl.md

Dynamic Filesystems with AWS ParallelCluster

You can dynamically create a filesystem per job. This is useful for jobs that need a fast filesystem but where you don't want to pay for that filesystem to run 24/7.

In order to accomplish this without wasting time waiting for the filesystem to create (~15 mins), we've separated this into three jobs (see the sketch after the list):

  1. Create the filesystem. This only needs a single EC2 instance and can run on the head node; it takes 8-15 minutes.
  2. Start the job. This first mounts the filesystem before executing the job.
  3. Delete the filesystem.
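
A minimal sketch of chaining the three steps with Slurm job dependencies (the script names are hypothetical placeholders, not files from this gist):

create_id=$(sbatch --parsable create-fsx.sbatch)
job_id=$(sbatch --parsable --dependency=afterok:${create_id} job.sbatch)
sbatch --dependency=afterany:${job_id} delete-fsx.sbatch

Using afterany for the delete step removes the filesystem even if the job itself fails.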
@sean-smith
sean-smith / slurm-mps-prolog.md
Last active May 15, 2022 18:33
Start CUDA MPS Server on each node

👾 Slurm CUDA MPS Prolog

The following Slurm Prolog starts the CUDA MPS server on each compute node before the job is started.

cat << EOF > /opt/slurm/etc/prolog.sh
#!/bin/sh

# start mps
nvidia-cuda-mps-control -d
EOF
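
For slurmd to run it, the script also needs to be executable and registered as the prolog in slurm.conf. A hedged sketch, assuming the default ParallelCluster Slurm install under /opt/slurm:

sudo chmod +x /opt/slurm/etc/prolog.sh
sudo bash -c 'echo "Prolog=/opt/slurm/etc/prolog.sh" >> /opt/slurm/etc/slurm.conf'
sudo scontrol reconfigure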
sean-smith / emoji-wifi-name.md

🚀 Wifi


So naturally the first thing I wanted to do when we got fiber internet was to rename the wifi network to something sexier than "CenturyLink0483". I decided on 🚀.

To do so I navigated to the router setup page at 192.168.0.1, cringing at all the '90s tech it employs.

Then I added 🚀 and tried to update.

@sean-smith
sean-smith / cost-explorer-tags.md
Last active April 29, 2022 22:35
Set up user- and project-level tags in AWS ParallelCluster

AWS ParallelCluster Cost Explorer Tags

In a previous gist we discussed using Cost Explorer to see how much a cluster costs at the instance-type and cluster level.

This gist describes how to get cost at the Project and User level.
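
Once the tags are in place and activated as cost allocation tags, cost can be queried per tag with the AWS CLI. A hedged sketch (the tag key and dates are example placeholders):

aws ce get-cost-and-usage \
    --time-period Start=2022-04-01,End=2022-05-01 \
    --granularity MONTHLY \
    --metrics UnblendedCost \
    --group-by Type=TAG,Key=Project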

Setup

1. Create a Policy

sean-smith / failover-to-ondemand.md

Slurm Failover from Spot to On-Demand

In AWS ParallelCluster you can set up a cluster with two queues, one for Spot pricing and one for On-Demand. When a job fails due to a Spot reclamation, you can automatically requeue that job on the On-Demand queue.
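
As a rough illustration of that failover (a sketch only, not necessarily this gist's exact mechanism; partition names and the job script are assumptions), a job submitted to the Spot queue that ends in NODE_FAIL can be resubmitted to the On-Demand queue:

job_id=$(sbatch --parsable --partition=spot job.sbatch)
# ...once the job has left the queue, check how it ended
state=$(sacct -j "${job_id}" --format=State --noheader | head -n 1 | xargs)
if [ "${state}" = "NODE_FAIL" ]; then
    sbatch --partition=od job.sbatch
fi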

To set that up, first create a cluster with a Spot and an On-Demand queue:

- Name: od
  ComputeResources:
    - Name: c6i-od-c6i32xlarge