All credit to @tpbrown for this solution.
Usage:
ssh clustername
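The credited snippet isn't reproduced here, but the general idea is an entry in ~/.ssh/config for each cluster so the cluster name resolves to its head node. A minimal sketch, with placeholder host name, user, and key path (this is illustrative only, not necessarily the credited solution):

# ~/.ssh/config -- placeholder values, adjust per cluster
Host clustername
    HostName <head-node-public-ip-or-dns>
    User ec2-user
    IdentityFile ~/.ssh/your-cluster-key.pem

With an entry like this in place, ssh clustername works from any terminal.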
If your cluster tries 10 times to launch instances and fails, it'll automatically go into PROTECTED mode. This disables instance provisioning until the compute fleet is restarted.
You'll see inact as the status of the queue when the cluster is in PROTECTED mode:
[ec2-user@ip-10-0-0-98 ~]$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
default* inact infinite 2 idle~ spot-dy-compute-[1-100]
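To recover, fix whatever was causing the launch failures and then cycle the compute fleet. With the ParallelCluster 3.x CLI this looks roughly like the following (cluster name is a placeholder; older 2.x releases used pcluster stop/start instead):

# stop the compute fleet, then start it again to clear PROTECTED mode
pcluster update-compute-fleet --cluster-name mycluster --status STOP_REQUESTED
# wait until the fleet reports STOPPED (pcluster describe-compute-fleet), then:
pcluster update-compute-fleet --cluster-name mycluster --status START_REQUESTED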
#!/bin/bash
# Usage: bash remove-bucket.sh bucket1
for bucket in $(aws s3 ls | grep $1 | awk '{ print $3}'); do
  echo "Deleting ${bucket}..."
  aws s3 rm --recursive s3://${bucket};
  aws s3 rb --force s3://${bucket};
done
All-or-nothing scaling is useful when you need to run MPI jobs that can't start until all N instances have joined the cluster.
Slurm launches instances in a best-effort fashion: if you request 10 instances but only 9 can be obtained, it provisions those 9 and keeps trying to get the last one. This incurs cost for jobs that need all 10 instances before they can start.
For example, if you submit a job like:
sbatch -N 10 ...
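With best-effort launches, Slurm might bring up 9 of the 10 instances and leave them idle (and billing) while it retries the last one. One way to avoid this is to enable ParallelCluster's all-or-nothing instance launches. A hedged sketch is below; the config path and option name come from the aws-parallelcluster-node Slurm resume plugin and should be verified against your ParallelCluster version:

# run on the head node -- path and option name are assumptions, verify for your version
sudo bash -c 'echo "all_or_nothing_batch = True" >> /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf'
# subsequent scale-ups should then launch either all requested instances or none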
You can dynamically create a filesystem per job. This is useful for jobs that need a fast filesystem but where you don't want to pay to keep that filesystem running 24/7.
To accomplish this without wasting time waiting for the filesystem to be created (~15 mins), we've separated the work into three separate jobs:
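For example, a minimal sketch of such a three-job chain using Slurm job dependencies (script names are hypothetical; the create/delete scripts would call whatever filesystem API you use, e.g. FSx for Lustre):

#!/bin/bash
# 1. create the filesystem, 2. run the compute job once it exists, 3. tear it down afterwards
create_id=$(sbatch --parsable create-filesystem.sbatch)
job_id=$(sbatch --parsable --dependency=afterok:${create_id} compute-job.sbatch)
# afterany ensures the filesystem is deleted even if the compute job fails
sbatch --dependency=afterany:${job_id} delete-filesystem.sbatch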
The following Slurm Prolog starts the CUDA MPS server on each compute node before the job is started.
cat << EOF > /opt/slurm/etc/prolog.sh
#!/bin/sh
# start mps
nvidia-cuda-mps-control -d