LSTC Remove Stuck Licenses
- Check the status of licenses with `lstc_qrun`:

```shell
$ ./lstc_qrun
Defaulting to server 1 specified by LSTC_LICENSE_SERVER variable

Running Programs
```
```bash
#!/bin/bash
# Usage: bash remove-bucket.sh bucket1
# Deletes every S3 bucket whose name matches the first argument.
for bucket in $(aws s3 ls | grep "$1" | awk '{ print $3 }'); do
    echo "Deleting ${bucket}..."
    aws s3 rm --recursive "s3://${bucket}"
    aws s3 rb --force "s3://${bucket}"
done
```
All-or-nothing scaling is useful when you need to run MPI jobs that can't start until all N instances have joined the cluster.

Slurm launches instances in a best-effort fashion: if you request 10 instances but it can only get 9, it provisions those 9 and keeps trying to acquire the last one. Those 9 instances incur cost while a job that needs all 10 waits to start.

For example, if you submit a job like:

```shell
sbatch -N 10 ...
```
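Recent versions of AWS ParallelCluster expose an all-or-nothing launch option in their Slurm resume plugin. A sketch of the setting (the file path and option name below are assumptions that vary between ParallelCluster versions — check the documentation for yours):

```ini
# /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
# (path and option name may differ by ParallelCluster version)
all_or_nothing_batch = True
```

With this enabled, the plugin only launches instances for a resume request when the full count is available, so partial fleets don't accrue cost.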
Download and install:

```shell
wget http://registrationcenter-download.intel.com/akdlm/irc_nas/tec/12748/l_mpi_2018.2.199.tgz
tar -xzf l_mpi_2018.2.199.tgz
cd l_mpi_2018.2.199/
sudo ./install.sh
```
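After the installer finishes you can sanity-check the toolchain. The `mpivars.sh` path below assumes Intel MPI 2018's default install prefix and may differ if you changed it during `install.sh`:

```shell
# Load the Intel MPI environment (default prefix assumed), then confirm
# the runtime and compiler wrappers resolve.
source /opt/intel/impi/2018.2.199/intel64/bin/mpivars.sh
mpirun -V        # prints the Intel MPI library version
which mpicc      # should point into the Intel MPI install
```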
You can dynamically create a filesystem per job. This is useful for jobs that need a fast filesystem but where you don't want to pay to keep that filesystem running 24/7.

To avoid wasting time waiting for the filesystem to be created (~15 mins), we've separated this into three separate jobs:
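The three-job chain can be sketched with Slurm job dependencies; the script names below are placeholders for the create/run/delete jobs described above:

```shell
# Submit the three stages so each waits on the previous one.
# --parsable makes sbatch print only the job ID.
CREATE=$(sbatch --parsable create-filesystem.sbatch)
APP=$(sbatch --parsable --dependency=afterok:${CREATE} run-application.sbatch)
# afterany: tear the filesystem down even if the application job fails.
sbatch --dependency=afterany:${APP} delete-filesystem.sbatch
```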
The following Slurm prolog starts the CUDA MPS server on each compute node before the job starts:

```shell
cat << EOF > /opt/slurm/etc/prolog.sh
#!/bin/sh
# start mps
nvidia-cuda-mps-control -d
EOF
```
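For Slurm to run the script it must be registered as the cluster prolog; a sketch, assuming the `/opt/slurm/etc` layout used above:

```shell
# Make the prolog executable, register it in slurm.conf, and reload
# the controller configuration so the change takes effect.
chmod 744 /opt/slurm/etc/prolog.sh
echo "Prolog=/opt/slurm/etc/prolog.sh" >> /opt/slurm/etc/slurm.conf
scontrol reconfigure
```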
So naturally, the first thing I wanted to do when we got fiber internet was rename the wifi network to something sexier than "CenturyLink0483". I decided on

To do so, I navigated to the router setup page at 192.168.0.1, cringing at all the '90s tech it employs.

Then I added
In AWS ParallelCluster you can set up a cluster with two queues, one for Spot pricing and one for On-Demand. When a job fails due to a Spot reclamation, you can automatically requeue that job on the On-Demand queue.

To set that up, first create a cluster with a Spot and an On-Demand queue:

```yaml
- Name: od
  ComputeResources:
    - Name: c6i-od-c6i32xlarge
```
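The requeue itself can be sketched with plain `scontrol` commands. The job ID is illustrative, the queue name `od` matches the config above, and in practice these would run from whatever hook detects the Spot failure:

```shell
# Put the failed job back in the pending state, then retarget it at the
# on-demand partition before it is scheduled again.
scontrol requeue 1234
scontrol update jobid=1234 partition=od
```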