I decided to write this primer to encourage lab members to use our compute cluster.
We don’t really do distributed computing, where tasks run across nodes or
processes and communicate with each other. Thus, I’ll focus on the case where
only one task is requested (i.e. --ntasks=1).
- A job is a description of what to do. It is composed of one or more steps, which can execute serially or concurrently.
- A task is a unit of work performed by a job step.
- A node is a physical machine (e.g. bart).
- TRES: trackable resources, which are CPUs, memory or GRES (generic resources).
--cpus-per-task 4
or --tres-per-task cpu=4
--mem 32G
or --mem-per-cpu 8G (memory is not a valid TRES for --tres-per-task)
--gres scratch:100G
or --tres-per-task gres/scratch=100G
--time 1-12:20:15
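Put together, these flags can sit at the top of a batch script (batch scripts are covered further down). The following is only a minimal sketch that reuses the example values above; they are not recommendations. SLURM_CPUS_PER_TASK is exported by Slurm when --cpus-per-task is set, and the fallback to 1 is my own addition so the script can also be tried outside Slurm.

```shell
#!/bin/bash
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --mem 32G
#SBATCH --time 1-12:20:15

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# Assumption: default to 1 so the script also runs outside Slurm.
threads="${SLURM_CPUS_PER_TASK:-1}"
echo "Running with $threads thread(s)"
```

Passing the value of $threads to your program keeps it from starting more threads than the CPUs it was given.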
srun allows you to run commands via Slurm.
srun echo "Hello world!"
Run your current shell with allocated resources.
srun --pty --mem 8G $SHELL
# a new shell will be created
$ echo "Hello world"
$ exit # or Ctrl-D
The --pty option creates a pseudo-terminal, making it possible to use key
sequences such as Ctrl-C and Ctrl-D.
Reference: https://slurm.schedmd.com/srun.html
A batch script is a special kind of script that is interpreted by sbatch. It is
essentially a shell script with a special syntax for comments, allowing you to
conveniently allocate the resources you need.
In a file named script.sh:
#!/bin/bash
#SBATCH --cpus-per-task 4 --mem 32G
echo "Hello world!"
Which can then be launched with:
sbatch script.sh
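Several commands later in this primer take a job ID. One way to capture it at submission time is the --parsable flag of sbatch, which prints just the job ID, optionally followed by ";" and the cluster name. The helper below is a small sketch; its name and the example value "12345;mycluster" are made up for illustration.

```shell
#!/bin/bash
# With --parsable, sbatch prints "jobid" or "jobid;cluster" instead of
# the usual "Submitted batch job <jobid>" message.
job_id_from_parsable() {
    # Strip everything from the first ';' onwards, if there is one.
    printf '%s\n' "${1%%;*}"
}

# Typical use (commented out so the sketch runs without Slurm):
#   jobid=$(job_id_from_parsable "$(sbatch --parsable script.sh)")

# Made-up example value, for illustration only.
job_id_from_parsable "12345;mycluster"
```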
A nicety of sbatch is that you can use srun within it to create job steps.
This allows you to create heterogeneous jobs that can request different amounts
of resources at different stages.
#!/bin/bash
# a step cannot use more than the job itself has allocated
#SBATCH --cpus-per-task 4
srun --cpus-per-task 4 echo "Heavily parallelized Hello world!"
srun --cpus-per-task 1 echo "Regular Hello world!"
Nothing prevents you from running those steps in parallel, or from interleaving parallel and serial steps.
#!/bin/bash
# allocate enough CPUs for both steps to run at the same time
#SBATCH --cpus-per-task 5
srun --cpus-per-task 4 echo "Heavily parallelized Hello world!" &
srun --cpus-per-task 1 echo "Regular Hello world!" &
wait # wait until both steps finish
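A bare wait does not tell you whether a step failed. If that matters, one option is to remember the PID of each background srun and wait on them one by one, since wait with a PID returns that process's exit status. This is a sketch under a few assumptions: the run_step helper is my own, and its fallback to running the command directly exists only so the example can be tried on a machine without Slurm.

```shell
#!/bin/bash
#SBATCH --cpus-per-task 5

# Hypothetical helper: run a command as a job step with the given number
# of CPUs, or run it directly when srun is not installed.
run_step() {
    local cpus="$1"
    shift
    if command -v srun >/dev/null 2>&1; then
        srun --cpus-per-task "$cpus" "$@"
    else
        "$@"
    fi
}

run_step 4 echo "Heavily parallelized Hello world!" &
pid_heavy=$!
run_step 1 echo "Regular Hello world!" &
pid_regular=$!

# wait <pid> returns the exit status of that specific background process.
failed=0
wait "$pid_heavy" || failed=1
wait "$pid_regular" || failed=1
echo "failed=$failed"
```

In a real batch script you would probably end with exit "$failed", so that the job's own exit code reflects a failed step.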
Reference: https://slurm.schedmd.com/sbatch.html
Some workflows require the creation of hundreds if not thousands of similar jobs.
To not overwhelm Slurm (and also your ability to track those jobs), you can
create a job array with the --array flag.
It accepts a range of integer values (e.g. 0-99) or multiple values separated
by commas (e.g. 1,2,3,5,6). Ranges are inclusive.
In the job, you have access to the $SLURM_ARRAY_TASK_ID variable.
#!/bin/bash
#SBATCH --array 0-99
echo "Hello world from parallel universe #$SLURM_ARRAY_TASK_ID!"
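In practice the task ID is usually used to pick an input. Here is a minimal sketch that maps each array task to one entry of a bash array. The sample names are made up, and the default of 0 for the task ID is my own addition so the script can be tested outside Slurm.

```shell
#!/bin/bash
#SBATCH --array 0-2

# Hypothetical sample names; replace them with your own inputs.
samples=(sample_a sample_b sample_c)

# SLURM_ARRAY_TASK_ID is set by Slurm for each task of the array.
# Assumption: default to 0 so the script also runs outside Slurm.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
sample="${samples[$task_id]}"

echo "Task $task_id processes $sample"
```

The range given to --array has to match the indices of the array (here 0 to 2), since the range is inclusive.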
Job arrays only work with sbatch.
Use --output and --error to redirect the output of a Slurm job to a file. You
can include the job ID with %j.
#!/bin/bash
#SBATCH --output output-%j.log --error error-%j.log
echo "Hello world!" # will be written to output-%j.log
For job arrays, use %A for the array's master job ID and %a for the array task ID.
#!/bin/bash
#SBATCH --array 0-99 --output output-%A-%a.log --error error-%A-%a.log
echo "Hello world!" # will be written to output-%A-%a.log
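With a hundred tasks, opening every log by hand gets tedious. One possible shortcut is to list only the error logs that are not empty. This is a rough sketch: it assumes the error-%A-%a.log naming used above, the function name is my own, and a non-empty error log does not necessarily mean the task failed.

```shell
#!/bin/bash
# Print the error logs in a directory that contain at least one byte.
# Assumes the files follow the error-<jobid>-<taskid>.log pattern.
report_nonempty_logs() {
    local dir="${1:-.}"
    local log
    for log in "$dir"/error-*-*.log; do
        # -s is true when the file exists and is not empty.
        if [ -s "$log" ]; then
            echo "$log"
        fi
    done
    return 0
}

report_nonempty_logs .
```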
To get notified by email when a job completes or fails, add the --mail-user
and --mail-type flags to your submission.
sbatch --mail-user=you@mail.ubc.ca --mail-type=END,FAIL script.sh
If you're using a job array, only one mail is sent for the whole job!
Once jobs are submitted to the cluster, you can monitor them with squeue.
squeue --me
The --me flag restricts the output to your own jobs.
Reference: https://slurm.schedmd.com/squeue.html
To view the output of a running job step in real time, use sattach with the
job ID and the step ID:
sattach $jobid.$stepid
Reference: https://slurm.schedmd.com/sattach.html
Use scancel to cancel a job that is either pending or running.
scancel $jobid
Reference: https://slurm.schedmd.com/scancel.html