arteymix/slurm-primer.md

## slurm-primer.md

      
Slurm Primer!

I decided to write this primer to encourage lab members to use our
compute cluster.
We don’t really do distributed computing with tasks that run across nodes/processes
and communicate with each other. Thus, I’ll focus on the case where only one task is
requested (i.e. --ntasks=1).
Slurm vocabulary


A job is a description of what to do. It is composed of one or more steps, which
can execute serially or concurrently.
A task is a unit of work performed by a job step.
A node is a physical machine (e.g. bart).
TRES stands for trackable resources, such as CPUs, memory or GRES (generic resources).

Resource allocation

Request a number of CPUs:

--cpus-per-task 4 or --tres-per-task cpu:4
Request a certain amount of memory:

--mem 32G or --tres-per-task mem:32G
Request a certain amount of scratch space:

--gres scratch:100G or --tres-per-task gres/scratch:100G
Request a certain amount of time (days-hours:minutes:seconds):

--time 1-12:20:15
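The time limit above reads as days-hours:minutes:seconds. To make the arithmetic concrete, here is a small sketch that converts such a value to seconds. The function name `slurm_time_to_seconds` is made up for this example, and it only handles the full D-HH:MM:SS form, not the shorter forms Slurm also accepts.

```bash
#!/bin/bash
# Convert a time limit written as D-HH:MM:SS to seconds.
# Hypothetical helper for illustration; it is not part of Slurm.
slurm_time_to_seconds() {
    local spec=$1 days=0 h m s
    if [[ $spec == *-* ]]; then
        days=${spec%%-*}   # part before the dash
        spec=${spec#*-}    # part after the dash
    fi
    IFS=: read -r h m s <<< "$spec"
    # 10# forces base 10 so that values like "08" are not parsed as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

slurm_time_to_seconds 1-12:20:15   # prints 130815
```

So the request above asks for a little over 36 hours of wall-clock time.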
Running jobs

srun

srun allows you to run commands via Slurm.
srun echo "Hello world!"
Interactive mode

Run your current shell with allocated resources.
srun --pty --mem 8G $SHELL
# a new shell will be created
$ echo "Hello world"
$ exit # or Ctrl-D
The --pty option will create a pseudo-terminal, making it possible to use
control sequences such as Ctrl-C and Ctrl-D.
Reference: https://slurm.schedmd.com/srun.html
Batch scripts (sbatch)

A batch script is a special kind of script that is interpreted by sbatch. It is
essentially a shell script in which comments starting with #SBATCH carry sbatch
options, allowing you to conveniently allocate the resources you need.
In a file named script.sh:
#!/bin/bash
#SBATCH --cpus-per-task 4 --mem 32G
echo "Hello world!"
Which can then be launched with:
sbatch script.sh
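Inside the script, Slurm exposes the allocation through environment variables, so the request does not have to be repeated by hand. A minimal sketch, assuming the program you run takes a thread count: the `threads` variable and the fallback to 1 outside of Slurm are illustrative choices, while `SLURM_CPUS_PER_TASK` is set by Slurm when --cpus-per-task is given.

```bash
#!/bin/bash
#SBATCH --cpus-per-task 4 --mem 32G

# SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested.
# Fall back to 1 so the script also runs outside of an allocation.
threads=${SLURM_CPUS_PER_TASK:-1}
echo "Hello world from $threads CPU(s)!"
```

Passing `$threads` to your program keeps the script and the resource request in sync when you later change the #SBATCH line.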

A nicety of sbatch is that you can use srun within it to create job steps.
This allows you to create jobs whose steps request different amounts of
resources at different stages.
#!/bin/bash
# the allocation must be large enough for the biggest step
#SBATCH --cpus-per-task 4
srun --cpus-per-task 4 echo "Heavily parallelized Hello world!"
srun --cpus-per-task 1 echo "Regular Hello world!"
Nothing prevents you from running those steps in parallel, or interleaving parallel and
serial steps.
#!/bin/bash
# the allocation must cover both concurrent steps (4 + 1 CPUs)
#SBATCH --cpus-per-task 5
srun --cpus-per-task 4 echo "Heavily parallelized Hello world!" &
srun --cpus-per-task 1 echo "Regular Hello world!" &
wait # wait until both steps finish
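A bare `wait` does not tell you which step failed. One possible variation, sketched here, records the PID of each background step and waits on them one by one. The `step` array and its fallback to running commands directly when srun is not installed are assumptions made so the sketch can be tried on a workstation; `true` and `false` stand in for real work.

```bash
#!/bin/bash
#SBATCH --cpus-per-task 2

# Use srun when available, otherwise run the commands directly
# (illustrative fallback, not something Slurm provides).
step=()
if command -v srun > /dev/null; then
    step=(srun --cpus-per-task 1)
fi

"${step[@]}" true &
pid_ok=$!
"${step[@]}" false &
pid_fail=$!

# `wait PID` returns the exit status of that specific background command.
failed=0
for pid in "$pid_ok" "$pid_fail"; do
    if ! wait "$pid"; then
        failed=$((failed + 1))
    fi
done
echo "$failed step(s) failed"
```

The script could then end with `exit` and a non-zero status when `failed` is greater than zero, so that the job itself is reported as failed.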
Reference: https://slurm.schedmd.com/sbatch.html
Job array

Some workflows require the creation of hundreds if not thousands of similar jobs.
To avoid overwhelming Slurm, and your own ability to track those jobs, you can
create a job array with the --array flag.
It accepts a range of integer values (e.g. 0-99) or multiple values separated
by commas (e.g. 1,2,3,5,6). Ranges are inclusive.
In the job, you have access to the $SLURM_ARRAY_TASK_ID variable.
#!/bin/bash
#SBATCH --array 0-99
echo "Hello world from parallel universe #$SLURM_ARRAY_TASK_ID!"
Job arrays only work with sbatch.
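A common use of the task ID is to pick one input per array task. The sketch below assumes a hard-coded list of hypothetical sample names; in practice the list might come from a file. The default of 0 is only there so the script can be tried outside of Slurm.

```bash
#!/bin/bash
#SBATCH --array 0-2

# Hypothetical inputs, one per array task.
samples=(sampleA sampleB sampleC)

# SLURM_ARRAY_TASK_ID is set by Slurm for each task of the array.
task_id=${SLURM_ARRAY_TASK_ID:-0}
sample=${samples[$task_id]}
echo "Task $task_id processes $sample"
```

The range given to --array has to match the indices of the list, here 0 to 2 for three samples.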
Send job output and error to a file

Use --output and --error to redirect the output of a Slurm job to a file. You
can include the job ID in the file name with %j.
#!/bin/bash
#SBATCH --output output-%j.log --error error-%j.log
echo "Hello world!" # will be written to output-%j.log
For job arrays, use %A for the job ID of the array as a whole and %a for the task ID (array index).
#!/bin/bash
#SBATCH --array 0-99 --output output-%A-%a.log --error error-%A-%a.log
echo "Hello world!" # will be written to output-%A-%a.log
Email notification

To get notified by email when a job completes or fails, add the --mail-user
and --mail-type flags to your submission.
sbatch --mail-user=you@mail.ubc.ca --mail-type=END,FAIL script.sh
If you're using a job array, only one mail is sent for the whole job, unless you add ARRAY_TASKS to --mail-type.
Tracking jobs

Once jobs are submitted to the cluster, you can monitor them with squeue.
squeue --me
The --me flag restricts the listing to your own jobs.
Reference: https://slurm.schedmd.com/squeue.html
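If a script needs to block until a job is done, squeue can be polled in a loop. This is only a sketch: `wait_for_job` is a made-up helper name and the 30-second default interval is arbitrary. It relies on squeue printing nothing for a job that has left the queue.

```bash
#!/bin/bash
# Block until the given job no longer appears in the queue.
# Hypothetical helper for illustration; it is not part of Slurm.
wait_for_job() {
    local jobid=$1 interval=${2:-30}
    # --noheader drops the column titles and --jobs filters by job ID,
    # so the output is empty once the job is gone.
    while [ -n "$(squeue --noheader --jobs "$jobid" 2> /dev/null)" ]; do
        sleep "$interval"
    done
}
```

For example, `jobid=$(sbatch --parsable script.sh)` captures the job ID at submission, after which `wait_for_job "$jobid"` returns once the job has finished or been cancelled. On multi-cluster setups --parsable also appends the cluster name, which this sketch does not handle.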
To view the output of a running job step in real time, use sattach with the
job ID and step ID:
sattach $jobid.$stepid
Reference: https://slurm.schedmd.com/sattach.html
Cancelling jobs

Use scancel to cancel a job that is either pending or running.
scancel $jobid
Reference: https://slurm.schedmd.com/scancel.html