-
-
Save illusional/b70f870fa0e2f8e7a0ba0a9e71d568f5 to your computer and use it in GitHub Desktop.
# Configure Cromwell to submit jobs to Slurm, including support for Singularity | |
# | |
# Author: | |
# - Michael Franklin <michael.franklin@unimelb.edu.au> | |
# | |
# History: | |
# - 2020-07-23 - Initial + minor fixes to generalise the script | |
# - 2024-04-11 - Use bash instead of sh (based on @oneillkza suggestion) | |
# | |
# Quickstart: | |
# - Replace <location> with a location for a singularity cache. I'd recommend following this GitHub thread for some information about this: https://github.com/broadinstitute/cromwell/pull/5515 | |
# - [Optional] Add a queue after `String? queue` to be `String? queue = "yourqueue" | |
# | |
# About | |
# | |
# - Transform some job information and path to get a reasonable slurm job name including shard + cpu/mem (easier to track) | |
# - We submit a 'wrap' job (currently it's only implemented for submit-docker) to catch times where SLURM kills the job | |
# - The regular 'submit' just submits the variables as required | |
# - For "submit-docker", we use a cache location to pull images to. | |
# - 'duration' is in seconds, and can be passed from your WDL runtime (it's not currently a recognised K-V) | |
# - [OpenWDL #315](https://github.com/openwdl/wdl/pull/315) | |
# - Cromwell doesn't (/ didn't) support ToolTimeRequirement for CWL | |
akka: { | |
"actor.default-dispatcher.fork-join-executor": { | |
"parallelism-max": 3 | |
} | |
} | |
system: { | |
"job-shell": "/bin/bash" | |
} | |
backend: { | |
"default": "slurm-singularity", | |
"providers": { | |
"slurm-singularity": { | |
"actor-factory": "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory", | |
"config": { | |
"filesystems": { | |
"local": { | |
"localization": [ | |
"hard-link", | |
"cached-copy" | |
], | |
"enabled": true, | |
"caching": { | |
"duplication-strategy": [ | |
"hard-link", | |
"cached-copy", | |
"copy", | |
"soft-link" | |
], | |
"hashing-strategy": "fingerprint" | |
} | |
} | |
}, | |
"runtime-attributes": """ | |
Int duration = 86400 | |
Int? cpu = 1 | |
Int memory_mb = 3500 | |
String? docker | |
String? queue | |
String cacheLocation = "<location>" | |
""", | |
"submit": """ | |
jobname='${sub(sub(cwd, ".*call-", ""), "/", "-")}-cpu-${cpu}-mem-${memory_mb}' | |
sbatch \ | |
-J $jobname \ | |
-D ${cwd} \ | |
-o ${out} \ | |
-e ${err} \ | |
-t 0:${duration} \ | |
${"-p " + queue} \ | |
${"-n " + cpu} \ | |
--mem=${memory_mb} \ | |
--wrap "/usr/bin/env ${job_shell} ${script}" | |
""", | |
"submit-docker": """ | |
docker_subbed=$(sed -e 's/[^A-Za-z0-9._-]/_/g' <<< ${docker}) | |
image=${cacheLocation}/$docker_subbed.sif | |
lock_path=${cacheLocation}/$docker_subbed.lock | |
if [ ! -f "$image" ]; then | |
singularity pull $image docker://${docker} | |
fi | |
# Submit the script to SLURM | |
jobname=${sub(sub(cwd, ".*call-", ""), "/", "-")}-cpu-${cpu}-mem-${memory_mb} | |
JOBID=$(sbatch \ | |
--parsable \ | |
-J $jobname \ | |
--mem=${memory_mb} \ | |
--cpus-per-task ${select_first([cpu, 1])} \ | |
${"-p " + queue} \ | |
-D ${cwd} \ | |
-o ${cwd}/execution/stdout \ | |
-e ${cwd}/execution/stderr \ | |
-t '0:${duration}' \ | |
--wrap "singularity exec --bind ${cwd}:${docker_cwd} $image ${job_shell} ${docker_script}") \ | |
&& NTOKDEP=$(sbatch --parsable --kill-on-invalid-dep=yes --dependency=afternotokay:$JOBID --wrap '[ ! -f rc ] && (echo 1 >> ${cwd}/execution/rc) && (echo "A slurm error occurred" >> ${cwd}/execution/stderr)') \ | |
&& echo Submitted batch job $JOBID""", | |
"kill": "scancel ${job_id}", | |
"check-alive": "scontrol show job ${job_id}", | |
"job-id-regex": "Submitted batch job (\\d+).*" | |
} | |
} | |
} | |
} | |
call-caching: { | |
"enabled": true | |
} |
Same issue here -- using afternotok
seems to fix the problem (thanks for the heads up, @bpow!). I'm using slurm 22.05.0, and can't find any online documentation for slurm using afternotokay
. For others looking to use this, these were the errors I was getting before switching to afternotok
:
In the task stderr:
sbatch: error: Batch job submission failed: Job dependency problem
So I ran into an interesting problem with line 30, where I was running a workflow that ran the following (BASH-dependent) command in its scripts:
set -euo pipefail
But this is basically telling the system to use sh instead of bash:
system: {
"job-shell": "/bin/sh"
}
On many systems /bin/sh
is just a symlink to /bin/bash
, but not Ubuntu, which is what most containers are based on. So this workflow tried to run, within an Ubuntu-based container, and threw the error:
set: Illegal option -o pipefail
Which is to say, it's much safer to instead set:
system: {
"job-shell": "/bin/bash"
}
Thanks @oneillkza, good shout - I've changed this in the config above :)
I had to use
afternotok
instead ofafternotokay
-- could be a slurm version difference.