@hungyiwu
Created June 23, 2020 13:59
Handy one-line command to get the task IDs of timed-out SLURM job array tasks for a re-run
#!/bin/bash
# Use case:
# You ran a job array in SLURM with something like `sbatch --array=1-300 run.sh`
# ...checked the error logs with `tail -n 1 *.err` and saw that many tasks timed out
# ...and would like to re-run those tasks with a higher time limit, but first need to know their task IDs.
# Instead of jotting down the IDs by scanning the terminal with your eyes,
# or writing yet another Python script to parse the error logs,
# you can use this one-line command
sacct -j [JOB-ID] -s to --brief \
| grep TIMEOUT \
| cut -d ' ' -f 1 \
| cut -d '_' -f 2 \
| paste -sd ','
# Explanation:
# sacct: `-j [JOB-ID]` filters by job ID, `-s to` filters by job state (`to` is short for TIMEOUT)
# `--brief` limits the output to the job ID, state, and exit code columns
# grep: `sacct -s to` also prints lines with state `CANCELLED` (likely the job steps, such as `.batch`, of the timed-out tasks),
# so filter once more here; this also drops the header lines
# at this point the output will look like
# ```
# 11166431_247 TIMEOUT 0:0
# 11166431_249 TIMEOUT 0:0
# ```
# cut (first call): `-d ' '` splits each line on spaces and `-f 1` keeps only the first field
# this gives
# ```
# 11166431_247
# 11166431_249
# ```
# cut (second call): `-d '_' -f 2` splits each line on the underscore and keeps only the second field
# this gives
# ```
# 247
# 249
# ```
# paste: `-s` combines all lines into one line, and `-d ','` inserts a comma as the delimiter
# this gives
# ```
# 247,249
# ```
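# To try the text-processing stages without access to a cluster, here is a minimal sketch
# that feeds canned text through the same grep/cut/paste stages. The sample lines
# (including the header and the CANCELLED line) are assumptions standing in for real
# `sacct` output, and the variable names are made up for illustration.

```shell
#!/bin/bash
# Assumed stand-in for the output of `sacct -j [JOB-ID] -s to --brief`.
sample_output='JobID State ExitCode
11166431_247 TIMEOUT 0:0
11166431_247.batch CANCELLED 0:15
11166431_249 TIMEOUT 0:0'

# Same stages as the one-liner; the trailing `-` tells paste to read stdin
# explicitly, which some non-GNU paste implementations require.
ids=$(printf '%s\n' "$sample_output" \
    | grep TIMEOUT \
    | cut -d ' ' -f 1 \
    | cut -d '_' -f 2 \
    | paste -sd ',' -)

echo "$ids"   # prints 247,249
```

# The header and the CANCELLED line are dropped by grep, so only the two TIMEOUT task IDs remain.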
# Now you can raise the time limit in the original job script (e.g. `run.sh`) and copy-paste those task IDs for the re-run,
# something like `sbatch --array=247,249 run.sh`
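# The re-run step can also be scripted. The sketch below is one possible way, not part of the
# original one-liner: `timed_out_ids` is a hypothetical helper that swaps grep/cut for awk,
# the input lines are canned so it runs without a cluster, and `02:00:00` is a placeholder
# time limit. It assumes that `--time` given on the `sbatch` command line takes precedence
# over the `#SBATCH --time` line in the script.

```shell
#!/bin/bash
# Hypothetical helper: reads `sacct --brief`-style text on stdin and prints the
# comma-separated array task IDs whose state column is exactly TIMEOUT.
timed_out_ids() {
    awk '$2 == "TIMEOUT" { split($1, parts, "_"); print parts[2] }' | paste -sd ',' -
}

# Canned input standing in for real sacct output (assumed for illustration).
ids=$(printf '%s\n' \
    '11166431_247 TIMEOUT 0:0' \
    '11166431_248 COMPLETED 0:0' \
    '11166431_249 TIMEOUT 0:0' | timed_out_ids)

# Build and print the resubmission command instead of running it;
# 02:00:00 is a placeholder, not a recommended limit.
if [ -n "$ids" ]; then
    resubmit_cmd="sbatch --array=$ids --time=02:00:00 run.sh"
    echo "$resubmit_cmd"
fi
```

# On a cluster, the canned input would be replaced by `sacct -j [JOB-ID] -s to --brief | timed_out_ids`.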