@wflynny
Last active July 29, 2020 02:00
Small utility to run top or nvidia-smi on a compute node from the login node
#!/usr/bin/env bash
TEMP=$(getopt -o hsg --long help,snapshot,gpu -n 'susage' -- "$@")
if [ $? != 0 ] ; then echo "Terminating..." >&2 ; exit 1 ; fi
# Note the quotes around `$TEMP': they are essential!
eval set -- "$TEMP"
SNAPSHOT=false
GPU=false
while true; do
    case "$1" in
        -h | --help ) echo "susage [-h/--help] [-s/--snapshot] [-g/--gpu] JOBID"; exit 0 ;;
        -s | --snapshot ) SNAPSHOT=true; shift ;;
        -g | --gpu ) GPU=true; shift ;;
        -- ) shift; break ;;
        * ) break ;;
    esac
done
JOBID=$1
if [[ -z $JOBID ]]; then
    echo "Please supply SLURM jobid." >&2
    exit 2
fi
# Check that this job is actually RUNNING (not PENDING/HELD)
scontrol show job "${JOBID}" | grep -q "JobState=RUNNING" \
    || { echo "Job ${JOBID} is not currently running. Exiting." >&2; exit 1; }
# Unfortunately nvidia-smi and top have different defaults,
# with top automatically polling and nvidia requiring an option to poll
top_options="-u ${USER}"
smi_options="-l 3"
if [[ "$SNAPSHOT" == true ]]; then
    top_options="-n 1 -u ${USER}"
    smi_options=""
fi
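# Build the remote command as an array so that each argument reaches srun intact
cmd=(top ${top_options})
srun_options="-p compute -q batch"
if [[ "$GPU" == true ]]; then
    # srun needs an executable rather than a shell brace group, so run both steps via bash -c.
    # Assumption: a login shell (-l) defines the `module` command on the compute node.
    cmd=(bash -lc "module load cuda10.0/toolkit; nvidia-smi ${smi_options}")
    srun_options="-q dev"
fi
# Turn "NodeList=<node>" from scontrol into the srun flag "--nodelist=<node>"
nodelist="--$(scontrol show job "${JOBID}" | grep -Eo 'NodeList=([a-z0-9]+)' | tr '[:upper:]' '[:lower:]')"
srun ${nodelist} ${srun_options} --pty "${cmd[@]}"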
@asafpr

asafpr commented Jun 14, 2020

Should put parentheses around echo "Job ${JOBID} is not currently running. Exiting."; exit 1

@wflynny
Author

wflynny commented Jun 14, 2020

> Should put parentheses around echo "Job ${JOBID} is not currently running. Exiting."; exit 1

Changed the offending portion to look like { cmd1; cmd2; } so that the exit happens in the context of the current shell (rather than in a subshell, as it would with (...)).
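
For anyone comparing the two grouping forms, here is a minimal sketch of the difference. The function names and messages are made up for illustration; each function is called inside $( ) only so that the demo itself keeps running after the exit.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; only the grouping syntax matters here.

check_with_subshell() {
    # ( ... ) runs in a child process, so exit 1 ends only that child
    # and execution continues with the next line.
    false || ( echo "not running"; exit 1 )
    echo "still here"
}

check_with_group() {
    # { ...; } runs in the current shell, so exit 1 ends the caller
    # and the next line is never reached.
    false || { echo "not running"; exit 1; }
    echo "still here"
}

# Capture what each variant prints before it stops (or fails to stop).
subshell_output=$(check_with_subshell)
group_output=$(check_with_group)

printf 'with ( ):\n%s\n' "$subshell_output"
printf 'with { }:\n%s\n' "$group_output"
```

With parentheses the script would carry on to the srun call even for a job that is not running, which is why the brace form is the one that fits here.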

@wflynny
Author

wflynny commented Jul 28, 2020

Note that the nvidia-smi functionality doesn't currently work. srun requires an executable, so the cmd should probably be a function, but I can't get it to work 100% correctly.
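
One possible way around the executable requirement, sketched under assumptions rather than tested on a cluster: hand the launcher `bash -c` with the whole command sequence as a single argument, and keep the pieces in an array so they are not re-split. In the sketch below, `launch` is a hypothetical stand-in for `srun --pty` (it simply executes its first argument as a program), and the `echo` commands are placeholders for `module load` and `nvidia-smi`.

```shell
#!/usr/bin/env bash
# `launch` is a hypothetical stand-in for `srun --pty`: like srun, it treats
# its first argument as a program to execute, not as shell syntax.
launch() { env "$@"; }

# A brace group stored in a string is split into plain words, so the launcher
# looks for a program literally named "{" and fails.
cmd_string="{ echo loaded; echo smi-output; }"
if launch ${cmd_string} 2>/dev/null; then
    string_result="ran"
else
    string_result="failed"
fi

# Wrapping the commands in `bash -c` gives the launcher a real executable,
# and the array keeps the command text together as one argument.
cmd_array=(bash -c "echo loaded; echo smi-output")
array_output=$(launch "${cmd_array[@]}")

printf 'string form: %s\n' "$string_result"
printf 'array form output:\n%s\n' "$array_output"
```

Whether `module` is defined inside a non-interactive `bash -c` depends on how the cluster sets up its environment; a login shell (`bash -lc`) may be needed, and that part is an assumption.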
