The following work with Son of Grid Engine (SGE) 8.1.9 as configured on the University of Sheffield's ShARC and Iceberg clusters.
You can use the -hold_jid <job-ID or job-name>
option to make jobs run
only when other jobs have finished, rather than having jobs start and sit
waiting for other tasks to complete.
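For example, to start one job only after another named job has completed (the job names and script names here are hypothetical):
qsub -N preproc preproc.sge
qsub -N postproc -hold_jid preproc postproc.sge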
You can query multiple SGE log (accounting) files in series using:
for f in $SGE_ROOT/default/common/accounting $SGE_ROOT/default/common/accounting-archive/accounting-*; do
    qacct -f "$f" -j '*'
done
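Similarly, to summarise usage for a single user across the same files (te1st being a hypothetical username):
for f in $SGE_ROOT/default/common/accounting $SGE_ROOT/default/common/accounting-archive/accounting-*; do
    qacct -f "$f" -o te1st
done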
qhost -j -h hal9000-node099
function sge_jobs_host_grp ()
{
    if [[ $# -ne 1 || "$1" == '-h' ]]; then
        echo "Show all jobs running on all hosts of a Grid Engine host group." 1>&2;
        echo "usage: sge_jobs_host_grp name_of_sge_host_group" 1>&2;
        return;
    fi;
    all_hosts="$(qconf -shgrp_tree "$1" | grep "$SGE_CLUSTER_NAME" | xargs | tr ' ' ',')";
    qhost -j -h "$all_hosts"
}
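Usage, assuming a host group named @gpu-nodes exists (host group names start with @):
sge_jobs_host_grp @gpu-nodes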
Here gpu resources are general-purpose GPUs (GPGPUs) and gfx resources are for hardware-accelerated visualisation.
qhost -F gpu,gfx | grep -B1 'Host Resource'
qhost -l gpu=1 -u '*'
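To request one of these resources when submitting a job (a sketch; the script name is hypothetical):
qsub -l gpu=1 myjob.sge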
qstat -j | grep 'not available'
qstat -f
qstat -u \* -q some.q -t -s r
Here:
-t - extended information about the controlled sub-tasks of the displayed parallel jobs
-s r - job state is running
qstat -f -explain E -q somequeue.q
From an Administrative Host:
sudo -u root -i qmod -cq somequeue.q@hal9000-node099
To see the queues on a host and their states, e.g. because someone else is doing maintenance:
qhost -q -h hal9000-node103
From an Administrative Host:
sudo -u root -i qmod -d somequeue.q@hal9000-node126
# Fix stuff :)
sudo -u root -i qmod -e somequeue.q@hal9000-node126
NB these operations support wildcards
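For example, to disable all matching queue instances in one go (hypothetical node-name pattern):
sudo -u root -i qmod -d 'somequeue.q@hal9000-node1*'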
From an Administrative Host:
sudo -u root -i qconf -mq somequeue.q
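To just view the queue configuration without modifying it (this can typically be run by an ordinary user):
qconf -sq somequeue.q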
qstat -s r -g t
# List the host groups whose definition mentions $1 (e.g. a node name)
for aa in $(qconf -shgrpl); do
    qconf -shgrp "$aa" | grep -q "$1" && echo "$aa"
done
qconf -sobjl exechost complex_values '*gfx=*'
qconf -sobjl exechost complex_values '*gfx=2,*'
qconf -sobjl exechost load_values '*num_proc=20,*'
Similar to the above, but using qselect:
qselect -l gfx
qselect -l gfx=2
and possibly limit the output to queues that a user has access to:
qselect -l gfx=2 -u te1st
sudo -u root -i qconf -mattr exechost complex_values my_complex=some_value node666
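To confirm the change took effect (node666 being the hypothetical host used above):
qconf -se node666 | grep complex_values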
qstat -u $USER -r
or
qstat -q 'gpu*.q' -u '*' -r
From prolog scripts etc. Users need to run:
qrsh -pty y bash -li
or
exec qrsh -pty y bash
(the second being effectively what qrshx does).
Node log on node nodeX
(includes messages from user-process-supervising shepherd processes):
/var/spool/sge/nodeX/messages
qmaster log:
$SGE_ROOT/default/spool/qmaster/messages
Location of hostfile on master node of parallel job:
/var/spool/sge/${HOSTNAME}/active_jobs/${JOB_ID}.1/pe_hostfile
munge -n | ssh node666 unmunge
A chain of rules for limiting resources (pre-defined or custom complexes) per resource consumer (e.g. user, host, queue, project, department, parallel env).
If a complex is defined with requestable set to FORCED then it needs to be explicitly requested by the user for it to be used.
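Resource quota sets can be listed and inspected with:
qconf -srqsl
qconf -srqs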
Typically you need to ensure resources are reserved for larger (e.g. parallel) jobs so that they're not forever waiting behind smaller (e.g. single-core) jobs. E.g.
$ qconf -ssconf
...
max_reservation 20
default_duration 8760:00:00
You then need to submit the larger jobs using -R y.
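For example (the parallel environment name mpi and the script name are hypothetical):
qsub -pe mpi 64 -R y mybigjob.sge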
Not SGE-specific but relevant and useful.
Works even in a qrsh session where JOB_ID
is not set.
cat /dev/cpuset/$(awk -F: '/cpuset/ { print $3 }' /proc/$$/cgroup)/cpuset.cpus
NB hyperthreads are enumerated in cgroups
Works even in a qrsh session where JOB_ID
is not set:
qstat -xml -j $(awk -F'/' '/sge/ {print $3}' /proc/$$/cpuset) | xmllint --xpath '/detailed_job_info/djob_info/element/JB_project[text()="gfx"]' - | grep -q gfx
SGE_LONG_JOB_NAMES=1 qstat -u someuser
Grid Engine creates a per-job unix group to help track resources associated with a job.
This group has an integer ID but not a name, as can be seen if you run groups
from a GE job:
$ groups
groups: cannot find name for group ID 20016
somegroup anothergroup 20016 yetanothergroup
To learn the ID of the unix group for the current Grid Engine job:
awk -F= '/^add_grp_id/ { print $2 }' "${SGE_JOB_SPOOL_DIR}/config"
Found in my history after you'd been using my laptop
Show how many jobs each user is running right now
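One possible incantation (a sketch: the awk column assumes the default qstat output format, in which the job owner is the fourth column):
qstat -u '*' -s r | tail -n +3 | awk '{ print $4 }' | sort | uniq -c | sort -rn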