Tips to compile, install and run jobs using SLURM
1- The goal is to make SLURM (https://slurm.schedmd.com/) work properly
2- SLURM is very intricate and difficult to set up (these notes use SLURM 20.02)
3- It may be important to have the NVIDIA library path added, e.g.:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/nvidia-410
4- SLURM depends on MUNGE that can be installed using apt as:
sudo apt-get update
sudo apt-get install libmunge-dev libmunge2 munge
sudo apt-get clean
5- The same is not true for SLURM itself, as its apt package is old
6- So it seems important to compile it from source code, fetched with wget as:
wget https://download.schedmd.com/slurm/slurm-20.02.3.tar.bz2
tar jxvf slurm-20.02.3.tar.bz2
cd slurm-20.02.3
./configure --help &> ../config.help
sudo -E ./configure --with-hdf5=no --with-munge=/usr/lib/libmunge.so &> ../config.output
7- HDF5 may not be readily available for SLURM to compile against
8- Configure defaults to installing binaries in /usr/local/bin and configuration files in /usr/local/etc
sudo -E make -j 39
sudo -E make install
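Depending on the distribution, the freshly installed SLURM libraries in /usr/local/lib may not be picked up right away; refreshing the dynamic linker cache usually suffices (an extra step, not strictly required on every system):
sudo ldconfig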
9- After compiling and installing, create the proper folders and copy the .conf files as below
10- The SLURM_CONF environment variable and the -f argument can set where slurm.conf is,
11- but they do not work properly as they are not shared among all SLURM binaries:
sudo mkdir /var/spool/slurm /var/spool/slurm/d /var/spool/slurm/ctld
sudo mkdir /var/run/slurm /var/log/slurm
sudo cp slurm.conf /usr/local/etc/
sudo cp gres.conf /usr/local/etc/
12- Before starting SLURM itself, it is important to enable and start MUNGE:
sudo systemctl daemon-reload
sudo systemctl enable munge
sudo systemctl start munge
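A quick sanity check that MUNGE is working is to encode and decode a credential, locally and (for a multi-node setup) against another node:
munge -n | unmunge
munge -n | ssh <other_node> unmunge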
13- MUNGE will be used in slurm.conf as:
AuthType=auth/munge
CryptoType=crypto/munge
14- All console output of the SLURM daemons can be redirected to /dev/null (see step 40),
15- since SLURM logs will be written under /var/log as defined in slurm.conf:
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmdSpoolDir=/var/spool/slurm/d
StateSaveLocation=/var/spool/slurm/ctld
16- The SLURM DB daemon (slurmdbd) can be disregarded (MySQL can also be tricky to set up)
17- Without the SLURM DB (MySQL) it is not possible to run sreport,
18- which may be important for listing available GPUs and their usage via:
sreport -tminper cluster utilization --tres="gres/gpu"
19- SLURM accounting storage and job completion records can be written to text files,
20- also under /var/log, as defined in slurm.conf:
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm/accounting.txt
AccountingStoreJobComment=YES
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/job_completion.txt
21- A dedicated SLURM user is tricky to use and configure, so it is easier to
22- use root instead, defined in slurm.conf as:
SlurmUser=root
SlurmdUser=root
23- SLURM runtime PID files and ports are defined in slurm.conf as:
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
ProctrackType=proctrack/linuxproc
24- SLURM can track GPUs as resources by defining in slurm.conf as:
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
GresTypes=gpu,mps,gpu_mem
25- The end of the SLURM configuration is a list of nodes and partitions
26- A partition is simply a set of nodes
27- SLURM conflates the concepts of threads and processes, and also of cores and CPUs
28- So when setting ThreadsPerCore, CoresPerSocket, Sockets, etc. via lscpu as:
cat nodes.txt | while read node; do echo -e "NodeName=$node RealMemory=$(expr $(grep MemTotal /proc/meminfo | awk '{print $2}') / 1024) Sockets=$(lscpu | grep Socket\(s\) | awk '{print $2}') CoresPerSocket=$(lscpu | grep Core\(s\) | awk '{print $4}') ThreadsPerCore=$(lscpu | grep Thread\(s\) | awk '{print $4}') Gres=gpu:tesla_k80:no_consume:1,gpu_mem:11441 State=UNKNOWN\n"; done >> slurm.conf
29- This may end up with the right number of CPUs, but allocation will be messed up
30- One job consuming 1 CPU or Core will end up taking N CPUs, where N=ThreadsPerCore
31- To avoid these SLURM confusions, it may be better to just set Procs and avoid lscpu as:
cat nodes.txt | while read node; do echo -e "NodeName=$node RealMemory=$(expr $(grep MemTotal /proc/meminfo | awk '{print $2}') / 1024) Procs=$(nproc) Gres=gpu:tesla_k80:no_consume:1,gpu_mem:11441 State=UNKNOWN\n"; done >> slurm.conf
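As a sanity check for the generated NodeName lines, slurmd itself can print the hardware it detects on the local node in slurm.conf format:
slurmd -C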
32- Managing GPU resources in SLURM is also a bit tricky:
33- multiple jobs can use the same GPU by using the "no_consume" option in Gres within NodeName,
34- together with gpu_mem set to the total GPU memory in MB in the same option
35- GresTypes (above) must also specify mps (for multiple processes) and gpu_mem to control GPU memory
36- The SLURM gres.conf also needs to include gpu_mem with its count set to the total GPU memory (below)
37- After defining each node in the NodeName lines as above,
38- simply define one partition with all nodes via:
echo "PartitionName=all Nodes=$(cat nodes.txt | tr '\n' ',' | sed s/.$// -) Default=YES MaxTime=INFINITE State=UP" >> slurm.conf
39- SLURM gres.conf may be as simple as:
AutoDetect=nvml
Name=gpu_mem Count=11441
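Note that AutoDetect=nvml only takes effect if SLURM was built against the NVML library; if it was not, the GPU can be listed explicitly instead, e.g. for a single Tesla K80 at /dev/nvidia0:
Name=gpu Type=tesla_k80 File=/dev/nvidia0
Name=gpu_mem Count=11441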
40- After the above setup, the two SLURM daemons can be started:
nohup sudo slurmctld -D -vvvvvv &> /dev/null &
nohup sudo slurmd -D -vvvvvv &> /dev/null &
41- Check the log files under /var/log/slurm to make sure SLURM is working
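For example, the daemon log files defined above can be followed with:
sudo tail -f /var/log/slurm/slurmctld.log /var/log/slurm/slurmd.log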
42- Then to run SLURM jobs do:
srun --gres=<gpu_to_use> --mem-per-gpu=<gpu_mem> --output=<output_file> <exec_and_params>
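A concrete example, assuming the tesla_k80 and gpu_mem Gres names defined above and a hypothetical ./my_app binary:
srun --gres=gpu:tesla_k80:1,gpu_mem:2000 --mem-per-gpu=2000 --output=my_app.out ./my_app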
43- After running a SLURM job it can be checked using its job id via:
scontrol show job <job_id>
44- Stopping the SLURM daemons is not enough to finish all the jobs they have started,
45- so a job can be cancelled while SLURM is running via:
scancel <job_id>
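scancel also accepts filters, e.g. to cancel all jobs of the current user at once:
scancel -u $USER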
46- To see the queue of jobs do:
squeue
squeue -o %b
squeue -h -t R -O Gres
squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %23R %8C %23b"
47- To check information on all nodes do:
sinfo -o "%23N %10c %10m %20C %23G %10A"
scontrol -o show nodes
48- To check information on all jobs, sudo must be used because of /var/log permissions:
sudo sacct -X
sudo sacct -a -X --format=JobID,AllocCPUS,Reqgres
49- The following links may be useful in addition to the SLURM manual:
https://ulhpc-tutorials.readthedocs.io/en/latest/scheduling/advanced/
https://slurm.schedmd.com/SLUG19/GPU_Scheduling_and_Cons_Tres.pdf
http://www.hpcadvisorycouncil.com/events/2014/swiss-workshop/presos/Day_1/3_GPUs_SLURM.pdf