Use apt to install the necessary packages:
sudo apt install -y slurm-wlm slurm-wlm-doc
Load file:///usr/share/doc/slurm-wlm/html/configurator.html in a browser (or file://wsl%24/Ubuntu/usr/share/doc/slurm-wlm/html/configurator.html on WSL2), and:
- Set your machine's hostname in `SlurmctldHost` and `NodeName`.
- Set `CPUs` as appropriate, and optionally `Sockets`, `CoresPerSocket`, and `ThreadsPerCore`. Use the command `lscpu` to find out what you have.
- Set `RealMemory` to the number of megabytes you want to allocate to Slurm jobs.
- Set `StateSaveLocation` to `/var/spool/slurm-llnl`.
- Set `ProctrackType` to `linuxproc`, because processes are less likely to escape Slurm control in a single-machine configuration.
- Make sure `SelectType` is set to `Cons_res`, and set `SelectTypeParameters` to `CR_Core_Memory`.
- Set `JobAcctGatherType` to `Linux` to gather resource usage per job, and set `AccountingStorageType` to `FileTxt`.
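With those choices, the relevant lines of the generated configuration should look something like the following sketch. The hostname, core counts, and memory size are illustrative values for a hypothetical 8-core, 32 GB machine — yours will differ, and the configurator writes the full plugin names for the options selected above:

```
SlurmctldHost=mybox
NodeName=mybox CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32000
StateSaveLocation=/var/spool/slurm-llnl
ProctrackType=proctrack/linuxproc
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/filetxt
```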
Hit `Submit`, and save the resulting text into `/etc/slurm-llnl/slurm.conf`, i.e. the configuration file referred to in `/lib/systemd/system/slurmctld.service` and `/lib/systemd/system/slurmd.service`.
Load `/etc/slurm-llnl/slurm.conf` in a text editor, uncomment `DefMemPerCPU`, and set it to `8192` or however many megabytes each job should be allocated when no amount is explicitly requested with `--mem` at job submission. Read the docs and edit other defaults as you see fit.
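For example, assuming a hypothetical job script `job.sh`, the two submission styles look like this:

```
# Relies on the DefMemPerCPU default (8192 MB per allocated CPU in this example):
sbatch job.sh

# Explicitly requests 2048 MB for the job, overriding the default:
sbatch --mem=2048 job.sh
```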
Create `/var/spool/slurm-llnl` and `/var/log/slurm_jobacct.log`, then set ownership appropriately:
sudo mkdir -p /var/spool/slurm-llnl
sudo touch /var/log/slurm_jobacct.log
sudo chown slurm:slurm /var/spool/slurm-llnl /var/log/slurm_jobacct.log
Install `mailutils` so that Slurm won't complain about `/bin/mail` being missing:
sudo apt install -y mailutils
Make sure munge is installed and running, and that a `munge.key` was created with owner-only read permissions, owned by `munge:munge`:
sudo service munge start
sudo ls -l /etc/munge/munge.key
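To check that munge itself works end-to-end, you can round-trip a credential through the encode and decode tools; if the key is usable, `unmunge` decodes the credential and reports success:

```
munge -n | unmunge
```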
Start the `slurmctld` and `slurmd` services:
sudo service slurmd start
sudo service slurmctld start
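Once both daemons are up, a quick sanity check is to list the partition state and push a trivial job through the scheduler:

```
sinfo            # the node should be listed, ideally in state "idle"
srun hostname    # runs hostname as a Slurm job on the local node
```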
Hi, and thank you for sharing! I have run into a somewhat similar issue. I changed the hostname and also set `ControlMachine` to the same exact name, and now I am running into the following issue (see below). I would appreciate any input on how to fix this one.
slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2021-09-30 12:02:02 EDT; 13s ago
Process: 26000 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Sep 30 12:01:11 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:21 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:31 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:41 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:51 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:52 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Start operation timed out. Terminating.
Sep 30 12:02:01 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: Failed to start Slurm node daemon.
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Unit entered failed state.
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Failed with result 'timeout'.