@ckandoth
Last active May 19, 2024 07:49
Install Slurm 19.05 on a standalone machine running Ubuntu 20.04

Use apt to install the necessary packages:

sudo apt install -y slurm-wlm slurm-wlm-doc

Load file:///usr/share/doc/slurm-wlm/html/configurator.html in a browser (or file://wsl%24/Ubuntu/usr/share/doc/slurm-wlm/html/configurator.html on WSL2), and:

  1. Set your machine's hostname in SlurmctldHost and NodeName.
  2. Set CPUs as appropriate, and optionally Sockets, CoresPerSocket, and ThreadsPerCore. Run lscpu to see what your machine has.
  3. Set RealMemory to the number of megabytes you want to allocate to Slurm jobs.
  4. Set StateSaveLocation to /var/spool/slurm-llnl.
  5. Set ProctrackType to LinuxProc (written to slurm.conf as proctrack/linuxproc) because processes are less likely to escape Slurm control on a single-machine config.
  6. Make sure SelectType is set to Cons_res (select/cons_res in slurm.conf), and set SelectTypeParameters to CR_Core_Memory.
  7. Set JobAcctGatherType to Linux (jobacct_gather/linux in slurm.conf) to gather resource use per job, and set AccountingStorageType to FileTxt (accounting_storage/filetxt in slurm.conf).
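
The hardware values chosen in the steps above end up on a single NodeName line in slurm.conf. As a minimal sketch of what that line looks like, the helper below (make_node_line is a hypothetical name, not part of Slurm) formats one from values you read off lscpu; the numbers in the example call are placeholders.

```shell
# Hypothetical helper (not part of Slurm): format a NodeName line.
# Usage: make_node_line HOST CPUS SOCKETS CORES_PER_SOCKET THREADS_PER_CORE REAL_MEMORY_MB
make_node_line() {
    printf 'NodeName=%s CPUs=%s Sockets=%s CoresPerSocket=%s ThreadsPerCore=%s RealMemory=%s State=UNKNOWN\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values; CPUs is typically Sockets * CoresPerSocket * ThreadsPerCore.
make_node_line myhost 8 1 4 2 16000

# Once Slurm is installed, "slurmd -C" prints the hardware it detects in a
# similar format, which is a handy cross-check.
```
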

Hit Submit, and save the resulting text into /etc/slurm-llnl/slurm.conf, i.e. the configuration file referred to in /lib/systemd/system/slurmctld.service and /lib/systemd/system/slurmd.service.
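
A hostname in slurm.conf that does not match the machine is an easy mistake to make at this point. The sketch below is one way to check it; conf_value and check_ctld_host are hypothetical helper names, and the check assumes SlurmctldHost holds the plain short hostname.

```shell
# Hypothetical helper: print the value of KEY from a slurm.conf-style file
# (first uncommented "KEY=value" line wins).
# Usage: conf_value FILE KEY
conf_value() {
    sed -n "s/^$2=\([^ #]*\).*/\1/p" "$1" | head -n 1
}

# Hypothetical check: does SlurmctldHost match this machine's short hostname?
check_ctld_host() {
    want=$(hostname -s)
    have=$(conf_value "$1" SlurmctldHost)
    if [ "$have" = "$want" ]; then
        echo "SlurmctldHost matches hostname ($want)"
    else
        echo "SlurmctldHost is '$have' but hostname is '$want'" >&2
        return 1
    fi
}

# Usage (path from this guide):
#   check_ctld_host /etc/slurm-llnl/slurm.conf
```
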

Load /etc/slurm-llnl/slurm.conf in a text editor, uncomment DefMemPerCPU, and set it to 8192 or whatever number of megabytes per allocated CPU you want a job to get by default when it does not request memory explicitly using --mem during job submission. Read the docs and edit other defaults as you see fit.
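
If you would rather script that edit than open an editor, here is a minimal sketch. set_def_mem_per_cpu is a hypothetical helper name, and it assumes the file already contains a DefMemPerCPU line (commented or not); if the line is missing, the file is left unchanged.

```shell
# Hypothetical helper: uncomment DefMemPerCPU (if commented) and set it to MB.
# Usage: set_def_mem_per_cpu FILE MB
set_def_mem_per_cpu() {
    sed -E "s/^#?DefMemPerCPU=.*/DefMemPerCPU=$2/" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Usage: edit a copy, review it, then copy it back as root.
#   cp /etc/slurm-llnl/slurm.conf /tmp/slurm.conf
#   set_def_mem_per_cpu /tmp/slurm.conf 8192
#   sudo cp /tmp/slurm.conf /etc/slurm-llnl/slurm.conf
```
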

Create /var/spool/slurm-llnl and /var/log/slurm_jobacct.log, then set ownership appropriately:

sudo mkdir -p /var/spool/slurm-llnl
sudo touch /var/log/slurm_jobacct.log
sudo chown slurm:slurm /var/spool/slurm-llnl /var/log/slurm_jobacct.log

Install mailutils so that Slurm won't complain about /bin/mail missing:

sudo apt install -y mailutils

Make sure munge is installed and running, and a munge.key was created with user-only read-only permissions, owned by munge:munge:

sudo service munge start
sudo ls -l /etc/munge/munge.key
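
Instead of reading the ls output by eye, the same check can be scripted. This is a sketch assuming GNU stat (as on Ubuntu); key_mode_ok and key_owner_ok are hypothetical helper names, and mode 400 follows the "user-only read-only" description above.

```shell
# Hypothetical helpers: check a key file's mode and ownership with GNU stat.
key_mode_ok()  { [ "$(stat -c '%a' "$1")" = "400" ]; }
key_owner_ok() { [ "$(stat -c '%U:%G' "$1")" = "munge:munge" ]; }

# Usage (run in a root shell if /etc/munge is not accessible to your user):
#   key_mode_ok /etc/munge/munge.key && key_owner_ok /etc/munge/munge.key && echo "key looks fine"
# munge also ships its own round-trip test:
#   munge -n | unmunge
```
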

Start services slurmctld and slurmd:

sudo service slurmd start
sudo service slurmctld start
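
The daemons can take a few seconds to find each other after starting. A minimal sketch for waiting and then running a smoke test follows; retry_until is a hypothetical helper name, and the usage line assumes sinfo exits non-zero while the controller is unreachable.

```shell
# Hypothetical helper: run a command up to TRIES times, DELAY seconds apart,
# until it succeeds. Returns 0 on the first success, 1 if every try fails.
# Usage: retry_until TRIES DELAY CMD [ARGS...]
retry_until() {
    tries=$1; delay=$2; shift 2
    i=1
    while [ "$i" -le "$tries" ]; do
        "$@" >/dev/null 2>&1 && return 0
        i=$((i + 1))
        [ "$i" -le "$tries" ] && sleep "$delay"
    done
    return 1
}

# Usage after starting the services: wait for the controller, then run a job.
#   retry_until 10 2 sinfo && sinfo && srun hostname
```
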
@gangadharsingh056

When I run "apt install munge" I get the error "Errors were encountered while processing:
postfix
E: Sub-process /usr/bin/dpkg returned an error code (1)".
The full output and the status of slurmd.service are below:
root@:/etc/slurm-llnl# sudo apt install munge
Reading package lists... Done
Building dependency tree
Reading state information... Done
munge is already the newest version (0.5.13-2build1).
munge set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 52 not upgraded.
1 not fully installed or removed.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] Y
Setting up postfix (3.4.13-0ubuntu1.2) ...

Postfix (main.cf) configuration was not changed. If you need to make changes,
edit /etc/postfix/main.cf (and others) as needed. To view Postfix
configuration values, see postconf(1).

After modifying main.cf, be sure to run 'systemctl reload postfix'.

Running newaliases
newaliases: warning: valid_hostname: misplaced hyphen: gpunode1-wlp0s20f3.--
newaliases: fatal: file /etc/postfix/main.cf: parameter myhostname: bad parameter value: gpunode1-wlp0s20f3.--
dpkg: error processing package postfix (--configure):
installed postfix package post-installation script subprocess returned error exit status 75
Processing triggers for libc-bin (2.31-0ubuntu9.2) ...
Errors were encountered while processing:
postfix
E: Sub-process /usr/bin/dpkg returned an error code (1)

root@gpunode1:/etc/slurm-llnl# systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Fri 2021-10-01 11:12:34 IST; 3min 49s ago
Docs: man:slurmd(8)
Main PID: 26727 (slurmd)
Tasks: 2
Memory: 2.9M
CGroup: /system.slice/slurmd.service
└─26727 /usr/sbin/slurmd

Oct 01 11:16:09 gpunode1 slurmd-gpunode1[26727]: error: Unable to register: Resource temporarily unavailable
Oct 01 11:16:10 gpunode1 slurmd-gpunode1[26727]: error: Unable to resolve "linuxK": Host name lookup failure
Oct 01 11:16:10 gpunode1 slurmd-gpunode1[26727]: error: Unable to establish control machine address
Oct 01 11:16:10 gpunode1 slurmd-gpunode1[26727]: error: Unable to register: Resource temporarily unavailable
Oct 01 11:16:12 gpunode1 slurmd-gpunode1[26727]: error: Unable to resolve "linuxK": Host name lookup failure
Oct 01 11:16:12 gpunode1 slurmd-gpunode1[26727]: error: Unable to establish control machine address
Oct 01 11:16:12 gpunode1 slurmd-gpunode1[26727]: error: Unable to register: Resource temporarily unavailable
Oct 01 11:16:13 gpunode1 slurmd-gpunode1[26727]: error: Unable to resolve "linuxK": Host name lookup failure
Oct 01 11:16:13 gpunode1 slurmd-gpunode1[26727]: error: Unable to establish control machine address
Oct 01 11:16:13 gpunode1 slurmd-gpunode1[26727]: error: Unable to register: Resource temporarily unavailable

Any suggestions on how to resolve it?

@SmallPackage

SmallPackage commented Oct 14, 2021

Hi and thank you for sharing! I have run into a somewhat similar issue. I changed the hostname and also ControlMachine to the exact same name, and now I am running into the following issue (see below). I'd appreciate any input on how to fix this one.

slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Thu 2021-09-30 12:02:02 EDT; 13s ago
Process: 26000 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)

Sep 30 12:01:11 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:21 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:31 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:41 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:51 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:01:52 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Start operation timed out. Terminating.
Sep 30 12:02:01 quanzeng-PowerEdge-T420 slurmd[26002]: error: Unable to register: Unable to contact slurm controller (connect failure)
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: Failed to start Slurm node daemon.
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Unit entered failed state.
Sep 30 12:02:02 quanzeng-PowerEdge-T420 systemd[1]: slurmd.service: Failed with result 'timeout'.

@hmamine I had this issue, and using configurator.html instead of configurator.easy.html solved it.

@Lihua1990

Sorry, I'm not an expert on this. After a reboot, I restarted the services and I have this problem too:

"slurm_load_partitions: Unable to contact slurm controller (connect failure)"

No idea why, and it is still not solved.

@frankliuao

Thanks for kindly sharing.
I had some trouble working with the instructions you provided but couldn't find similar issues elsewhere. My problem was that whenever I tried to start the slurmd service, it gave me an error of

(base) frankliuao@Lorentz:~$ sudo service slurmd start
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.

Then I figured out from their website that the actual log of Slurm is being stored in

/var/log/slurm

So I read the errors

error: cannot find proctrack plugin for linuxproc
and realized that ProctrackType should be set to proctrack/linuxproc, instead of just linuxproc.

The same goes for JobAcctGatherType: not Linux, but jobacct_gather/linux; see the official doc.

After that, the service could be started.

@frankliuao

To add to my comment above, one also needs to change AccountingStorageType to accounting_storage/filetxt instead of filetxt. The recommended value accounting_storage/slurmdbd didn't work because it relies on slurmdbd service, which I wasn't able to get running in a short time.

@v-iashin

v-iashin commented Apr 14, 2022

To add GPUs to your cluster, do the following (I assume your machine has NVIDIA drivers installed):

  1. Open /etc/slurm-llnl/slurm.conf. Uncomment #GresTypes= and change it to GresTypes=gpu, then add Gres=gpu:3 to the compute node specification at the bottom, e.g.:
NodeName=<your node name> CPUs=XX RealMemory=XXXXX Gres=gpu:3 Sockets=X CoresPerSocket=XX ThreadsPerCore=X State=XXXXXXX
  2. Create /etc/slurm-llnl/gres.conf and add:
# e.g. for 3 GPUs
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia0
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia1
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia2
  3. Restart the cluster: sudo service slurmd restart && sudo service slurmctld restart
  4. Test that a job allocates the GPUs: srun -N 1 --gres=gpu:2080ti:2 env | grep CUDA should print CUDA_VISIBLE_DEVICES=0,1
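
For machines with a different number of GPUs, the gres.conf lines can be generated rather than typed. A minimal sketch: gres_lines is a hypothetical helper name, it writes each entry as Key=Value pairs, and it assumes the device files are numbered /dev/nvidia0 upward and that all GPUs are the same type.

```shell
# Hypothetical helper: print gres.conf lines for COUNT GPUs of one type,
# assuming device files /dev/nvidia0 .. /dev/nvidia(COUNT-1).
# Usage: gres_lines NODE TYPE COUNT
gres_lines() {
    i=0
    while [ "$i" -lt "$3" ]; do
        printf 'NodeName=%s Name=gpu Type=%s File=/dev/nvidia%s\n' "$1" "$2" "$i"
        i=$((i + 1))
    done
}

# Usage: review the output first, then write it to the path used above.
#   gres_lines "$(hostname -s)" 2080ti 3
#   gres_lines "$(hostname -s)" 2080ti 3 | sudo tee /etc/slurm-llnl/gres.conf
```
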

@sandeep1143

sandeep1143 commented Apr 30, 2022

Hi,
Thanks, this worked, but when I check the status I get this:

sandeep@sandeep-VirtualBox:~$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2022-04-30 22:50:47 IST; 9min ago
Docs: man:slurmd(8)
Process: 12931 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 12933 (slurmd)
Tasks: 1
Memory: 4.1M
CGroup: /system.slice/slurmd.service
└─12933 /usr/sbin/slurmd

Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: Starting Slurm node daemon...
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: Node reconfigured socket/core boundaries SocketsPerBoard=4:1(hw) CoresPe>
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: Message aggregation disabled
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: CPU frequency setting not configured for this node
Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: slurmd version 19.05.5 started
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: slurmd started on Sat, 30 Apr 2022 22:50:47 +0530
Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: Started Slurm node daemon.
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=1 Memory=7951 TmpDisk=82909 Up>
Apr 30 22:50:56 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: error: Unable to register: Unable to contact slurm controller (connect f>
lines 1-21/21 (END)

At the end I get a "Can't open PID file" error and "Unable to contact slurm controller". Can you help with what to do about this?
