@ckandoth
Last active May 19, 2024 07:49
Install Slurm 19.05 on a standalone machine running Ubuntu 20.04

Use apt to install the necessary packages:

sudo apt install -y slurm-wlm slurm-wlm-doc

Load file:///usr/share/doc/slurm-wlm/html/configurator.html in a browser (or file://wsl%24/Ubuntu/usr/share/doc/slurm-wlm/html/configurator.html on WSL2), and:

  1. Set your machine's hostname in SlurmctldHost and NodeName.
  2. Set CPUs as appropriate, and optionally Sockets, CoresPerSocket, and ThreadsPerCore. Use the lscpu command to find what you have (see the example after this list).
  3. Set RealMemory to the number of megabytes you want to allocate to Slurm jobs.
  4. Set StateSaveLocation to /var/spool/slurm-llnl.
  5. Set ProctrackType to linuxproc, because processes are less likely to escape Slurm's control on a single-machine config.
  6. Make sure SelectType is set to cons_res, and set SelectTypeParameters to CR_Core_Memory.
  7. Set JobAcctGatherType to Linux to gather resource usage per job, and set AccountingStorageType to FileTxt.
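
For step 2, something like the following pulls out just the relevant fields from lscpu (the exact values will of course differ per machine):

lscpu | grep -E 'CPU\(s\)|Thread|Core|Socket'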

Hit Submit, and save the resulting text into /etc/slurm-llnl/slurm.conf, i.e. the configuration file referenced by /lib/systemd/system/slurmctld.service and /lib/systemd/system/slurmd.service.

Load /etc/slurm-llnl/slurm.conf in a text editor, uncomment DefMemPerCPU, and set it to 8192, or whatever number of megabytes each job should be allocated by default when --mem is not specified at submission. Read the docs and edit other defaults as you see fit.
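
For reference, a minimal sketch of how the relevant lines might look after the steps above; the hostname, CPU counts, and memory values below are placeholders, not recommendations:

# assumes a hostname of mymachine, 8 logical CPUs, and ~32 GB reserved for Slurm jobs
SlurmctldHost=mymachine
ProctrackType=proctrack/linuxproc
StateSaveLocation=/var/spool/slurm-llnl
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/filetxt
DefMemPerCPU=8192
NodeName=mymachine CPUs=8 Sockets=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=32000 State=UNKNOWN
PartitionName=debug Nodes=mymachine Default=YES MaxTime=INFINITE State=UP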

Create /var/spool/slurm-llnl and /var/log/slurm_jobacct.log, then set ownership appropriately:

sudo mkdir -p /var/spool/slurm-llnl
sudo touch /var/log/slurm_jobacct.log
sudo chown slurm:slurm /var/spool/slurm-llnl /var/log/slurm_jobacct.log

Install mailutils so that Slurm won't complain about /bin/mail missing:

sudo apt install -y mailutils

Make sure munge is installed and running, and that a munge.key was created with read-only permissions for the munge user only, owned by munge:munge:

sudo service munge start
sudo ls -l /etc/munge/munge.key
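
If the ownership or permissions are off, something along these lines should fix it (paths per the Ubuntu munge package):

sudo chown munge:munge /etc/munge/munge.key
sudo chmod 400 /etc/munge/munge.key
sudo service munge restart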

Start services slurmctld and slurmd:

sudo service slurmd start
sudo service slurmctld start
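
To sanity-check the setup, the following should show the node in an idle state and run a trivial job on it:

sinfo
srun -N1 hostname
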
@frankliuao

To add to my comment above, one also needs to change AccountingStorageType to accounting_storage/filetxt instead of filetxt. The recommended value accounting_storage/slurmdbd didn't work because it relies on the slurmdbd service, which I wasn't able to get running in a short time.
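
Concretely, the accounting-related lines in /etc/slurm-llnl/slurm.conf would then look something like this (pointing AccountingStorageLoc at the log file created earlier is an assumption, not something the configurator fills in):

JobAcctGatherType=jobacct_gather/linux
AccountingStorageType=accounting_storage/filetxt
AccountingStorageLoc=/var/log/slurm_jobacct.log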

v-iashin commented Apr 14, 2022

To add GPUs to your cluster, do the following (I assume your machine has NVIDIA drivers installed):

  1. Open /etc/slurm-llnl/slurm.conf. Uncomment and change #GresTypes= to GresTypes=gpu, and add Gres=gpu:3 to the compute node specification at the bottom, e.g.:
NodeName=<your node name> CPUs=XX RealMemory=XXXXX Gres=gpu:3 Sockets=X CoresPerSocket=XX ThreadsPerCore=X State=XXXXXXX
  2. Create /etc/slurm-llnl/gres.conf and add:
# e.g. for 3 GPUs
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia0
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia1
NodeName=<your node name> Name=gpu Type=2080ti File=/dev/nvidia2
  3. Restart the cluster: sudo service slurmd restart && sudo service slurmctld restart
  4. Test that a job allocates the GPUs: srun -N 1 --gres=gpu:2080ti:2 env | grep CUDA should print CUDA_VISIBLE_DEVICES=0,1 (a batch-script version is sketched after this list).
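
For batch jobs, a minimal submission script along these lines should allocate a GPU the same way (the script name and resource numbers are placeholders):

#!/bin/bash
#SBATCH --job-name=gpu-test
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
nvidia-smi

Submit it with sbatch gpu-test.sh and check the output file it writes (slurm-<jobid>.out by default).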

sandeep1143 commented Apr 30, 2022

Hi,
Thanks, this has worked successfully, but when I check the status:

sandeep@sandeep-VirtualBox:~$ systemctl status slurmd.service
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2022-04-30 22:50:47 IST; 9min ago
Docs: man:slurmd(8)
Process: 12931 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
Main PID: 12933 (slurmd)
Tasks: 1
Memory: 4.1M
CGroup: /system.slice/slurmd.service
└─12933 /usr/sbin/slurmd

Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: Starting Slurm node daemon...
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: Node reconfigured socket/core boundaries SocketsPerBoard=4:1(hw) CoresPe>
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: Message aggregation disabled
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12931]: CPU frequency setting not configured for this node
Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: slurmd.service: Can't open PID file /run/slurmd.pid (yet?) after start: Operation not permitted
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: slurmd version 19.05.5 started
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: slurmd started on Sat, 30 Apr 2022 22:50:47 +0530
Apr 30 22:50:47 sandeep-VirtualBox systemd[1]: Started Slurm node daemon.
Apr 30 22:50:47 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: CPUs=4 Boards=1 Sockets=1 Cores=4 Threads=1 Memory=7951 TmpDisk=82909 Up>
Apr 30 22:50:56 sandeep-VirtualBox slurmd-sandeep-VirtualBox[12933]: error: Unable to register: Unable to contact slurm controller (connect f>

At the end I got a "can't open PID file" error and "Unable to contact slurm controller". Can you help me figure out what to do about this?
