
Setting up a GPU server with scheduling and containers

Our group recently acquired a new server for deep learning: a SuperMicro 4029GP-TRT2 stuffed with 8x NVIDIA RTX 2080 Ti. It may be a bit overpowered for now, but with upcoming networks like BigGAN and fully 3D networks, as well as students joining our group, this machine will see a lot of use in the future.

One challenge is how to manage these GPUs. There are many approaches, but given that most PhD candidates aren't sysadmins, they range from a 'free-for-all', which leads to one person hogging all GPUs for weeks due to a bug in their code, to Excel sheets that no one understands and no one adheres to because changing GPU IDs in code is hard. This leads to a lot of frustration, low productivity and under-utilisation of these expensive servers. Another issue is conflicting software versions. TensorFlow and Keras, for example, tend to make breaking API changes every now and then. As these always happen right before a conference deadline, this leads to even more frustration when trying to run a few extra experiments.

A group at a previous affiliation of mine had the same problems and used Docker containers with a job scheduler to mitigate most of them. Unfortunately I had never used that setup myself and was thus not familiar with the exact details of their implementation. Still, the approach solves most of our problems: no conflicting software versions (just roll a container per research paper and archive it), no competing for GPUs and, most importantly, people can't accidentally screw up colleagues' experiments.

There was one more constraint I had: the system should be as easy to use as possible. When I talked about 'job scheduling' and 'GPU allocation' to my colleagues, the reaction I got was that they were scared it'd be either too complicated or too restrictive to use. As I really didn't want to go the 'Google Sheets' route for GPU scheduling, I kept this in mind while designing the system. Another constraint was set by our sysadmin: no root for users, as we have some legacy NFSv3 file servers that authenticate at the UID/GID level. This immediately excluded Docker, as described here. Since users would be allowed to control the Docker daemon, it's pretty much the same as putting everyone in the sudoers group. Not something we want, to be honest.

In the end I decided to use a combination of Singularity and SLURM. Singularity is a container tool created for and widely used by HPC facilities. SLURM is an industry-standard job scheduler, also used on many HPC clusters. Because both tools are so widely deployed, they are well documented, which is always helpful when running into problems. To enforce proper usage of these tools, control groups (cgroups) are used to lock access down: by default, users have no GPU permissions.

As most tutorials I found were quite outdated, here's a new one. It's in typical 'follow-along' style, so you can copy/paste the commands into your own terminal and end up with a similar system. Root access is required, obviously.

Note: we are running a new, fresh install of Ubuntu 18.04 LTS.

Installing Singularity

As most Debian packages for Singularity are quite outdated, we'll compile it ourselves. It's written in Go, so we'll also install a recent Go version.

First, install some standard packages for compiling stuff.

$ sudo apt-get update && \
    sudo apt-get install -y \
    python \
    git \
    dh-autoreconf \
    build-essential \
    libarchive-dev \
    libssl-dev \
    uuid-dev \
    libgpgme11-dev \
    squashfs-tools
$ wget https://dl.google.com/go/go1.12.6.linux-amd64.tar.gz
$ sudo tar -xvf go1.12.6.linux-amd64.tar.gz
$ sudo mv go /usr/local
$ /usr/local/go/bin/go version

To make sure the GOPATH is set for everyone, I created a new script in /etc/profile.d:

$ sudo nano /etc/profile.d/dl_paths.sh

And the script:

GOROOT="/usr/local/go"

export GOROOT=${GOROOT}
export GOPATH=$HOME/go
export PATH=$GOROOT/bin:$PATH

To test this, log out and log back in again (or just reboot), then run:

export
go env

These should print your environment (with the Go paths set) and Go's configuration.

The next step is compiling Singularity itself. First get dep, then Singularity. Obviously change v3.5.2 to any later version if you want; take a look at their GitHub tags for more info.

go get -u github.com/golang/dep/cmd/dep
go get -d github.com/sylabs/singularity
cd $GOPATH/src/github.com/sylabs/singularity
git checkout v3.5.2

It'll complain a bit about no Go files being there, but it still does its job. Now it's time to compile, which will take a few minutes:

./mconfig
make -j10 -C builddir
sudo make -C ./builddir install

You should be done now! Let's test it:

singularity version

And the output should be 3.5.2 or the version you picked before.

We're going to make a few changes to the default configuration, mainly to make it easier for our users. We'll add a few bind points and change a few defaults to make the containers as transparent as possible.

$ sudo nano /usr/local/etc/singularity/singularity.conf

First, to bind the NVIDIA binaries and libraries into every container, change always use nv = no to yes. This doesn't really have any downsides; it just saves you from typing --nv every time.
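
After that change, the relevant line in the config reads:

# in /usr/local/etc/singularity/singularity.conf
always use nv = yes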

Second, we add a few bind paths. These are specific to our setup, though /run/user is useful for anyone running a systemd-based distribution like Ubuntu or Debian. I added these below the standard bind paths; you'll find that spot easily in the config file.

# For temporary files
bind path = /run/user
# Mounts to data
bind path = /raid

And finally a test run (this might take a while, as the container is HUGE):

$ cd ~
$ singularity exec docker://nvcr.io/nvidia/pytorch:19.05-py3 jupyter notebook
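
If you don't want to pull that image from the registry on every run, you can optionally cache it as a local SIF file first and run from that. This is an extra convenience, not part of the setup above, and the filename pytorch_19.05.sif is just an arbitrary example:

$ singularity pull pytorch_19.05.sif docker://nvcr.io/nvidia/pytorch:19.05-py3
$ singularity exec pytorch_19.05.sif python -c "import torch; print(torch.cuda.is_available())"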

SLURM

For GPU scheduling, we use SLURM. Unfortunately the packages in Ubuntu and Debian are a bit too outdated, so we'll compile our own version. First install some dependencies; note that we install the cgroup tools right away.

sudo apt-get install build-essential ruby-dev libpam0g-dev libmysqlclient-dev munge libmunge-dev libmysqld-dev cgroup-bin libpam-cgroup cgroup-tools

Then download, extract and compile. My machine has many cores, so we'll use some multi-threading in make. Depending on your computer, you might have enough time to grab a coffee.

wget https://download.schedmd.com/slurm/slurm-19.05.0.tar.bz2
tar -xaf slurm-19.05.0.tar.bz2 
cd slurm-19.05.0/
./configure --sysconfdir=/etc/slurm --enable-pam --localstatedir=/var --with-munge --with-ssl
make -j10
sudo make install

Log out and back in, then check that it actually does something.

srun

You'll get an error about the configuration file not existing; that's expected at this point.

Now start and enable munge.

sudo systemctl enable munge
sudo systemctl start munge
sudo systemctl status munge
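
To verify that munge itself works, you can optionally encode and decode a credential locally (both tools come with the munge packages installed above); it should report success:

munge -n | unmunge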

Copy the systemd unit files, create a user for SLURM, and enable the services.

cd ~/slurm-19.05.0/etc
sudo cp *.service /lib/systemd/system/
sudo adduser --system --no-create-home --group slurm
sudo systemctl enable slurmd
sudo systemctl enable slurmctld
sudo systemctl enable slurmdbd

We can't start them yet, because there is no slurm.conf file. There is a generator to create one, but I'll drop my own slurm.conf below.

We also need MySQL for accounting. It isn't the most desirable application to install (for security reasons), but nowadays the defaults of MySQL 5.7 on Ubuntu 18.04 are pretty sane (no more guest access, no empty root password).

sudo DEBIAN_FRONTEND=noninteractive apt-get install -y mysql-server pwgen

Use pwgen to generate two passwords: one for the MySQL root user and one for the slurm user.

pwgen 16 2

Write them down or store them somewhere safe. Now open a MySQL shell (on Ubuntu 18.04 the root account uses socket authentication, hence sudo):

sudo mysql

Then run these commands in the shell, replacing your_secure_password with one of the passwords generated by pwgen above:

create user 'slurm'@'localhost';
set password for 'slurm'@'localhost' = 'your_secure_password';
grant usage on *.* to 'slurm'@'localhost';
create database slurm_acct_db;
grant all privileges on slurm_acct_db.* to 'slurm'@'localhost';
flush privileges;
exit
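
To double-check that the database user works (an optional check of my own, not part of the original write-up), log in as slurm with the password you just set; slurm_acct_db should appear in the list:

mysql -u slurm -p -e 'show databases;'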

Now it's time for the configuration files. There are two:

  1. slurmdbd.conf, which configures the database daemon
  2. slurm.conf, which is the generic SLURM configuration

I'll start with slurmdbd.conf and just copy-paste it here. Put both files in /etc/slurm/. Don't forget to replace the password!

# SLURMDB config file
#  Created by Mark Janse 2019-06-18
# logging level
ArchiveEvents=no
ArchiveJobs=yes
ArchiveSteps=no
ArchiveSuspend=no

# service
DbdHost=localhost
SlurmUser=slurm
AuthType=auth/munge

# logging; remove this to use syslog
LogFile=/var/log/slurm-llnl/slurmdbd.log

# database backend
StoragePass=your_secure_password
StorageUser=slurm
StorageType=accounting_storage/mysql
StorageLoc=slurm_acct_db

And here's the slurm.conf. I'll assume the hostname turing for the main machine. The name of the cluster is bip-cluster, but that isn't really important. At the bottom I also define the node: ours has 8 GPUs, 2 CPUs, 10 cores per CPU and 2 threads per core. Change these to your own liking.

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#

# Set your hostname here!
SlurmctldHost=turing
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm
SwitchType=switch/none
TaskPlugin=task/cgroup
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
SlurmctldTimeout=600
SlurmdTimeout=600
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageEnforce=associations
ClusterName=bip-cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
#SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
#
#
# COMPUTE NODES
# NodeName=turing Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN

# Partitions
GresTypes=gpu
NodeName=turing Gres=gpu:8 Sockets=2 CoresPerSocket=10 ThreadsPerCore=2 State=UNKNOWN
PartitionName=tu102 Nodes=turing Default=YES MaxTime=96:00:00 MaxNodes=2 DefCpuPerGPU=5 State=UP

For GPU scheduling you also need a gres.conf. This differs per machine if the machines have different numbers of GPUs. In our case, there is only one machine, with 8 GPUs.

#  Defines all 8 GPUs on Turing
Name=gpu File=/dev/nvidia[0-7]
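
With both configuration files and gres.conf in place, you should be able to start the daemons and submit a first test job. The commands below are my own sanity check rather than part of the original write-up; note that the log and spool directories referenced in the configs (/var/log/slurm-llnl, /var/log/slurm, /var/spool/slurm) don't exist on a fresh install, so they are created first and handed to the slurm user where its daemons write to them:

sudo mkdir -p /var/log/slurm-llnl /var/log/slurm /var/spool/slurm /var/spool/slurmd
sudo chown slurm:slurm /var/log/slurm-llnl /var/log/slurm /var/spool/slurm
sudo systemctl start slurmdbd slurmctld slurmd
sinfo                          # the node 'turing' should show up in partition tu102
srun --gres=gpu:1 nvidia-smi   # run a quick test job on a single GPU

If accounting enforcement complains about missing associations, the cluster and users may first have to be registered in the accounting database with sacctmgr (for example, sacctmgr add cluster bip-cluster).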

Restricting unauthorized GPU access

Previously, we installed several tools to make cgroups work; now we're going to put them to use. First we create the file cgconfig.conf (see below for the contents). We create a group nogpu for processes without GPU access, and a group gpu for processes that are allowed to access the GPUs.

Location of the file is /etc/cgconfig.conf

# Below restricts access to NVIDIA devices for all users in this cgroup
#  Character device major 195 is the number registered to the NVIDIA driver in the kernel's device list

group nogpu {
    devices {
        devices.deny = "c 195:* rwm";
    }
}

# Opposite of above, just to be sure
group gpu {
    devices {
        devices.allow = "c 195:* rwm";
    }
}


For admin tasks, you might want to create a user group that always has GPU access.

sudo groupadd gpu
sudo usermod -aG gpu mark

I'd advise adding every user with root access to this group for administration tasks. Do not add any regular users to it, or it will defeat the purpose of the scheduling system, as they'll always have unrestricted GPU access.

To load these cgroups every time the system boots, we'll run cgconfigparser on boot. Let's create a small systemd unit to do this:

sudo nano /lib/systemd/system/cgconfigparser.service

And copy-paste below file in there:

[Unit]
Description=cgroup config parser
After=network.target

[Service]
User=root
Group=root
ExecStart=/usr/sbin/cgconfigparser -l /etc/cgconfig.conf
Type=oneshot

[Install]
WantedBy=multi-user.target

Then enable it with sudo systemctl enable cgconfigparser.service.

It will now run on every boot, so reboot the system.
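
After the reboot, you can optionally check that both cgroups exist; lscgroup is part of the cgroup-tools package we installed earlier:

lscgroup | grep gpu
# should list devices:/gpu and devices:/nogpu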

To move user processes into the right group, we edit /etc/pam.d/common-session. Add the line below to the bottom of the file:

session optional        pam_cgroup.so

The PAM module reads /etc/cgrules.conf, so create that file as well. Mine is below:

# /etc/cgrules.conf
#The format of this file is described in cgrules.conf(5)
#manual page.
#
# Example:
#<user>         <controllers>   <destination>
#@student       cpu,memory      usergroup/student/
#peter          cpu             test1/
#%              memory          test2/
# End of file

root            devices         /
user            devices         /
@gpu            devices         /gpu
*               devices         /nogpu

This puts all users in the group gpu into the cgroup with GPU access, and everyone else into the one without. Exactly what we want.

Now reboot for the final time and you're done!
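
As a final sanity check (again my own, not part of the original write-up): a regular user who is not in the gpu group should be unable to touch the GPUs directly, but should still get them through the scheduler:

nvidia-smi                     # run directly as a regular user: should fail with a device access error
srun --gres=gpu:1 nvidia-smi   # run through SLURM: should succeed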

Post-mortem

This system has been up and running for around a year now, and it works nearly flawlessly: there have been only two short outages. One was caused by a time-out of the SLURM daemon, which for some reason killed all running jobs (new jobs were fine). This is now mitigated by setting the time-outs a bit less tight. For the other one, we have no clue what happened: it was a total hardware lockup, and even the physical console didn't respond. A quick physical reboot later, everything was up and running again like before!
