This is merely an introduction to get you started quickly. It does not replace Mila's official cluster documentation (docs.mila.quebec), which you should go through thoroughly before you start using the cluster.
Before starting, get your Mila cluster account ready. Your username should look something like `surname[:6]firstname[0]`, but this may vary. Install an `ssh` client on your computer. Then:
# generic login will send you to one of the 4 login nodes to spread the load
ssh user@login.server.mila.quebec -p 2222
# login-X, X in [1, 2, 3, 4]
ssh user@login-X.login.server.mila.quebec -p 2222
You can also use a config file (see https://mila-umontreal.slack.com/archives/C2TL9FRBP/p1581462695079500) so that you can simply run `ssh mila`.
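A minimal `~/.ssh/config` entry for this could look like the following sketch (the `mila` alias is just a name; replace the placeholder username with your own):

```
Host mila
    HostName login.server.mila.quebec
    User <your_username>
    Port 2222
```

With this in place, `ssh mila` is equivalent to the full command above.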
I also suggest looking into SFTP software such as FileZilla to send data back and forth between your computer and the cluster.
In order to allow many users to simultaneously manage their code and data and execute compute-heavy scripts (such as a training procedure), the cluster is split into two major components: login nodes (there are 4 of them), which are your entry point into the cluster, and compute nodes, which provide the CPUs, RAM and GPUs you'll need for your work and which you request from the login nodes.
After ssh'ing in, you'll land on a so-called login node (1, 2, 3 or 4). Don't ever run anything compute-heavy on it: you could bring the node down, preventing everyone else from accessing the cluster. This means no `python` scripts or anything like that, but also no `zip` on large folders and so on. Nothing that's computationally heavy. Rule of thumb: execution shouldn't take more than a few seconds.
Your computations should be done on a compute node, which you request through SLURM. This software manages requests and allocates resources to the cluster's hundreds of users.
- run `kinit` to make sure your "ticket" is up to date (docs link)
- run `sbatch job.sh` to put your job in the queue (docs link)
- run `srun <options>` to get an interactive session and play around with the compute node with a terminal in it (docs link)
- go to https://jupyterhub.server.mila.quebec to get notebooks (docs link)
Jobs are allocated according to their priority in the queue of jobs requested by all users. You can request one of 3 types of allocations:
- `unkillable` is the most stable: you only get 1 job on such a partition. Jobs requesting `-p unkillable` will not be preempted (i.e. stopped to give the GPU to another job with higher priority). You have the node for up to 24h, and you cannot request more than 4 CPUs, 48GB of RAM and 1 GPU.
- `main` is a high-priority job: it may be preempted, but it's very unlikely. You get 2 jobs on such a partition. You have the node for up to 48h, and the total CPUs requested by your `main` jobs cannot exceed 12.
- `long` is a low-priority job: it may be preempted, so you should checkpoint regularly. There are no limits on the resources you can request, but bear in mind that the more resources you request, the more likely they will be needed by a higher-priority job, and yours will be killed to release them.

If you do not specify `-p` (or, equivalently, `--partition`), the default is `long`.
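Concretely, the partition is just a flag passed to `sbatch` (a sketch; `job.sh` stands in for your own batch script):

```
sbatch -p unkillable job.sh   # never preempted: 1 job, up to 24h, <=4 CPUs, 48GB RAM, 1 GPU
sbatch -p main job.sh         # rarely preempted: 2 jobs, up to 48h, <=12 CPUs total
sbatch job.sh                 # no -p given: defaults to long (preemptible)
```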
You should use `anaconda`, which is available on the cluster through the `module` tool (docs link):
$ module load anaconda/3
$ source $CONDA_ACTIVATE
(base) $ conda create --name 'myenv' python=3.7.4
(base) $ conda activate myenv
(myenv) $ pip install -r requirements.txt
If you have CUDA issues, remember you can load the appropriate CUDA and cuDNN versions with `module` (and check that you're actually on a compute node!).
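For instance (the version strings below are assumptions for illustration; check what's actually installed with `module avail cuda`):

```
module load cuda/10.1
module load cudnn/7.6
```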
srun --gres=gpu:titanxp:1 --cpus-per-task=4 --mem=16GB --pty bash # starts an interactive job with a bash shell
`job.sh`:
#!/bin/bash
#SBATCH --cpus-per-task=4 # request cpus
#SBATCH --gres=gpu:titanxp:1 # request 1 titanxp
#SBATCH --mem=16GB # RAM memory, NOT gpu memory
#SBATCH -o /network/tmp1/<user>/slurm_outputs/slurm-%j.out # will write the prints etc. in this file where `%j` will be the job's id (like 328892)
#SBATCH -p main # partition
cd $SLURM_TMPDIR && cp path/to/data.zip . && unzip data.zip # best practice, may be too long depending on your data's size
module load anaconda/3 >/dev/null 2>&1
. "$CONDA_ACTIVATE"
conda activate myenv
echo "Starting job"
cd path/to/code
python train.py --data_path=$SLURM_TMPDIR
echo 'done'
`$SLURM_TMPDIR` is the path to a large, I/O-efficient temporary folder attached to the compute node for the current job only, so you should read and write from there, but remember to move anything you want to keep to a permanent storage space like `/network/tmp1` or `/miniscratch`.
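The pattern can be sketched end-to-end as follows (the `mktemp` stand-ins are assumptions so the sketch runs anywhere; on the cluster, `$SLURM_TMPDIR` is set for you and the destination would be `/network/tmp1/<user>` or `/miniscratch/<user>`):

```shell
# Stand-ins: on the cluster SLURM sets SLURM_TMPDIR for you, and the
# permanent space would be /network/tmp1/<user>; mktemp keeps this runnable locally.
SLURM_TMPDIR="${SLURM_TMPDIR:-$(mktemp -d)}"
PERMANENT="$(mktemp -d)"                               # stand-in for /network/tmp1/<user>

echo "model weights" > "$SLURM_TMPDIR/checkpoint.pt"   # pretend training wrote a checkpoint
cp "$SLURM_TMPDIR/checkpoint.pt" "$PERMANENT/"         # keep it: the job-local dir is wiped
ls "$PERMANENT"                                        # -> checkpoint.pt
```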
Before the job even starts, you need to have these steps done:
- generate an RSA key
- copy the public one into `~/.ssh/authorized_keys` (create that text file if it does not exist)
- check the permissions (use `chmod -R xxx` to change them):
  - `.ssh` should be `700`
  - your public keys `.ssh/*.pub` may be `644`
  - private keys `.ssh/*`, the authorized keys list `.ssh/authorized_keys` and the config file `.ssh/config` must be `600`

(for reference: https://mila-umontreal.slack.com/archives/CFAS8455H/p1581527219118200)
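The permission fixes boil down to a few `chmod` calls; here is a sketch run against a scratch directory so it's safe to try anywhere (on the cluster, `SSH_DIR` would be your real `$HOME/.ssh`):

```shell
# Scratch directory for illustration; on the cluster this would be "$HOME/.ssh".
SSH_DIR="$(mktemp -d)/.ssh"
mkdir -p "$SSH_DIR"
touch "$SSH_DIR/authorized_keys" "$SSH_DIR/config" "$SSH_DIR/id_rsa" "$SSH_DIR/id_rsa.pub"

chmod 700 "$SSH_DIR"                                                      # directory: owner only
chmod 600 "$SSH_DIR/id_rsa" "$SSH_DIR/authorized_keys" "$SSH_DIR/config"  # private files
chmod 644 "$SSH_DIR/id_rsa.pub"                                           # public key may be world-readable

stat -c '%a' "$SSH_DIR"   # -> 700
```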
Important read: https://docs.mila.quebec/mila-cluster/index.html#milacluster-storage
TL;DR:
- Only put code and packages in `$HOME` = `~` = `/network/home/<user>`
- Store large chunks of data (processed datasets, checkpoints etc.) in `/network/tmp1/<user>` or `/miniscratch/<user>`
- Do your computations in the temporary directory you get when allocated a compute node: `$SLURM_TMPDIR`
- `squeue` will list all currently enqueued (running, pending etc.) jobs on the cluster
- `squeue -u $USER` will list your currently enqueued jobs
- `savail` will list the currently available GPUs
- `scancel <job_id>` will kill a job (it has up to 32 seconds to terminate after the command)
- `scancel -u $USER` will kill all your currently enqueued jobs