Mila Cluster Cheat Sheet

This is merely an introduction to get you started quickly. It does not replace Mila's official cluster documentation => docs.mila.quebec, which you should go through thoroughly before you start using the cluster.

Before starting: get your Mila cluster account ready. Your username should look something like surname[:6]firstname[0], but this may vary and is not a strict rule.


Connecting to Mila’s Cluster

Install an ssh client on your computer. Then:

# generic login will send you to one of the 4 login nodes to spread the load
ssh user@login.server.mila.quebec -p 2222
# login-X, X in [1, 2, 3, 4]
ssh user@login-X.login.server.mila.quebec -p 2222

You can also use an SSH config file (see https://mila-umontreal.slack.com/archives/C2TL9FRBP/p1581462695079500) so that you can simply run ssh mila, as sketched below.
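For instance, a minimal ~/.ssh/config entry could look like this (the Host alias mila and the <user> placeholder are illustrative, adapt them to your own account):

Host mila
    HostName login.server.mila.quebec
    User <user>
    Port 2222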

I also suggest looking into SFTP software such as FileZilla to send data back and forth between your computer and the cluster.

Cluster Usage

Overview

In order to allow many users to simultaneously manage their code and data and execute compute-heavy scripts (such as a training procedure), the cluster is split into two major components: login nodes (there are 4 of them), which are an entry point for you to access the cluster, and compute nodes, which provide the CPUs + RAM + GPUs you'll need for your work and which you request from the login nodes.

Login Nodes

After ssh-ing in, you’ll land on a so-called login node (1, 2, 3 or 4). Don’t ever run anything compute-heavy on it. You could bring the node down, preventing everyone else from accessing the cluster.

This means no python script or anything like that, but also no zip on large folders and so on. Nothing that’s computationally heavy. Rule of thumb: execution shouldn’t take more than a few seconds.

Compute Nodes

Your computations should be done on a compute node, which you request through SLURM. This software manages requests and allocates resources to the cluster's hundreds of users.

  1. run kinit to make sure your "ticket" is up to date (docs link)
  2. run sbatch job.sh to put your job in the queue (docs link)
  3. run srun <options> to get an interactive session, i.e. a terminal on a compute node to play around with (docs link)
  4. go to https://jupyterhub.server.mila.quebec to get notebooks (docs link)
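Putting steps 1 and 2 together, a minimal batch-submission flow could look like this (job.sh is the kind of script detailed in the Examples section below):

kinit                  # refresh your Kerberos ticket
sbatch job.sh          # submit the job to the SLURM queue
squeue -u $USER        # check that your job is pending or running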

Partitions

Jobs are allocated according to their priority in the queue of jobs requested by all users. You can request one of 3 types of allocations:

  1. unkillable is the most stable: you only get 1 job with such a partition. Jobs requesting -p unkillable will not be preempted (= stopped to give the GPU to another job with higher priority). You have this node for up to 24h. You cannot request more than 4 CPUs, 48GB of RAM and 1 GPU.
  2. main is a high-priority job: it may be preempted, but this is very unlikely. You get 2 jobs with such a partition. You have this node for up to 48h, and the total CPUs requested by your main jobs cannot exceed 12.
  3. long is a low-priority job: it may be preempted, so you should checkpoint regularly. There are no limits on the resources you can request, but bear in mind that the more resources you request, the more likely it is that they will be needed by a higher-priority job and that yours will be killed to release them.

If you do not specify -p (or, equivalently, --partition), then the default is long.
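For example, the partition is just another option passed to srun or sbatch (the resource values below are illustrative and stay within the unkillable limits):

srun -p unkillable --gres=gpu:1 --cpus-per-task=4 --mem=32GB --pty bash
sbatch -p main job.sh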

Python

You should use anaconda, which is available on the cluster through the module tool (docs link)

  1. $ module load anaconda/3
  2. $ source $CONDA_ACTIVATE
  3. (base) $ conda create --name 'myenv' python=3.7.4
  4. (base) $ conda activate myenv
  5. (myenv) $ pip install -r requirements.txt

If you have CUDA issues, remember you can load the appropriate CUDA and cuDNN versions with module (and check that you're actually on a compute node!)
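For instance, you can check and load what is available like this (the version number below is hypothetical, run module avail to see what the cluster actually provides):

module avail cuda      # list the CUDA modules installed on the cluster
module load cuda/10.1  # hypothetical version, pick one matching your framework
module avail cudnn     # same idea for cuDNN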

Examples
srun --gres=gpu:titanxp:1 --cpus-per-task=4 --mem=16GB --pty bash # starts an interactive job with a bash shell

job.sh:

#!/bin/bash
#SBATCH --cpus-per-task=4  # request cpus
#SBATCH --gres=gpu:titanxp:1  # request 1 titanxp
#SBATCH --mem=16GB  # RAM memory, NOT gpu memory
#SBATCH -o /network/tmp1/<user>/slurm_outputs/slurm-%j.out  # will write the prints etc. in this file where `%j` will be the job's id (like 328892)
#SBATCH -p main # partition

cd $SLURM_TMPDIR && cp path/to/data.zip . && unzip data.zip # best practice, may be too long depending on your data's size

module load anaconda/3 >/dev/null 2>&1
. "$CONDA_ACTIVATE"
conda activate myenv

echo "Starting job"

cd path/to/code

python train.py --data_path=$SLURM_TMPDIR

echo 'done'

$SLURM_TMPDIR is the path to a large, I/O-efficient temporary folder attached to the compute node for the current job only. You should read and write from there, but remember to move anything you want to keep to a permanent storage space like /network/tmp1 or /miniscratch.
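For example, you could add something like this at the end of job.sh to save your results before the node is released (the checkpoints folder and experiment name are purely illustrative):

cp -r $SLURM_TMPDIR/checkpoints /network/tmp1/<user>/my_experiment/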

Connecting to a running compute node

Before the job even starts, you need to have these steps done:

  1. generate an RSA key pair
  2. copy the public one into ~/.ssh/authorized_keys (create that text file if it does not exist)
  3. check the permissions (use chmod -R xxx to change them):
    1. .ssh should be 700
    2. your public keys .ssh/*.pub may be 644
    3. private keys .ssh/*, authorized keys list .ssh/authorized_keys and config file .ssh/config must be 600

(for reference: https://mila-umontreal.slack.com/archives/CFAS8455H/p1581527219118200)
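In practice, the steps above boil down to something like the following, run once on a login node (id_rsa is the default key name, adapt if you picked another):

ssh-keygen -t rsa                                # 1. generate the key pair (accept the default location)
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys  # 2. authorize your own public key
chmod 700 ~/.ssh                                 # 3. fix permissions
chmod 644 ~/.ssh/*.pub
chmod 600 ~/.ssh/id_rsa ~/.ssh/authorized_keys

Once a job is running, you should then be able to find its node with squeue -u $USER and ssh <node_name> to it from a login node.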

Storing data

Important read: https://docs.mila.quebec/mila-cluster/index.html#milacluster-storage

TL;DR:

  1. Only put code and packages in $HOME = ~ = /network/home/<user>
  2. Store large chunks of data (processed datasets, checkpoints etc.) in tmp1: /network/tmp1/<user> or /miniscratch/<user>
  3. Do your computations on the temporary directory that you get when allocated a compute node: $SLURM_TMPDIR

Job status

  • squeue will give you all currently enqueued (running, pending etc.) jobs on the cluster
    • squeue -u $USER will give you your currently enqueued jobs
  • savail will give you currently available GPUs
  • scancel <job_id> will kill a job (it has up to 32 seconds to terminate after the command)
    • scancel -u $USER will kill all your currently enqueued jobs
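For instance, to keep an eye on your jobs you can combine squeue with standard tools (the 30-second refresh interval is arbitrary):

watch -n 30 "squeue -u $USER"  # refresh your job list every 30 seconds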