Reference: https://supercloud.mit.edu/
- SuperCloud: research computing cluster for MIT Lincoln Laboratory (224 GPU nodes with 2 NVIDIA Volta V100 GPUs each, and 480 CPU nodes)
- Account: fill out the request form and complete the HPC course for full access
- Help: see https://supercloud.mit.edu/getting-help or email supercloud@mit.edu
- Office hours: Fridays 2-3pm EST (https://mit.zoom.us/j/173993735)
- Monthly downtimes: Third Thursday of each month
Reference: https://supercloud.mit.edu/best-practices-and-performance-tips
- User directory:
/home/gridsan/USERNAME
- Aim for fewer than ~1000 files per directory
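
To check a directory tree against the ~1000-files guideline above, something like the following could be used. This is only a sketch: the function name big_dirs and its arguments are made up for illustration, it counts direct children (files plus subdirectories) rather than whole subtrees, and it assumes a find that supports -mindepth/-maxdepth (GNU and BSD find both do). File names containing newlines would be miscounted.

```shell
#!/bin/bash
# List directories under ROOT that hold more than LIMIT direct entries.
# Usage: big_dirs ROOT [LIMIT]   (LIMIT defaults to 1000; hypothetical helper)
big_dirs() {
    local root="${1:-.}"
    local limit="${2:-1000}"
    local dir count
    find "$root" -type d -print0 | while IFS= read -r -d '' dir; do
        # Count direct children only, not the whole subtree.
        count=$(find "$dir" -mindepth 1 -maxdepth 1 | wc -l)
        count=$((count))   # normalize any whitespace padding from wc
        if [ "$count" -gt "$limit" ]; then
            printf '%s\t%s\n' "$count" "$dir"
        fi
    done
}
```

For example, big_dirs /home/gridsan/USERNAME prints one line per oversized directory as "count<TAB>path", and prints nothing if every directory is within the limit.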
Reference: https://supercloud.mit.edu/software-and-package-management
Reference: https://supercloud.mit.edu/submitting-jobs#interactive
- Log in to the SuperCloud login node:
ssh USERNAME@txe1-login.mit.edu
- Submit an interactive job (with one GPU) using:
LLsub -i -s 20 -g volta:1
- Once a node is available, you are logged in to it automatically. If you exit that session, the job ends and the node is released.
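
After landing on the compute node, it can be useful to confirm that the session really is inside a job and that the requested GPU is visible. The sketch below is one way to do that. The function name show_allocation is made up for illustration, and it assumes that the interactive session exports the standard Slurm variable SLURM_JOB_ID (Slurm sets it inside jobs, but this has not been verified for LLsub -i specifically).

```shell
#!/bin/bash
# Print the Slurm job ID (if any) and the GPUs visible on the current node.
# show_allocation is a hypothetical helper, not a SuperCloud command.
show_allocation() {
    # SLURM_JOB_ID is normally set by Slurm inside a job and unset on a login node.
    echo "job: ${SLURM_JOB_ID:-none (not inside a Slurm job)}"
    if command -v nvidia-smi >/dev/null 2>&1; then
        # One line per visible GPU: model name and total memory.
        nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
    else
        echo "nvidia-smi not found on this node"
    fi
}
```

Running show_allocation on the login node should report no job, while on a node obtained with -g volta:1 it should list a single V100.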
Reference: https://supercloud.mit.edu/submitting-jobs#serial
- Log in to the SuperCloud login node:
ssh USERNAME@txe1-login.mit.edu
- Write the following to a file called myScript.sh:
#!/bin/bash
#SBATCH -J USERNAME
#SBATCH -o %j.stdout
#SBATCH -e %j.stderr
#SBATCH -c 20
#SBATCH --gres=gpu:volta:1
#SBATCH --time=24:00:00
# Write your commands here
# Write this line to use conda
source /state/partition1/llgrid/pkg/anaconda/anaconda3-2022a/etc/profile.d/conda.sh
conda activate nn_pde_new
cd /home/gridsan/ymeng/mit/hybrid_clf/DeepRL_Algorithms
python global_patch.py --algor sac --exp_name bp --gpus 0 --env_id BpEnv-v0 --random_seed 20221007
- Then submit the job:
LLsub myScript.sh
(or: sbatch myScript.sh)
- Check the job status using:
LLstat
- If the job has been assigned a node (e.g. "NODELIST=d-14-7-2"), you can log in to that compute node using:
ssh d-14-7-2
- From the login node, you can stop a job manually with:
LLkill JOBID
- You can view the stdout (or stderr) files:
cat JOBID.stdout
(or: cat JOBID.stderr)
- You can also append the functions below to your .bashrc file and run source ~/.bashrc. You can then submit a job with:
run_job python global_patch.py --algor sac --exp_name bp --gpus 0 --env_id BpEnv-v0 --random_seed 20221007
and check its stdout/stderr with cato JOBID (or cate JOBID).
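function run_job {
    # Write a batch script that runs the given command line. A here-document
    # avoids history expansion of "#!" if this is pasted into an interactive shell.
    cat > ~/tmp.slurm <<EOF
#!/bin/bash
#SBATCH -J USERNAME
#SBATCH -o %j.stdout
#SBATCH -e %j.stderr
#SBATCH -c 20
#SBATCH --gres=gpu:volta:1
#SBATCH --time=24:00:00
# write your commands here
source /state/partition1/llgrid/pkg/anaconda/anaconda3-2022a/etc/profile.d/conda.sh
conda activate nn_pde_new
cd /home/gridsan/ymeng/mit/hybrid_clf/DeepRL_Algorithms
$*
EOF
    mkdir -p ~/.lsf/
    # Submit from ~/.lsf so the %j.stdout/%j.stderr files land there; the
    # subshell leaves the caller's working directory unchanged.
    (cd ~/.lsf/ && sbatch ~/tmp.slurm)
}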
# Print the stdout of job $1 (as written under ~/.lsf by run_job)
function cato {
    cat ~/.lsf/"$1".stdout
}
# Print the stderr of job $1
function cate {
    cat ~/.lsf/"$1".stderr
}
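
cato and cate need the job ID, which sbatch reports in its "Submitted batch job <id>" message. A small filter like the one below could capture it instead of copying it by hand. This is a sketch only: job_id_from_sbatch is a made-up helper name, and it relies on sbatch's standard message format.

```shell
#!/bin/bash
# Extract the numeric job ID from sbatch's "Submitted batch job <id>" message.
# job_id_from_sbatch is a hypothetical helper; it reads the message on stdin.
job_id_from_sbatch() {
    # The ID is the fourth whitespace-separated field of the message.
    awk '/^Submitted batch job [0-9]+/ { print $4 }'
}

# Example with a canned message (no scheduler needed):
echo "Submitted batch job 1234567" | job_id_from_sbatch   # prints 1234567
```

On the cluster this could be used as jid=$(cd ~/.lsf && sbatch ~/tmp.slurm | job_id_from_sbatch), followed by cato "$jid" once the job has produced output. Alternatively, sbatch --parsable prints the job ID directly, without the surrounding message.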