
@rom1504
Last active August 7, 2023 02:01
open_clip on slurm

Install

git clone https://github.com/mlfoundations/open_clip.git
cd open_clip
python3.8 -m venv .env
source .env/bin/activate
pip install -U pip
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
pip install -e .
pip install braceexpand pandas webdataset

Get slurm script

Update: I now recommend using https://github.com/mlfoundations/open_clip/blob/main/docs/script_examples/stability_example.sh instead.

wget https://gist.githubusercontent.com/rom1504/0d6b7e4e49626109a5a8e1c59a4e1aa6/raw/c73aa74c65f42def14cfee6bf8b25438ebd4e11e/start_in_container.sh
wget https://gist.githubusercontent.com/rom1504/0d6b7e4e49626109a5a8e1c59a4e1aa6/raw/c73aa74c65f42def14cfee6bf8b25438ebd4e11e/start_openclip.sh

Run the training

sbatch start_openclip.sh

Check what is going on

squeue -u your_user

ls -lt | head -10 to find the latest log file, then less the_log_file to read it
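The newest-file trick above can be demonstrated on its own (toy directory and hypothetical log names, for illustration only):

```shell
# Toy demo: create two fake log files and pick the most recent by mtime.
tmp=$(mktemp -d)
touch "$tmp/openclip_100.out"
sleep 1
touch "$tmp/openclip_101.out"
newest=$(ls -t "$tmp" | head -1)   # ls -t sorts newest first
echo "$newest"
rm -rf "$tmp"
```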

Find one of the job's hosts in the squeue output, then ssh the_host and run nvidia-smi and htop to check GPU and CPU usage.

Datasets

laion2b

--train-data 'pipe:s3cmd get -q s3://s-datasets/laion5b/laion2B-data/{000000..231349}.tar -' \
--train-num-samples 2170337258 \

laion400m

--train-data="pipe:aws s3 cp s3://s-datasets/laion400m/laion400m-dat-release/{00000..41455}.tar -" \
--train-num-samples 413000000 \
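As a sanity check on the flags above, the shard count follows from each brace range, and dividing --train-num-samples by it gives the average samples per shard (pure arithmetic on the numbers from this page):

```shell
# Shards in {000000..231349} and {00000..41455}, plus average samples per shard.
n2b=$(( 231349 + 1 ))
n400m=$(( 41455 + 1 ))
echo "laion2b:   $n2b shards, ~$(( 2170337258 / n2b )) samples/shard"
echo "laion400m: $n400m shards, ~$(( 413000000 / n400m )) samples/shard"
```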

Thanks to @rwightman for helping me fix this!

Auto resume instructions

Add this to the sbatch header so slurm requeues the job when a node fails:

#SBATCH --requeue

In the script, pick the most recent checkpoint:

checkpoint_path=`ls -t /fsx/rom1504/open_clip/src/logs/*ViT-g-14*/checkpoints/* | head -1`

Pass it to the trainer:

--resume $checkpoint_path \

Append this to the srun command so the job resubmits itself if training exits with an error:

 || sbatch /fsx/rom1504/open_clip/good.sh
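Put together, the checkpoint-selection part of the resume logic looks like this (toy checkpoint directory and hypothetical file names, standing in for the real logs path):

```shell
# Simulate a checkpoints dir and pick the latest file, as the real script does.
ckpt_dir=$(mktemp -d)
touch "$ckpt_dir/epoch_1.pt"
sleep 1
touch "$ckpt_dir/epoch_2.pt"
checkpoint_path=$(ls -t "$ckpt_dir"/* | head -1)   # most recent checkpoint
echo "would pass: --resume $checkpoint_path"
rm -rf "$ckpt_dir"
```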
#!/bin/bash
#SBATCH --partition=compute-od-gpu
#SBATCH --job-name=openclip
#SBATCH --nodes 8
#SBATCH --ntasks-per-node 8
#SBATCH --cpus-per-gpu=6
#SBATCH --gres=gpu:8
#SBATCH --output=%x_%j.out
#SBATCH --exclusive
## Prefer using https://github.com/mlfoundations/open_clip/blob/main/docs/script_examples/stability_example.sh
module load intelmpi
source /opt/intel/mpi/latest/env/vars.sh
export LD_LIBRARY_PATH=/opt/aws-ofi-nccl/lib:/opt/amazon/efa/lib64:/usr/local/cuda-11.0/efa/lib:/usr/local/cuda-11.0/lib:/usr/local/cuda-11.0/lib64:/usr/local/cuda-11.0:/opt/nccl/build/lib:/opt/aws-ofi-nccl-install/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH
export NCCL_PROTO=simple
export PATH=/opt/amazon/efa/bin:$PATH
export LD_PRELOAD="/opt/nccl/build/lib/libnccl.so"
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4dn
#export NCCL_ALGO=ring
export NCCL_DEBUG=info
#export NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,COLL
export PYTHONFAULTHANDLER=1
export CUDA_LAUNCH_BLOCKING=0
export OMPI_MCA_mtl_base_verbose=1
export FI_EFA_ENABLE_SHM_TRANSFER=0
export FI_PROVIDER=efa
export FI_EFA_TX_MIN_CREDITS=64
export NCCL_TREE_THRESHOLD=0
#export NCCL_P2P_DISABLE=1
#export NCCL_IBEXT_DISABLE=1
#export NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond"
# sent to sub script
export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=12802
export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`
echo go $COUNT_NODE
echo $HOSTNAMES
source /fsx/rom1504/open_clip/.env/bin/activate
cd /fsx/rom1504/open_clip/src/
srun --cpu_bind=v --accel-bind=gn python -m training.main \
--save-frequency 1 \
--train-data="pipe:aws s3 cp s3://s-datasets/laion400m/laion400m-dat-release/{00000..41455}.tar -" \
--train-num-samples 413000000 \
--dataset-type webdataset \
--warmup 2000 \
--batch-size=384 \
--epochs=32 \
--lr 5e-4 \
--workers=2 \
--model ViT-B-32 \
--seed 0 \
--local-loss \
--grad-checkpointing \
--ddp-static-graph \
--gather-with-grad \
--report-to wandb
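Note the effective global batch size implied by the header and flags above (8 nodes x 8 tasks per node x 384 per GPU); with --local-loss and --gather-with-grad the contrastive loss is computed over this whole global batch:

```shell
# Global batch size = nodes * gpus per node * per-GPU batch size.
nodes=8
gpus_per_node=8
per_gpu_batch=384
echo "global batch size: $(( nodes * gpus_per_node * per_gpu_batch ))"
```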
@iejMac

iejMac commented Jul 21, 2022

You can also use squeue -u your_user to see what's going on; it includes the header.

@rom1504

rom1504 commented Sep 7, 2022

indeed, changed
