git clone https://github.com/mlfoundations/open_clip.git
cd open_clip
python3.8 -m venv .env
source .env/bin/activate
pip install -U pip
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113
pip install -e .
pip install braceexpand pandas webdataset
Update: I now recommend using https://github.com/mlfoundations/open_clip/blob/main/docs/script_examples/stability_example.sh instead.
wget https://gist.githubusercontent.com/rom1504/0d6b7e4e49626109a5a8e1c59a4e1aa6/raw/c73aa74c65f42def14cfee6bf8b25438ebd4e11e/start_in_container.sh
wget https://gist.githubusercontent.com/rom1504/0d6b7e4e49626109a5a8e1c59a4e1aa6/raw/c73aa74c65f42def14cfee6bf8b25438ebd4e11e/start_openclip.sh
sbatch start_openclip.sh
squeue -u your_user
ls -lt | head -10
to find the log file, then less thefile to read it.
To inspect a node, find one host in the squeue output, then ssh thehost
and run nvidia-smi and htop there.
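The log lookup above can be wrapped in a small helper. This is only a minimal sketch: `latest_log` is a hypothetical name, and it assumes the Slurm log files are written to the directory you pass in and that the newest entry there is the log you want.

```shell
# Hypothetical helper: print the path of the most recently modified entry
# in a directory (defaults to the current directory).
# Assumption: the newest entry is the log file of the running job.
latest_log() {
    dir="${1:-.}"
    # ls -t sorts by modification time, newest first
    newest=$(ls -t "$dir" | head -n 1)
    if [ -n "$newest" ]; then
        printf '%s/%s\n' "$dir" "$newest"
    else
        # empty directory: nothing to show
        return 1
    fi
}
```

With that in place, something like `tail -f "$(latest_log .)"` follows the newest log as the job writes it.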
--train-data 'pipe:s3cmd get -q s3://s-datasets/laion5b/laion2B-data/{000000..231349}.tar -' \
--train-num-samples 2170337258 \
--train-data="pipe:aws s3 cp s3://s-datasets/laion400m/laion400m-dat-release/{00000..41455}.tar -" \
--train-num-samples 413000000 \
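The brace range in --train-data sits inside quotes, so the shell does not expand it; as far as I understand, the data loader expands it into one URL per shard (braceexpand is installed above for that purpose). To get a feel for what the pattern covers, here is a rough sketch that prints the shard names for a small range. `list_shards` is a hypothetical helper, not part of open_clip or webdataset.

```shell
# Hypothetical helper: print zero-padded shard names, one per line.
# Arguments: first index, last index, padding width.
# Pass plain integers (0, not 00000) so printf does not read them as octal.
list_shards() {
    first=$1
    last=$2
    width=$3
    i=$first
    while [ "$i" -le "$last" ]; do
        printf "%0${width}d.tar\n" "$i"
        i=$((i + 1))
    done
}

# Small range for illustration: 00000.tar through 00003.tar
list_shards 0 3 5
```

The number of shards in a range is last - first + 1, so {00000..41455} covers 41456 tar files and {000000..231349} covers 231350.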
Thanks to @rwightman for helping me fix this!
#SBATCH --requeue
checkpoint_path=$(ls -t /fsx/rom1504/open_clip/src/logs/*ViT-g-14*/checkpoints/* | head -n 1)
--resume $checkpoint_path \
|| sbatch /fsx/rom1504/open_clip/good.sh
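One thing the lines above do not handle is the very first run, when no checkpoint exists yet and --resume would get an empty value. A possible way around it, as a sketch only: `latest_checkpoint` and `resume_args` are hypothetical helpers, and they assume a single run directory containing a checkpoints/ folder of plain files, with no spaces in the paths.

```shell
# Hypothetical helper: print the newest file in <run_dir>/checkpoints,
# or nothing if there is no checkpoint yet.
latest_checkpoint() {
    # errors from an unmatched glob are silenced so the output stays empty
    ls -t "$1"/checkpoints/* 2>/dev/null | head -n 1
}

# Hypothetical helper: print "--resume <path>" only when a checkpoint exists.
resume_args() {
    ckpt=$(latest_checkpoint "$1")
    if [ -n "$ckpt" ]; then
        printf '%s %s\n' --resume "$ckpt"
    fi
}
```

In the job script, the training command could then take `$(resume_args "$run_dir")` unquoted in place of the fixed --resume line, where `run_dir` is a placeholder for your log directory. It expands to nothing on the first run and to the newest checkpoint after a requeue.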
You can also use
squeue -u your_user
to see what's going on; its output includes the header.