@stefan-it
Last active April 9, 2023
TPU VM Cheatsheet

This TPU VM cheatsheet was written for, and tested with, the following library versions:

Library Version
JAX 0.3.25
FLAX 0.6.4
Datasets 2.10.1
Transformers 4.27.1
Chex 0.1.6

Please note that it may work with later versions, but this is not guaranteed ;)

Create disk with additional storage

gcloud compute disks create lms --zone us-central1-a --size 1024GB

Make sure that your disk is in the same zone as your TPU VM!
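To confirm the disk came up with the expected size and zone, it can be inspected right after creation (a quick check, reusing the disk name lms from above):

```shell
# Show name, size (in GB) and zone of the freshly created disk
gcloud compute disks describe lms --zone us-central1-a \
  --format="value(name,sizeGb,zone)"
```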

Create v3-8 TPU VM

The following command creates a v3-8 TPU VM and attaches the previously created disk to it:

gcloud alpha compute tpus tpu-vm create lms --zone us-central1-a --accelerator-type v3-8 \
--version tpu-vm-base --data-disk source=projects/<project-name>/zones/us-central1-a/disks/lms
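Provisioning can take a moment; before SSH'ing in, it is worth checking that the TPU VM has reached the READY state (a quick check, reusing the name lms from above):

```shell
# Show the TPU VM's name and current state (should be READY)
gcloud alpha compute tpus tpu-vm describe lms --zone us-central1-a \
  --format="value(name,state)"
```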

SSH into TPU VM

Just run the following command to SSH into the TPU VM:

gcloud alpha compute tpus tpu-vm ssh lms --zone us-central1-a 

Installation of libraries

After SSH'ing into the TPU VM, run the following commands, e.g. inside a tmux session.

sudo apt update -y && sudo apt install -y python3-venv
python3 -m venv $HOME/dev
source $HOME/dev/bin/activate
pip install "jax[tpu]==0.3.25" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pip install ipython requests
git clone https://github.com/huggingface/transformers.git
git clone https://github.com/huggingface/datasets.git
git clone https://github.com/google/flax.git
cd transformers && git checkout v4.27.1 && pip install -e . && cd ..
cd datasets && git checkout 2.10.1 && pip install -e . && cd ..
cd flax && git checkout v0.6.4 && pip install -e . && cd ..
pip install chex==0.1.6

# Useful symlinks
ln -s $HOME/transformers/examples/flax/language-modeling/run_bart_dlm_flax.py run_bart_dlm_flax.py
ln -s $HOME/transformers/examples/flax/language-modeling/run_clm_flax.py run_clm_flax.py
ln -s $HOME/transformers/examples/flax/language-modeling/run_mlm_flax.py run_mlm_flax.py
ln -s $HOME/transformers/examples/flax/language-modeling/run_t5_mlm_flax.py run_t5_mlm_flax.py
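To check that the JAX TPU installation actually works, a one-liner on the TPU VM (inside the activated venv) should report all TPU cores; on a v3-8 this is 8:

```shell
# Sanity check: count the devices JAX can see (8 on a v3-8)
python -c "import jax; print(jax.device_count())"
```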

Format and mount disk

The attached disk needs to be formatted first using:

sudo mkfs.ext4 -m 0 -E lazy_itable_init=0,lazy_journal_init=0,discard /dev/sdb

After that it can be mounted via:

sudo mkdir -p /mnt/datasets
sudo mount -o discard,defaults /dev/sdb /mnt/datasets/
sudo chmod a+w /mnt/datasets
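The mount above does not survive a reboot. As a sketch, it can be made persistent via /etc/fstab, using the disk's UUID rather than the device name (this assumes the disk is currently /dev/sdb, as above; nofail keeps the VM bootable if the disk is ever detached):

```shell
# Look up the filesystem UUID of the attached disk
UUID=$(sudo blkid -s UUID -o value /dev/sdb)
# Append a persistent mount entry to /etc/fstab
echo "UUID=${UUID} /mnt/datasets ext4 discard,defaults,nofail 0 2" | sudo tee -a /etc/fstab
```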

HF Datasets Cache

The HF_DATASETS_CACHE variable should now point to the mounted disk:

export HF_DATASETS_CACHE=/mnt/datasets/huggingface
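An exported variable is lost when the shell session ends, so it can also be persisted for future logins (assuming bash is the login shell):

```shell
# Persist the cache location across SSH sessions
echo 'export HF_DATASETS_CACHE=/mnt/datasets/huggingface' >> "$HOME/.bashrc"
```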

Optional: Create swapfile

The following commands create and activate a swapfile:

cd /mnt/datasets
sudo fallocate -l 50G ./swapfile
sudo chmod 600 ./swapfile
sudo mkswap ./swapfile
sudo swapon ./swapfile
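Whether the swapfile is actually active can be verified afterwards:

```shell
# List active swap areas (the new swapfile should appear here)
sudo swapon --show
# Show total memory and swap in human-readable units
free -h
```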

Optional: TensorBoard

Install TensorBoard to get better training metric visualizations:

pip install tensorboard tensorflow

Note: Installing tensorflow avoids the following warning:

[21:19:25] - WARNING - __main__ - Unable to display metrics through TensorBoard because some package are not installed: No module named 'tensorflow'
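TensorBoard runs on the TPU VM itself, so to view it in a local browser the SSH connection can forward its port. A sketch, assuming TensorBoard's default port 6006 (the --logdir placeholder must point at the actual training output directory):

```shell
# On the local machine: SSH with port forwarding
# (everything after -- is passed through to ssh)
gcloud alpha compute tpus tpu-vm ssh lms --zone us-central1-a -- -L 6006:localhost:6006

# On the TPU VM: serve the training logs
tensorboard --logdir <output-dir> --port 6006
```

TensorBoard is then reachable locally at http://localhost:6006.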
@vladtermene

When creating the disk, shouldn't the unit of the disk size be GB?

jbrry commented Aug 29, 2022

Awesome, saved me a lot of time as usual! Anyone using TPU v4 just needs to update the --accelerator-type and --version fields as shown below:

gcloud alpha compute tpus tpu-vm create lms --zone us-central1-a --accelerator-type v4-8 \
--version tpu-vm-v4-base --data-disk source=projects/<project-name>/zones/us-central1-a/disks/lms
