@philippbayer
Last active October 16, 2023 08:01
installing torch/transformers under ROCm on Pawsey

Here's my alias in .bashrc for getting a gpu-dev instance based on https://support.pawsey.org.au/documentation/display/US/Setonix+GPU+Partition+Quick+Start

alias getgpunode='salloc -p gpu-dev --nodes=1 --gpus-per-node=1 --account=${PAWSEY_PROJECT}-gpu'

First, make a fresh environment (and activate it before running the pip steps below):

mamba create -p `pwd`/transformers transformers python=3.10

Install Torch built against the closest available ROCm version. There are no wheels for 5.4.3 (the current 'new' version on Pawsey) or for 5.2.3 (the default version), so I use the rocm5.4.2 index. I also set pip's cache directory to somewhere on /scratch.

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2 --cache-dir `pwd`/pipcache

Test whether torch can see the GPU:

srun python -c 'import torch; print(torch.cuda.is_available())'

Should print True!
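
If it prints False, it helps to see a bit more than a single boolean. Here is a minimal sketch of a slightly more verbose check; the function name describe_torch_gpu is my own invention, and it assumes a ROCm build of torch exposes its HIP version as torch.version.hip (None on non-ROCm builds).

```python
import importlib.util


def describe_torch_gpu():
    """Collect basic facts about the torch install and the visible GPU.

    Hypothetical helper, not part of torch. Returns a plain dict so it
    can be printed or logged from inside an srun job.
    """
    if importlib.util.find_spec("torch") is None:
        # torch is not installed in the active environment
        return {"torch_installed": False}

    import torch

    info = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # Set on ROCm builds, None on CUDA/CPU builds
        "hip_version": getattr(torch.version, "hip", None),
        # ROCm devices are reported through the torch.cuda API
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    print(describe_torch_gpu())
```

Save it to a file and run it with srun python on the GPU node, same as the one-liner above.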

Install the 'right' TensorFlow (the ROCm build):

pip install tensorflow-rocm

Seems to work? Now install transformers:

pip install transformers

Change transformers' cache to somewhere on /scratch:

export TRANSFORMERS_CACHE=`pwd`/tf_cache
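
The same thing can be done from inside a Python script, which is handy in batch jobs where the export might be forgotten. This is only a sketch: it assumes the variable is set before transformers is imported, and the tf_cache directory name is simply the one from the export above.

```python
import os
from pathlib import Path

# Put the cache in the current working directory (somewhere on /scratch)
cache_dir = Path.cwd() / "tf_cache"
cache_dir.mkdir(parents=True, exist_ok=True)

# Must happen before `import transformers`, which reads the variable at import time
os.environ["TRANSFORMERS_CACHE"] = str(cache_dir)

print(f"transformers cache: {os.environ['TRANSFORMERS_CACHE']}")
```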

I also had to upgrade accelerate:

pip install -U accelerate

Then run some code!

I got an error like

python3.10/site-packages/torch/lib/libtorch_cpu.so: undefined symbol: roctracer_next_record, version ROCTRACER_4.1

It turned out I had a typo in the module load rocm/5.4.3 line, which ended up loading an old ROCm. Loading the correct rocm/5.4.3 solved it.

I also got an error like

Device-side assertion `t >= 0 && t < n_classes' failed.

My class labels did not start at 0; I had accidentally used the taxonomy IDs (some large numbers) as labels. Replacing all taxonomy IDs with a counter from 0 to len(tax_ids) - 1 fixed this.
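
A minimal sketch of that remapping, in case it helps. The function name build_label_map and the example IDs are made up for illustration; the only real requirement is that every label ends up in the range 0 to n_classes - 1.

```python
def build_label_map(tax_ids):
    """Map each distinct taxonomy ID to a contiguous class index starting at 0."""
    unique_ids = sorted(set(tax_ids))
    return {tax_id: idx for idx, tax_id in enumerate(unique_ids)}


# Illustrative taxonomy IDs only, not my actual data
tax_ids = [9606, 10090, 9606, 3702]

label_map = build_label_map(tax_ids)
labels = [label_map[t] for t in tax_ids]

# Invert the map to get back from a predicted class to its taxonomy ID
index_to_tax_id = {idx: tax_id for tax_id, idx in label_map.items()}

print(label_map)  # {3702: 0, 9606: 1, 10090: 2}
print(labels)     # [1, 2, 1, 0]
```

Keep the inverted map around so predictions can be turned back into taxonomy IDs afterwards.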
