Running Llama 3 on Consumer Hardware: AMD 6900 XT on Ubuntu 22.04

Running Llama 3 8B Instruct on a single consumer-grade AMD graphics card

Llama 3 8B is one of the newest models released by Meta. It punches above its weight class, offering high performance at a fraction of the resource consumption of 22B or 70B models. Its size and output quality make it interesting to me for systems that handle sensitive data and therefore run locally/offline.


Prerequisites

This guide targets AMD GPUs. On Nvidia hardware you can skip the ROCm installation and its dependencies and still run the script.

Operating System: Ubuntu 22.04
Hardware: AMD GPU with 16 GB VRAM

1. Install ROCm
First, install ROCm to enable GPU support. Follow the detailed installation guide here: ROCm Installation on Ubuntu 22.04

2. Test ROCm Installation
Verify the ROCm installation by running the following command in your terminal:

sudo /opt/rocm/bin/rocm-smi

3. Install radeontop
Install radeontop to monitor GPU utilization:

sudo apt install radeontop

To monitor your AMD GPU, run:

radeontop

4. Setup Python Environment
Create a new virtual environment (venv) or Conda environment named LLM_Test:

For venv:

python3 -m venv LLM_Test
source LLM_Test/bin/activate

For Conda:

conda create --name LLM_Test python=3.8
conda activate LLM_Test

Set up the new interpreter in VS Code by following this guide: Python Environments in VS Code

5. Update pip and Install Packages
Update pip and install necessary Python packages:

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
pip install transformers accelerate huggingface-hub
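
To confirm that the ROCm build of PyTorch actually sees your card before moving on, you can run a quick check like the one below (a minimal sketch; the ROCm wheels reuse the torch.cuda API, so torch.cuda.is_available() is the right call even on AMD):

# quick_check.py - verify the ROCm PyTorch build can see the GPU
import torch

print(torch.__version__)              # should report a +rocm suffix for this wheel
print(torch.cuda.is_available())      # True if the HIP runtime found the GPU
print(torch.cuda.get_device_name(0))  # should list your Radeon card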

6. Setup Hugging Face Account

Create a Hugging Face account and generate an access token. Apply for Llama access on Hugging Face by scrolling down on the model details page: Meta-Llama-3-8B-Instruct on Hugging Face
Meta gives you 24 hours to download the model after your access has been granted.
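
If you want to pull the weights ahead of time (they are roughly 16 GB in bfloat16), you can pre-fill the local Hugging Face cache with snapshot_download; the script in the next step will then find the files locally. A minimal sketch, using the same token as below:

# optional: pre-download the model into the local Hugging Face cache
from huggingface_hub import login, snapshot_download

login("Your access token goes here")
snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct")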

7. Run the Llama Model
The following Python script is a modified version of Meta's official example from the HF model card for Llama 3. Replace "Your access token goes here" with your actual Hugging Face access token.

from huggingface_hub import login
import transformers
import torch
import gc

# Authenticate against Hugging Face so the gated model can be downloaded.
login("Your access token goes here")

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# bfloat16 halves the memory footprint compared to float32;
# device_map="auto" lets accelerate place the model on the GPU.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a happy supply chain expert for pharma who always responds with simple analogies!"},
    {"role": "user", "content": "What is a Bill of Material?"}
]

# Build the prompt in Llama 3's chat format and define the stop tokens.
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
terminators = [pipeline.tokenizer.eos_token_id, pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")]

outputs = pipeline(prompt, max_new_tokens=256, eos_token_id=terminators, do_sample=True, temperature=0.6, top_p=0.9)
# Print only the newly generated answer, without the prompt.
print(outputs[0]["generated_text"][len(prompt):])

# Free the GPU memory once you are done.
del pipeline
torch.cuda.empty_cache()
gc.collect()
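
If you want to keep the conversation going, you can feed the first answer back in as a new turn before the cleanup block at the end. A minimal sketch (the follow-up question is made up for illustration):

# append the model's answer and a follow-up question, then generate again
messages.append({"role": "assistant", "content": outputs[0]["generated_text"][len(prompt):]})
messages.append({"role": "user", "content": "And how does it differ from a recipe?"})

prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=256, eos_token_id=terminators, do_sample=True, temperature=0.6, top_p=0.9)
print(outputs[0]["generated_text"][len(prompt):])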

Conclusion
This guide walks you through setting up an AMD 6900 XT on Ubuntu 22.04 to run Llama 3 8B in bfloat16 precision. In practice you can lower the VRAM demand further with tools such as bitsandbytes or llama.cpp without much quality loss. If you run into issues, double-check the installation steps on Hugging Face and install any missing pip packages mentioned in the error messages.
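
As an illustration of the quantization route, loading the model in 4-bit via bitsandbytes looks roughly like the sketch below. Treat it as a sketch rather than a tested recipe: it assumes pip install bitsandbytes gives you a build that works with your ROCm stack, which is still maturing on AMD GPUs.

# sketch: load Llama 3 8B Instruct in 4-bit via bitsandbytes
import torch
import transformers
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={"quantization_config": quant_config},
    device_map="auto",
)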
