Running Llama 3 on Consumer Hardware: AMD 6900 XT on Ubuntu 22.04

Running Llama 3 8B Instruct on a single consumer-grade AMD graphics card

Llama 3 8B is one of the newest models released by Meta. It punches above its weight class, offering high performance at a fraction of the resource consumption of 22B or 70B models. Its size and output quality make it interesting to me for systems that handle sensitive data and therefore run locally/offline.


Prerequisites

This guide targets AMD GPUs. On Nvidia hardware you can skip the ROCm installation and its dependencies and still run the script.

Operating System: Ubuntu 22.04
Hardware: AMD GPU with 16 GB VRAM

1. Install ROCm
First, install ROCm to enable GPU support. Follow the detailed installation guide here: ROCm Installation on Ubuntu 22.04

2. Test ROCm Installation
Verify the ROCm installation by running the following command in your terminal:

sudo /opt/rocm/bin/rocm-smi

3. Install radeontop
Install radeontop to monitor GPU utilization:

sudo apt install radeontop

To monitor your AMD GPU, run:

radeontop

4. Setup Python Environment
Create a new virtual environment (venv) or Conda environment named LLM_Test:

For venv:

python3 -m venv LLM_Test
source LLM_Test/bin/activate

For Conda:

conda create --name LLM_Test python=3.8
conda activate LLM_Test

Set up the new interpreter in VS Code by following this guide: Python Environments in VS Code

5. Update pip and Install Packages
Update pip and install necessary Python packages:

pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7
pip install transformers accelerate huggingface-hub
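
To confirm that the ROCm build of PyTorch actually sees your card before moving on, you can run a quick check like the one below (a minimal sketch; the ROCm wheels reuse the torch.cuda API, so torch.cuda.is_available() is the right call even on AMD):

# quick_check.py - verify the ROCm PyTorch build can see the GPU
import torch

print(torch.__version__)              # should report a +rocm suffix for this wheel
print(torch.cuda.is_available())      # True if the HIP runtime found the GPU
print(torch.cuda.get_device_name(0))  # should list your Radeon card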

6. Setup Hugging Face Account

Create a Hugging Face account and generate an access token. Apply for Llama access on Hugging Face by scrolling down on the model details page: Meta-Llama-3-8B-Instruct on Hugging Face
Meta gives you 24 hours to download the model after your access has been granted.
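
If you want to pull the weights ahead of time (they are roughly 16 GB in bfloat16), you can pre-fill the local Hugging Face cache with snapshot_download; the script in the next step will then find the files locally. A minimal sketch, using the same token as below:

# optional: pre-download the model into the local Hugging Face cache
from huggingface_hub import login, snapshot_download

login("Your access token goes here")
snapshot_download("meta-llama/Meta-Llama-3-8B-Instruct")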

7. Run the Llama Model
The following Python script is a modified version of Meta's official example from the HF model card for Llama 3. Replace "Your access token goes here" with your actual Hugging Face access token.

from huggingface_hub import login
import transformers
import torch
import gc

# Authenticate against Hugging Face so the gated model can be downloaded.
login("Your access token goes here")

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# bfloat16 halves the memory footprint compared to float32;
# device_map="auto" lets accelerate place the model on the GPU.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a happy supply chain expert for pharma who always responds with simple analogies!"},
    {"role": "user", "content": "What is a Bill of Material?"}
]

# Build the prompt in Llama 3's chat format and define the stop tokens.
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
terminators = [pipeline.tokenizer.eos_token_id, pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>")]

outputs = pipeline(prompt, max_new_tokens=256, eos_token_id=terminators, do_sample=True, temperature=0.6, top_p=0.9)
# Print only the newly generated answer, without the prompt.
print(outputs[0]["generated_text"][len(prompt):])

# Free the GPU memory once you are done.
del pipeline
torch.cuda.empty_cache()
gc.collect()
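
If you want to keep the conversation going, you can feed the first answer back in as a new turn before the cleanup block at the end. A minimal sketch (the follow-up question is made up for illustration):

# append the model's answer and a follow-up question, then generate again
messages.append({"role": "assistant", "content": outputs[0]["generated_text"][len(prompt):]})
messages.append({"role": "user", "content": "And how does it differ from a recipe?"})

prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipeline(prompt, max_new_tokens=256, eos_token_id=terminators, do_sample=True, temperature=0.6, top_p=0.9)
print(outputs[0]["generated_text"][len(prompt):])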

Conclusion
This guide walks you through setting up an AMD 6900 XT on Ubuntu 22.04 to run Llama 3 8B in bfloat16 precision. In practice you can lower the VRAM demand further with tools such as bitsandbytes or llama.cpp without much quality loss. If you run into issues, double-check the installation steps on Hugging Face and install any missing pip packages mentioned in the error messages.
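
As an illustration of the quantization route, loading the model in 4-bit via bitsandbytes looks roughly like the sketch below. Treat it as a sketch rather than a tested recipe: it assumes pip install bitsandbytes gives you a build that works with your ROCm stack, which is still maturing on AMD GPUs.

# sketch: load Llama 3 8B Instruct in 4-bit via bitsandbytes
import torch
import transformers
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

pipeline = transformers.pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    model_kwargs={"quantization_config": quant_config},
    device_map="auto",
)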
