Installing Facebook's Llama model locally

Llama-2 setup instructions

The following instructions were used to get Facebook's Llama-2 up and running on Ubuntu 22.04 (70B model) and an M1 MacBook Air (7B model).

The process is divided into 2 parts: downloading the model from Meta (Part 1) and converting, quantizing, and running it with llama.cpp (Part 2).

Important if you are trying to work with the 70B model and have 500 GB or less free space

This process requires a lot of free space if you are downloading the 70B model. Even with 500 GB of space, I ran out midway because of the many intermediate files being generated.

Use df -h to keep checking free space, or run watch df -h in a separate terminal to re-check every 2 seconds (watch's default interval).

After downloading the model (Part 1) and converting it to a ggml model (Part 2, step 4), I moved the model downloaded in Part 1 (the consolidated.xx.pth files) to another hard drive to free some space before running the quantize command (Part 2, step 5).
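
For example (the destination path is illustrative):

    $ mv ../llama/llama-2-70b-chat/consolidated.*.pth /mnt/other_drive/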

Part 1:

  1. Install Python 3.9 or above. Most recent Linux distros ship with it.
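
    You can check your version with:

    $ python3 --version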

  2. Set up a virtualenv (optional but recommended)
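
    For example, using Python's built-in venv module:

    $ python3 -m venv venv
    $ source venv/bin/activate
    (venv) $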

  3. Go to https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and request access.

    It should take around 5-10 minutes for you to receive an email from Meta AI. Meanwhile, you can complete steps 4 to 7.

  4. Install git.

    # for ubuntu
    (venv) $ sudo apt update
    (venv) $ sudo apt install git
    
  5. Also install wget and md5sum if you don't have them already.
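
    On Ubuntu, for example:

    # md5sum ships with coreutils and is usually preinstalled
    (venv) $ sudo apt install wget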

  6. Clone Facebook's repo

    (venv) $ git clone https://github.com/facebookresearch/llama.git
    
  7. Once the clone is complete, go inside the llama directory and install requirements. This will take around 10 minutes.

    (venv) $ cd llama
    (venv) $ pip install -e .
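
    To sanity-check the install (the repo's example scripts import the package as llama):

    (venv) $ python -c "from llama import Llama"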
    
  8. Make download.sh executable and run it:

    (venv) $ chmod +x download.sh
    (venv) $ ./download.sh
    
  9. After running the script, you will be prompted to enter the link you received in step 3. The link starts with https://download.llamameta.net/*?Policy=eyJTdG.... Copy it carefully, paste it, and hit enter.

  10. Then you will be asked to choose a model (something like below):

    Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all:
    
    • 70B is the largest, with a size of around 129 GB. It took me around 14 hrs to download completely.
    • So make sure you have sufficient disk space, a good internet connection, and extra time to work on it.
    • If you are just starting out, use the 7B model to test things out.
    • Alternatively, look into Hugging Face Transformers, which, I think, may require a paid account.
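
    If you want to re-verify a finished download manually (the download script already checks the MD5 sums itself), each model folder includes a checklist.chk:

    (venv) $ cd llama-2-7b-chat
    (venv) $ md5sum -c checklist.chk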

Part 2:

  1. In a new directory outside the llama dir from above, clone this repo:

    (venv) $ cd ..
    (venv) $ git clone https://github.com/ggerganov/llama.cpp.git
    
  2. Go inside the directory and run make to build the binaries:

    (venv) $ cd llama.cpp
    (venv) $ make
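
    A successful build leaves the main and quantize binaries (used in steps 5 and 6 below) in the repo root:

    (venv) $ ls main quantize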
    
  3. Install requirements:

    (venv) $ pip install -r requirements.txt
    
  4. Run convert.py:

    (venv) $ python convert.py <path to downloaded model folder>
    

    For example, in my case it looks like this:

    (venv) $ python convert.py ../llama/llama-2-70b-chat/
    

    I ran this script from inside the llama.cpp directory, and my directory structure looks like this:

    parent_folder/
        llama/                        ---- cloned facebook's llama repo
            llama-2-70b-chat/         ---- downloaded model from facebook
            other files in that repo
        llama.cpp/                    ---- cloned ggerganov/llama.cpp repo
            convert.py                ---- script to run
            other files in that repo
    

    Read through the convert.py script's main function (around line 1282) to learn about the other parameters.
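
    For example, --outtype and --outfile (two of the options it accepted at the time of writing) control the output format and file name:

    (venv) $ python convert.py ../llama/llama-2-70b-chat/ --outtype f16 --outfile ../llama/llama-2-70b-chat/ggml-model-f16.gguf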

  5. Quantize the model. q4_0 is 4-bit quantization, which shrinks the f16 file to roughly a quarter of its size:

    (venv) $ ./quantize ../llama/llama-2-70b-chat/ggml-model-f16.gguf ../llama/llama-2-70b-chat/ggml-model-q4_0.gguf q4_0
    
  6. Run the inference

    # for the 70B model we need to add -gqa 8
    (venv) $ ./main -m ../llama/llama-2-70b-chat/ggml-model-q4_0.gguf -n 128 -gqa 8

    # for the 7B model
    (venv) $ ./main -m ../llama/llama-2-7b-chat/ggml-model-q4_0.gguf -n 128
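
    To give the model an actual prompt rather than letting it generate freely, main also accepts -p (a standard llama.cpp flag at the time of writing):

    (venv) $ ./main -m ../llama/llama-2-7b-chat/ggml-model-q4_0.gguf -n 128 -p "Write a haiku about llamas"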
    