Installing Facebook's Llama model locally

Llama-2 setup instructions

The following instructions were used to get Facebook's Llama-2 up and running on Ubuntu 22.04 (70B model) and an M1 MacBook Air (7B model).

The process is divided into 2 parts: downloading the model from Meta (Part 1) and converting, quantizing, and running it with llama.cpp (Part 2).

Important if you are trying to work with the 70B model and have 500 GB or less free space

This process requires a lot of free space if you are downloading the 70B model. Even with 500 GB of space, I ran out midway because of the many intermediate files being generated.

Use df -h to keep checking free space, or run watch df -h in a separate terminal to re-check every 2 seconds (watch's default interval).

After downloading the model (Part 1) and converting it to a ggml model (Part 2, step 4), I moved the model downloaded in Part 1 (the consolidated.xx.pth files) to another hard drive to free some space before running the quantize command (Part 2, step 5).
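
For example (the destination path is illustrative):

    $ mv ../llama/llama-2-70b-chat/consolidated.*.pth /mnt/other_drive/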

Part 1:

  1. Install Python 3.9 or above. Most recent Linux distros ship with it.
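
    You can check your version with:

    $ python3 --version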

  2. Set up a virtualenv (optional but recommended)
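
    For example, using Python's built-in venv module:

    $ python3 -m venv venv
    $ source venv/bin/activate
    (venv) $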

  3. Go to https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and request access.

    It should take around 5-10 minutes for you to receive an email from Meta AI. Meanwhile, you can complete steps 4 to 7.

  4. Install git.

    # for ubuntu
    (venv) $ sudo apt update
    (venv) $ sudo apt install git
    
  5. Also install wget and md5sum if you don't have them already.
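
    On Ubuntu, for example:

    # md5sum ships with coreutils and is usually preinstalled
    (venv) $ sudo apt install wget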

  6. Clone Facebook's repo

    (venv) $ git clone https://github.com/facebookresearch/llama.git
    
  7. Once the clone is complete, go inside the llama directory and install requirements. This will take around 10 minutes.

    (venv) $ cd llama
    (venv) $ pip install -e .
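
    To sanity-check the install (the repo's example scripts import the package as llama):

    (venv) $ python -c "from llama import Llama"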
    
  8. Make download.sh executable and run it:

    (venv) $ chmod +x download.sh
    (venv) $ ./download.sh
    
  9. After running the script, you will be prompted to enter the link you received in step 3. The link starts with https://download.llamameta.net/*?Policy=eyJTdG.... Copy it carefully, paste it, and hit enter.

  10. Then you will be asked to choose a model (something like below):

    Enter the list of models to download without spaces (7B,13B,70B,7B-chat,13B-chat,70B-chat), or press Enter for all:
    
    • 70B is the largest, with a size of around 129 GB. It took me around 14 hrs to download completely.
    • So make sure you have sufficient disk space, a good internet connection, and extra time to work on it.
    • If you are just starting out, use the 7B model to test things out.
    • Alternatively, look into Hugging Face Transformers, which, I think, may require a paid account.
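
    If you want to re-verify a finished download manually (the download script already checks the MD5 sums itself), each model folder includes a checklist.chk:

    (venv) $ cd llama-2-7b-chat
    (venv) $ md5sum -c checklist.chk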

Part 2:

  1. In a new directory outside the llama dir from above, clone this repo:

    (venv) $ cd ..
    (venv) $ git clone https://github.com/ggerganov/llama.cpp.git
    
  2. Go inside the directory and run make to build the binaries:

    (venv) $ cd llama.cpp
    (venv) $ make
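
    A successful build leaves the main and quantize binaries (used in steps 5 and 6 below) in the repo root:

    (venv) $ ls main quantize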
    
  3. Install requirements:

    (venv) $ pip install -r requirements.txt
    
  4. Run convert.py:

    (venv) $ python convert.py <path to downloaded model folder>
    

    For example, in my case it looks like this:

    (venv) $ python convert.py ../llama/llama-2-70b-chat/
    

    I ran this script from inside the llama.cpp directory, and my directory structure looks like this:

    parent_folder/
        llama/                        ---- cloned facebook's llama repo
            llama-2-70b-chat/         ---- downloaded model from facebook
            other files in that repo
        llama.cpp/                    ---- cloned ggerganov/llama.cpp repo
            convert.py                ---- script to run
            other files in that repo
    

    Read through the convert.py script's main function (around line 1282) to learn about the other parameters.
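
    For example, --outtype and --outfile (two of the options it accepted at the time of writing) control the output format and file name:

    (venv) $ python convert.py ../llama/llama-2-70b-chat/ --outtype f16 --outfile ../llama/llama-2-70b-chat/ggml-model-f16.gguf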

  5. Quantize the model. q4_0 is 4-bit quantization, which shrinks the f16 file to roughly a quarter of its size:

    (venv) $ ./quantize ../llama/llama-2-70b-chat/ggml-model-f16.gguf ../llama/llama-2-70b-chat/ggml-model-q4_0.gguf q4_0
    
  6. Run the inference

    # for the 70B model we need to add -gqa 8
    (venv) $ ./main -m ../llama/llama-2-70b-chat/ggml-model-q4_0.gguf -n 128 -gqa 8

    # for the 7B model
    (venv) $ ./main -m ../llama/llama-2-7b-chat/ggml-model-q4_0.gguf -n 128
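
    To give the model an actual prompt rather than letting it generate freely, main also accepts -p (a standard llama.cpp flag at the time of writing):

    (venv) $ ./main -m ../llama/llama-2-7b-chat/ggml-model-q4_0.gguf -n 128 -p "Write a haiku about llamas"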
    