@pdtgct
Created April 28, 2023 15:27
Convert HF to GGML

The LLaMA model weights can be converted from the Hugging Face PyTorch format back to GGML in two steps:

  1. download the weights from decapoda-research/llama-7b-hf and save them as a PyTorch .pth checkpoint
  2. use the convert-pth-to-ggml.py script from ggerganov/llama.cpp to convert the .pth checkpoint to GGML

This process produces a GGML model with float16 (fp16) precision.

Prerequisite

You need the LLaMA tokenizer configuration and model configuration files. There is currently no clean conversion from Hugging Face back to the original PyTorch layout (the tokenizer files are identical, but the model's checklist.chk and params.json are missing). The best way to get them is to:

  • install the dependencies:
pip install -U pyllama transformers
pip install -r requirements.txt  # run inside your llama.cpp checkout
  • download the 7B configuration (let the consolidated.00.pth model-weights download fail; only the configuration files are needed):
python -m llama.download --model_size=7B --folder=llama

This will download a directory structure like:

llama/
  config.json
  ggml-vocab.bin
  tokenizer.model
  tokenizer_checklist.chk
  tokenizer_config.json
  7B/
    checklist.chk
    params.json
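
As a quick sanity check (a minimal sketch; the paths assume the llama/ folder created by the download command above), you can confirm the configuration files the later steps rely on are present:

from pathlib import Path

# files the conversion steps below expect, relative to the download folder
expected = [
    "config.json",
    "tokenizer.model",
    "tokenizer_config.json",
    "7B/checklist.chk",
    "7B/params.json",
]
missing = [p for p in expected if not (Path("llama") / p).exists()]
print("missing files:", missing or "none")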

Your remaining task is to convert the Hugging Face PyTorch pickle file to a PyTorch state dict and then convert that to GGML.

Conversion

  1. load the Hugging Face model and save its state dict as a PyTorch .pth file (in EMP, ensure you have the SSO Proxy on):
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")
torch.save(model.state_dict(), "llama/7B/consolidated.00.pth")
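
If memory is a concern, you can optionally load the weights in half precision before saving. This is a sketch, assuming a transformers version that supports the torch_dtype argument to from_pretrained (recent releases do):

from transformers import AutoModelForCausalLM
import torch

# load the checkpoint directly in fp16 to roughly halve peak memory use
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf", torch_dtype=torch.float16
)
torch.save(model.state_dict(), "llama/7B/consolidated.00.pth")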
  2. consolidate the 7B files into a single directory (a copy sketch follows the listing below), so you have:
llama_7b/
  config.json
  ggml-vocab.bin
  tokenizer.model
  tokenizer_checklist.chk
  tokenizer_config.json
  checklist.chk
  consolidated.00.pth
  params.json  
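
One way to assemble that directory (a minimal sketch; llama_7b is just the destination name used in this guide, and the large checkpoint is moved rather than copied to save disk space):

import shutil
from pathlib import Path

src, dst = Path("llama"), Path("llama_7b")
dst.mkdir(exist_ok=True)

# copy the shared tokenizer/config files
for name in ["config.json", "ggml-vocab.bin", "tokenizer.model",
             "tokenizer_checklist.chk", "tokenizer_config.json"]:
    shutil.copy2(src / name, dst / name)
# copy the small 7B files and move the large consolidated checkpoint
for name in ["checklist.chk", "params.json"]:
    shutil.copy2(src / "7B" / name, dst / name)
shutil.move(str(src / "7B" / "consolidated.00.pth"), str(dst / "consolidated.00.pth"))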
  3. convert the consolidated.00.pth file to ggml-model-fp16.bin using the convert-transformers-to-ggml.py script from llama.cpp (the trailing 1 selects fp16 output):
python convert-transformers-to-ggml.py llama_7b 1
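
A quick check that the conversion produced output of a plausible size (a sketch; the output filename and location follow the steps above, and 7B parameters at fp16 is roughly 13 GB):

from pathlib import Path

out = Path("llama_7b/ggml-model-fp16.bin")
if out.exists():
    # ~2 bytes per parameter for a 7B model in fp16
    print(f"OK: {out.stat().st_size / 1e9:.1f} GB")
else:
    print("conversion output not found")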

When you are done, you will have a file you can use with llama.cpp, but you have to move it back into the llama/7B/ directory (a one-line move sketch follows the listing below):

llama/
  config.json
  ggml-vocab.bin
  tokenizer.model
  tokenizer_checklist.chk
  tokenizer_config.json
  7B/
    checklist.chk
    params.json
    ggml-model-fp16.bin  # <-- added here
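
For example (a sketch, assuming the paths used above):

import shutil

# move the converted model back alongside the original 7B configuration files
shutil.move("llama_7b/ggml-model-fp16.bin", "llama/7B/ggml-model-fp16.bin")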

Now you can use the model with llama.cpp (after building llama.cpp; the ./models/7B/ path below assumes the llama/ directory has been copied or symlinked as models/ inside your llama.cpp checkout):

./main -m ./models/7B/ggml-model-fp16.bin -n 128