Convert downloaded LLaMA 2 weights to Hugging Face format so that you can load them with huggingface/transformers from_pretrained.

Download and convert the LLaMA 2 7B model to Hugging Face format

This is a refined version of the Hugging Face LLaMA 2 usage tips.

Go to wherever you want to put the original LLaMA 2 7B model weights, for example your downloads folder, and create a folder called LLaMA:

cd ~/downloads  # or wherever you keep downloads
mkdir LLaMA
cd LLaMA

Clone the model weights; they take about 15 GB of disk space:

git lfs clone https://huggingface.co/meta-llama/Llama-2-7b
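
Note that the meta-llama repositories on the Hub are gated, so the clone only works with an account that has been granted access. If git lfs gives you trouble, one alternative (a minimal sketch, assuming huggingface_hub is installed and your account has accepted the LLaMA 2 license) is to download the files from Python instead:

# sketch: download the gated repo with huggingface_hub instead of git lfs
from huggingface_hub import login, snapshot_download

login()  # paste a Hugging Face access token from an account with LLaMA 2 access
snapshot_download(
    repo_id="meta-llama/Llama-2-7b",
    local_dir="Llama-2-7b",  # downloads into ./Llama-2-7b, matching the git clone layout
)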

Once the clone finishes, rename the folder to 7B and copy tokenizer.model out of the 7B folder into the LLaMA folder:

mv Llama-2-7b 7B
cp ./7B/tokenizer.model ./
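
Before converting, it is worth checking that the LLaMA folder now has the layout the converter expects: tokenizer.model at the top level and a 7B subfolder containing the consolidated checkpoint and params.json. A small sanity-check sketch (the base path is an assumption; adjust it to wherever you put the weights):

# sketch: verify the input layout expected by the conversion script
from pathlib import Path

base = Path.home() / "downloads" / "LLaMA"  # adjust if you put the weights elsewhere
for name in ["tokenizer.model", "7B/params.json", "7B/consolidated.00.pth"]:
    path = base / name
    print("OK     " if path.exists() else "MISSING", path)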

Go back to where you want to put the converted model weights:

cd ~/huggingface  # or wherever you like
mkdir llama2-7b-base-hf

Activate a virtual environment and install the required libraries:

conda activate your_env_name
pip install torch transformers tokenizers protobuf sentencepiece accelerate flash_attn bitsandbytes datasets
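
Before running the conversion, it can save time to confirm the environment actually works; this sketch just prints the installed versions and whether a GPU is visible:

# sketch: quick environment check
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())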

Convert the model:

python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir ~/downloads/LLaMA --model_size 7B --llama_version 2 --output_dir ~/huggingface/llama2-7b-base-hf

The command above uses the directories created earlier; replace --input_dir and --output_dir with your own paths if they differ.

Once the conversion finishes, try loading the model from the output_dir folder:

import os

from transformers import AutoModelForCausalLM, AutoTokenizer

output_dir = os.path.expanduser("~/huggingface/llama2-7b-base-hf")  # the output_dir from the conversion step

model = AutoModelForCausalLM.from_pretrained(output_dir)
model.cuda()  # loading the weights on the GPU takes about 26 GB of GPU memory; remove this line to load on CPU instead
tokenizer = AutoTokenizer.from_pretrained(output_dir)
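
If 26 GB is more GPU memory than you have, you can load the weights in half precision, or quantized with bitsandbytes (both installed above), instead of the default float32. A hedged sketch using standard from_pretrained options; the path is assumed to be the output_dir from the conversion step:

import os

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

output_dir = os.path.expanduser("~/huggingface/llama2-7b-base-hf")  # same output_dir as above

# half precision roughly halves the memory needed compared to float32
model = AutoModelForCausalLM.from_pretrained(
    output_dir, torch_dtype=torch.float16, device_map="auto"
)

# or, for even less memory, 4-bit quantization via bitsandbytes:
# model = AutoModelForCausalLM.from_pretrained(
#     output_dir,
#     quantization_config=BitsAndBytesConfig(load_in_4bit=True),
#     device_map="auto",
# )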

Run an inference to see if it's working:

inputs = tokenizer("this is a story,", return_tensors='pt').to('cuda')
output = model.generate(**inputs)

print(tokenizer.batch_decode(output))
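
By default generate() may stop after only a handful of new tokens, so the continuation can look very short. A sketch with a few common generation arguments to get a longer, sampled continuation:

output = model.generate(
    **inputs,
    max_new_tokens=128,  # generate a longer continuation
    do_sample=True,      # sample instead of greedy decoding
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])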

Check your result and enjoy.
