This is a refined version of the Hugging Face LLaMA 2 usage tips.
Go to wherever you want to keep the original LLaMA 2 7B model weights, for example your downloads folder, and create a folder called LLaMA:
cd ~/downloads (or wherever you prefer)
mkdir LLaMA
cd LLaMA
Clone the model weights; this takes about 15 GB of disk space:
git lfs clone https://huggingface.co/meta-llama/Llama-2-7b
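If git lfs is not available, the same repo can alternatively be downloaded with the huggingface_hub Python client. This is a minimal sketch, assuming you have been granted access to the gated meta-llama repo and are logged in (e.g. via huggingface-cli login); the local_dir path is just an example:

from huggingface_hub import snapshot_download

# download the original-format weights into ./Llama-2-7b
snapshot_download(repo_id="meta-llama/Llama-2-7b", local_dir="Llama-2-7b")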
Once the download finishes, rename the folder to 7B, then copy tokenizer.model out of the 7B folder into the LLaMA folder:
mv Llama-2-7b 7B
cp ./7B/tokenizer.model ./
Go to wherever you want to put the converted version of the model weights and create an output folder:
cd ~/huggingface (or wherever you prefer)
mkdir llama2-7b-base-hf
Activate a virtual environment and install the required libraries:
conda activate your_env_name
pip install torch transformers tokenizers protobuf sentencepiece accelerate flash_attn bitsandbytes datasets
Convert the model:
python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir ~/downloads/LLaMA --model_size 7B --llama_version 2 --output_dir ~/huggingface/llama2-7b-base-hf
Here we use the directories created earlier; replace input_dir and output_dir with your own paths if they differ.
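As an optional sanity check (an illustrative sketch, assuming the paths used above), you can confirm the output directory contains the expected files before loading the model:

import os

out_dir = os.path.expanduser("~/huggingface/llama2-7b-base-hf")
# expect config.json, tokenizer files, and the sharded *.bin or *.safetensors weights
print(sorted(os.listdir(out_dir)))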
Once the conversion is done, try to load the model from the output_dir folder:
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = os.path.expanduser("~/huggingface/llama2-7b-base-hf")  # the output_dir from the conversion step
model = AutoModelForCausalLM.from_pretrained(model_dir)
model.cuda()  # loading the weights on GPU takes about 26GB of GPU memory; remove this line to keep the model on CPU
tokenizer = AutoTokenizer.from_pretrained(model_dir)
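If 26GB does not fit on your GPU, one common alternative (a sketch, relying on the accelerate package installed above) is to load the weights in half precision and let transformers place them automatically:

import torch

# half-precision load roughly halves the GPU memory needed; model_dir is the same output_dir as above
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")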
Run an inference to see if it's working well:
inputs = tokenizer("this is a story,", return_tensors='pt').to('cuda')  # drop .to('cuda') if you kept the model on CPU
output = model.generate(**inputs)
print(tokenizer.batch_decode(output))
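generate() defaults to a fairly short continuation; if you want a longer, sampled completion, you can pass generation arguments explicitly (a sketch, the parameter values below are only examples):

output = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.batch_decode(output, skip_special_tokens=True))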
Check your result and enjoy.