@teknium1
Last active June 4, 2023 13:37
HuggingFace Transformers Inference for Alpaca
import time, torch
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig
### CREDITS: tatsu-lab @ GitHub for the original Alpaca model/dataset/inference code. @Main for much of the inference code - https://twitter.com/main_horse - @Teknium1 for the guide - https://twitter.com/Teknium1
### Requires: an Nvidia GPU with at least 11GB VRAM (in 8bit) or 20GB without 8bit
### Download the latest files from https://huggingface.co/chavinlo/alpaca-native/tree/main - whichever checkpoint-xxx is highest (this is the full fine-tuned model, not a LoRA)
### Your folder should contain config.json, pytorch_model.bin.index.json, the pytorch_model-00001-3.bin shards, tokenizer.model, and tokenizer_config.json before running
### Change ./checkpoint-800/ to the directory containing your HF-format model files
### Requires CUDA-enabled PyTorch. Installation guide here: https://pytorch.org/get-started/locally/
### Currently requires installing transformers from GitHub (not the PyPI package) - use pip install git+https://github.com/huggingface/transformers.git
### You need at least 24GB of VRAM to run the model in fp16 (for the 7B Alpaca). You need to install bitsandbytes and set load_in_8bit=True to run in 8bit,
### which can allow running on 12GB VRAM. Be advised that bitsandbytes does not have native Windows support.
### Here is a guide to get bitsandbytes set up on Windows for 8bit: https://rentry.org/llama-tard-v2#install-bitsandbytes-for-8bit-support-skip-this-on-linux
tokenizer = LlamaTokenizer.from_pretrained("./checkpoint-800/")

# Leave this generate_prompt intact - the fine-tune requires prompts to be in this format
def generate_prompt(instruction, input=None):
    if input:
        return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:"""
    else:
        return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:"""
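# For reference, the instruction-only branch above renders a prompt like this
# (illustrative instruction text, not from the gist):
#
#   Below is an instruction that describes a task. Write a response that appropriately completes the request.
#
#   ### Instruction:
#   Name three primary colors
#
#   ### Response: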
model = LlamaForCausalLM.from_pretrained(
    "checkpoint-800",
    load_in_8bit=False,
    torch_dtype=torch.float16,
    device_map="auto",
)
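### If you are VRAM-constrained, the 8bit path mentioned in the header comments would look roughly
### like this instead of the fp16 load above (a sketch, assuming bitsandbytes is installed and working):
# model = LlamaForCausalLM.from_pretrained(
#     "checkpoint-800",
#     load_in_8bit=True,
#     torch_dtype=torch.float16,
#     device_map="auto",
# )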
while True:
    # Wrap the user's instruction in the Alpaca prompt format above
    text = generate_prompt(input("User: "))
    time.sleep(1)
    # Tokenize the prompt and move the token ids onto the GPU
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    # Sample up to 250 new tokens and print the decoded output (prompt included)
    generated_ids = model.generate(input_ids, max_new_tokens=250, do_sample=True, repetition_penalty=1.0, temperature=0.8, top_p=0.75, top_k=40)
    print(tokenizer.decode(generated_ids[0]))
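### To print only the model's reply (a sketch, not part of the original gist), you could slice off the
### prompt tokens and skip special tokens when decoding:
# print(tokenizer.decode(generated_ids[0][input_ids.shape[1]:], skip_special_tokens=True))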
@teknium1 (Author)

If you have any suggestions to improve the file, please let me know here or on Twitter @teknium1.

@teknium1 (Author)

This is what should be in your model's folder:
[image: screenshot of the model folder contents]

@0xprincess commented Mar 17, 2023

[image: screenshot of the error]
I actually came up with very similar code (based on the alpaca-lora notebook), but I'm getting this error too often. As I understand it, it's a precision-overflow issue, and it happens with both the int8 and float16 versions on long generations (on both a local 4090 torch2+cuDNN setup and the alpaca-lora Colab); short ones are fine.
It also happens with both alpaca-native and alpaca-lora, so I still don't know how to solve it, other than loading fp32 on Colab.
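For reference, the fp32 fallback mentioned above is just a dtype swap in the load call (a sketch against the gist's checkpoint path; it needs roughly twice the memory of fp16):

    model = LlamaForCausalLM.from_pretrained(
        "checkpoint-800",
        load_in_8bit=False,
        torch_dtype=torch.float32,
        device_map="auto",
    )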

@0xprincess commented Mar 17, 2023

Also, the alpaca-lora inference code passes the following argument to model.generate: attention_mask=inputs["attention_mask"].to("cuda"), where inputs is the output of the tokenizer.
Do you think it's needed here too, or is it applied by default?

@teknium1 (Author) commented Mar 17, 2023

> Also, the alpaca-lora inference code passes the following argument to model.generate: attention_mask=inputs["attention_mask"].to("cuda"), where inputs is the output of the tokenizer. Do you think it's needed here too, or is it applied by default?

I think with huggingface it is just the way I have it here; I could be wrong though.

    input_ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda")
    generated_ids = model.generate(input_ids,
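For what it's worth, passing the mask explicitly would only be a small change to the loop (a sketch, untested against this checkpoint): keep the full tokenizer output instead of just input_ids and forward its attention_mask to generate:

    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=250, do_sample=True, repetition_penalty=1.0,
        temperature=0.8, top_p=0.75, top_k=40,
    )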
