@TimDettmers
Created October 11, 2022 15:32
Minimal example of 8-bit inference for LLMs via Hugging Face transformers + accelerate.
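import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MAX_NEW_TOKENS = 128
model_name = 'facebook/opt-6.7b'
text = """Hello, I am a prompt. Who are you?"""

# Tokenize the prompt; the resulting tensor starts on the CPU.
tokenizer = AutoTokenizer.from_pretrained(model_name)
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Budget every GPU at the free memory of the current device minus 2 GB of headroom.
free_in_GB = int(torch.cuda.mem_get_info()[0] / 1024**3)
max_memory = f'{free_in_GB - 2}GB'
n_gpus = torch.cuda.device_count()
max_memory = {i: max_memory for i in range(n_gpus)}

# Load the weights in 8-bit (requires bitsandbytes) and let accelerate place the layers.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map='auto',
    load_in_8bit=True,
    max_memory=max_memory
)

# Move the prompt to the model's device; max_new_tokens caps only the generated tokens,
# whereas max_length would also count the prompt.
generated_ids = model.generate(input_ids.to(model.device), max_new_tokens=MAX_NEW_TOKENS)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))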
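
The memory-budget step above reads the free memory of one device and applies the same figure to every GPU. On machines where the GPUs have different amounts of free memory, a per-device budget may be a better fit. The following is a minimal sketch of that idea, not part of the original gist: `build_max_memory` and `reserve_gb` are hypothetical names, and the function takes plain byte counts so it can be tried without a GPU.

```python
def build_max_memory(free_bytes_per_gpu, reserve_gb=2):
    """Map each GPU index to a budget string such as '22GB'.

    free_bytes_per_gpu: free memory in bytes, one entry per GPU (assumed input).
    reserve_gb: headroom in GB left unused on every GPU (hypothetical parameter).
    """
    budget = {}
    for index, free_bytes in enumerate(free_bytes_per_gpu):
        # Same rounding as the script: truncate to whole GB.
        free_gb = int(free_bytes / 1024**3)
        # Clamp at zero so the budget string is never negative.
        usable_gb = max(free_gb - reserve_gb, 0)
        budget[index] = f"{usable_gb}GB"
    return budget


if __name__ == "__main__":
    # Two GPUs with 24 GiB and 16 GiB free, respectively.
    print(build_max_memory([24 * 1024**3, 16 * 1024**3]))
```

With PyTorch available, the input list could be gathered as `[torch.cuda.mem_get_info(i)[0] for i in range(torch.cuda.device_count())]`, and the returned dictionary passed as `max_memory` to `from_pretrained`.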