@maziyarpanahi
Created April 4, 2024 18:59
Let's check out the PR and install AutoGPTQ from source:

git fetch origin pull/625/head:dbrx
git switch dbrx
pip install -vvv --no-build-isolation -e .
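
Optionally, confirm that the editable install actually picked up the PR branch before moving on (a quick sketch; it just checks where the package resolves from):

import auto_gptq

# With an editable install, auto_gptq.__file__ should point into the AutoGPTQ
# source checkout rather than site-packages.
print(auto_gptq.__file__)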

Download the model:

from huggingface_hub import snapshot_download

repo_id = "LnL-AI/dbrx-base-converted-v2-4bit-gptq-gptq-v2"
revision = "main"
local_cache_dir = f"/home/maziyar/.cache/huggingface/hub/models--{repo_id.replace('/', '--')}"

# Pull the full snapshot of the quantized repo into the local HF cache layout
snapshot_download(repo_id=repo_id, revision=revision, local_dir_use_symlinks=True, force_download=False, local_dir=local_cache_dir)
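
To make sure the snapshot ended up where the later steps expect it, you can list what was downloaded (a small sketch; the *.safetensors glob is just an assumption about how the repo splits its weights):

from pathlib import Path

local_cache_dir = "/home/maziyar/.cache/huggingface/hub/models--LnL-AI--dbrx-base-converted-v2-4bit-gptq-gptq-v2"

# Print every downloaded weight shard and its size; the glob pattern is an assumption.
for shard in sorted(Path(local_cache_dir).glob("*.safetensors*")):
    print(shard.name, f"{shard.stat().st_size / 1024**3:.2f} GiB")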

Let's put the weights back together via the combine_tensors.sh script that ships with the downloaded repo:

cd /home/maziyar/.cache/huggingface/hub/models--LnL-AI--dbrx-base-converted-v2-4bit-gptq-gptq-v2/
chmod +x combine_tensors.sh
./combine_tensors.sh
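
Before loading, it doesn't hurt to check that the combined weight file now exists (a minimal sketch; the gptq_model-4bit-128g.safetensors name is assumed from the model_basename used when loading below):

from pathlib import Path

# The filename is an assumption based on model_basename="gptq_model-4bit-128g" used later.
combined = Path("/home/maziyar/.cache/huggingface/hub/models--LnL-AI--dbrx-base-converted-v2-4bit-gptq-gptq-v2") / "gptq_model-4bit-128g.safetensors"

if combined.exists():
    print(f"{combined.name}: {combined.stat().st_size / 1024**3:.2f} GiB")
else:
    print("combined weight file not found -- check the combine_tensors.sh output")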

Now let's load the model with AutoGPTQ and the Hugging Face tokenizer for testing:

from transformers import AutoTokenizer, pipeline, TextStreamer

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import torch

model_id = "/home/maziyar/.cache/huggingface/hub/models--LnL-AI--dbrx-base-converted-v2-4bit-gptq-gptq-v2/"

quantize_config = BaseQuantizeConfig(
        bits=4,
        group_size=128,
        damp_percent=0.005,
        desc_act=False,
        static_groups=False,
        sym=True,
        true_sequential=True,
        model_name_or_path=None,
        model_file_base_name=None,
        quant_method="gptq",
        checkpoint_format="gptq"
    )

model = AutoGPTQForCausalLM.from_quantized(
        model_id,
        trust_remote_code=True,
        device="cuda:0",
        model_basename="gptq_model-4bit-128g",
        quantize_config=quantize_config)


tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-base")
streamer = TextStreamer(tokenizer)


input_text = "What does it take to build a great LLM? Resopnd in 3 bullet points"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=False, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=200, streamer=streamer)
print(tokenizer.decode(outputs[0]))
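
As a rough sanity check that the 4-bit weights fit on the card, you can also print the GPU memory footprint after generation (standard torch.cuda calls, nothing model-specific):

import torch

# Rough view of how much VRAM the quantized model (plus KV cache) is using on cuda:0.
print(f"allocated: {torch.cuda.memory_allocated(0) / 1024**3:.1f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.1f} GiB")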

Now you can have fun!!!

# pipelines: you can also wrap the already-loaded model and tokenizer in a
# text-generation pipeline (following AutoGPTQ's usual pipeline usage)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

outputs = pipe("What is a large language model?", max_new_tokens=200)
print(outputs[0]["generated_text"])