LLM Int4 Inference on Arc

IPEX

Intel® Extension for PyTorch (IPEX) extends PyTorch* with up-to-date features and optimizations for an extra performance boost on Intel hardware. The optimizations take advantage of the Intel Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs. Moreover, Intel® Extension for PyTorch* provides easy GPU acceleration for Intel discrete GPUs through the PyTorch* xpu device.
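A minimal sketch of the xpu device flow (the Linear layer is just a stand-in for a real model; ipex.optimize is IPEX's generic optimization entry point):

import torch
import intel_extension_for_pytorch as ipex  # importing IPEX registers the "xpu" device

# any nn.Module can be moved to the Intel GPU via .to("xpu")
model = torch.nn.Linear(4096, 4096).half().eval().to("xpu")
x = torch.randn(1, 4096, dtype=torch.float16, device="xpu")

model = ipex.optimize(model, dtype=torch.float16)  # apply IPEX kernel optimizations
with torch.no_grad():
    y = model(x)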

XeTLA

Intel® Xe Templates for Linear Algebra (Intel® XeTLA) is a collection of SYCL/ESIMD templates that enable high-performance General Matrix Multiply (GEMM), Convolution (CONV), and related computations on Intel Xe GPU architecture. Intel® XeTLA offers reusable C++ templates for kernel, group and subgroup levels, allowing developers to optimize and specialize kernels based on data types, tiling policies, algorithms, fusion policies, and more.

Thanks to XeTLA's template design, users can easily define a new compression/decompression prologue and insert it right before the BRGEMM computation to fully accelerate weight-only-quantization (WOQ) GEMM.

We use a lightly modified version of XeTLA to enable further optimizations on the Arc GPU; the idea behind the fused prologue is sketched below.
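Conceptually, the decompression prologue reconstructs fp16 weights from packed int4 data just before the GEMM consumes them. XeTLA does this per tile inside the kernel; the PyTorch sketch below (all names are illustrative, not XeTLA API) shows the same arithmetic at whole-tensor granularity, assuming symmetric per-group scales and unpacked int4 storage:

import torch

def woq_gemm(x, qweight, scales, group_size=128):
    # Illustrative weight-only-quantized GEMM: dequantize, then matmul.
    # qweight: int8 tensor holding values in [0, 15] (unpacked int4, S4_Fullrange)
    # scales:  fp16 per-group scales, shape (K // group_size, N)
    k, n = qweight.shape
    # decompression "prologue": shift [0, 15] back to [-8, 7], then rescale
    w = (qweight.to(torch.int8) - 8).to(torch.float16)
    w = w.reshape(k // group_size, group_size, n) * scales[:, None, :]
    return x @ w.reshape(k, n)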

Quantization and De-quantization

We enabled two kinds of int4 in the XeTLA micro-kernels: S4_Clip and S4_Fullrange. S4_Clip covers [-7, 7], 15 values in total, while S4_Fullrange covers [-8, 7], 16 values in total. The latter is more precise, but in practice its values are stored as [0, 15], so during dequantization we need to subtract 8 first.

if constexpr (compute_policy::quant_type == quant_mode::S4_FULLRANGE) {
    // stored values are in [0, 15]; shift back to the signed range [-8, 7]
    xetla_vector<int8_t, block_size_x_b * block_size_y_b> cvt_blk_i8
            = (cvt_blk.xetla_format<int8_t>()) - int8_t(8);
    cvt_blk_i32 = (cvt_blk_i8.xetla_format<int8_t>());
}
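The same S4_Fullrange round trip can be written out in plain PyTorch (a minimal sketch; the scale choice is illustrative): quantize fp16 weights to [-8, 7], store them shifted into [0, 15], and subtract 8 again on the dequantize path.

import torch

w = torch.randn(8, dtype=torch.float16)   # original fp16 weights
scale = w.abs().max() / 8                 # symmetric scale; -max maps to -8

# quantize: round to [-8, 7], then shift into the unsigned storage range [0, 15]
q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
stored = (q + 8).to(torch.uint8)          # what actually sits in memory

# dequantize: subtract 8 first, then rescale
w_hat = (stored.to(torch.int8) - 8).to(torch.float16) * scale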

ITREX

Intel® Extension for Transformers (ITREX) is an innovative toolkit designed to accelerate GenAI/LLM everywhere with the optimal performance of Transformer-based models on various Intel platforms, including Intel Gaudi2, Intel CPU, and Intel GPU. The toolkit provides a seamless user experience for model compression on Transformer-based models by extending Hugging Face transformers APIs and leveraging Intel® Neural Compressor.

Usage

Example usage of int4 Qwen models in ITREX:

import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_transformers.transformers.modeling import AutoModelForCausalLM
from transformers import AutoTokenizer

device = "xpu"
model_name = "Qwen/Qwen-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
prompt = "Once upon a time, there existed a little girl,"
inputs = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

qmodel = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, device_map="xpu", trust_remote_code=True)

# optimize the model with IPEX; this improves inference performance
qmodel = ipex.optimize_transformers(qmodel, inplace=True, dtype=torch.float16, woq=True, device="xpu")

output = qmodel.generate(inputs)
print(tokenizer.decode(output[0], skip_special_tokens=True))

You can also run the example script directly:

python run_generation_gpu_woq.py --woq --benchmark

Performance and accuracy
