NeuralSpeed X ITREX.md

NeuralSpeed (NS) is designed to provide efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) model compression techniques. The work is highly inspired by llama.cpp.

Intel® Extension for Transformers (ITREX) is an innovative toolkit to accelerate Transformer-based models on Intel platforms, particularly effective on 4th Gen Intel Xeon Scalable processors (codenamed Sapphire Rapids).

Install

Basically, NS is an optional dependency of ITREX. You can install ITREX via the binary wheel and NS will be installed as one of the requirements:

# define install requirements
install_requires_list = ['packaging', 'numpy', 'schema', 'pyyaml']
- opt_install_requires_list = ['neural_compressor', 'transformers']
+ opt_install_requires_list = ['neural_compressor', 'transformers', 'neural_speed']

Or you can install ITREX from source and control the NS installation manually:

# in the root directory of ITREX
NS=true pip install . 

Or you can install the latest NS as a separate Python package by building from source:

# in the root directory of NS
pip install .

Check whether the build succeeded via:

from intel_extension_for_transformers.utils import itrex_utils
itrex_utils.is_ns_available()
# optionally check GPU-related availability as well
itrex_utils.is_ns_available("gpu")
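
In application code this check can gate the NS path and fall back to the regular Transformers backend otherwise. A minimal sketch, assuming from_pretrained accepts a use_llm_runtime flag matching the dispatch variable shown in the Inside section below:

from intel_extension_for_transformers.utils import itrex_utils
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v1-1"
# fall back to the regular Transformers path when NS is not installed
use_ns = itrex_utils.is_ns_available()
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True, use_llm_runtime=use_ns)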

Usage

LLM Inference

As detailed in the ITREX documents, NS is the default inference option. Moreover, the following ITREX example demonstrates how to leverage the NS 4-bit capability.

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v1-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
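
load_in_4bit applies a default 4-bit recipe. The same fields that the NS dispatch reads (weight_dtype, group_size, compute_dtype, ...) can also be set explicitly. A minimal sketch, reusing inputs and streamer from the snippet above and assuming ITREX exposes a WeightOnlyQuantConfig with these parameters (names may differ between releases):

from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

# illustrative recipe: 4-bit weights, group-wise scales, int8 compute
woq_config = WeightOnlyQuantConfig(
    weight_dtype="int4",
    group_size=32,
    compute_dtype="int8",
)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=woq_config)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)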

Diffusion Inference

ITREX also (will) support the diffusers API, in which you can also enable NS inference for the language sub-models. You can dispatch the text encoder to NS manually:

from transformers import CLIPTokenizer, CLIPFeatureExtractor
from diffusers import AutoencoderKL, UNet2DConditionModel, StableDiffusionPipeline, PNDMScheduler
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer")
model_name = "Intel/xxx-encode"     # Hugging Face model_id or local model
text_encoder = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(args.pretrained_model_name_or_path, subfolder="unet")

pipeline = StableDiffusionPipeline(
    text_encoder=text_encoder,
    vae=vae,
    unet=unet,
    tokenizer=tokenizer,
    scheduler=PNDMScheduler.from_config("CompVis/stable-diffusion-v1-4", subfolder="scheduler"),
    safety_checker=StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker"),
    feature_extractor=CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32"),
)
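
The assembled pipeline is then driven like any stock diffusers pipeline; only the text encoder runs through NS. For example:

# only the text-encoder forward pass goes through NS; VAE and UNet run as usual
prompt = "a photo of an astronaut riding a horse on mars"
image = pipeline(prompt, num_inference_steps=50).images[0]
image.save("astronaut.png")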

Or you can leverage ITREX end-to-end:

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForDiffuser
model_name = "Intel/XXX-sd"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

model = AutoModelForDiffuser.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, steps=50)

Inside

We simply replace the original LLM models with their NS counterparts:

# dispatch to the NS runtime when the LLM runtime path is requested
if use_llm_runtime:
    logger.info("Using LLM runtime.")
    quantization_config.post_init_runtime()
    from neural_speed import Model
    model = Model()
    model.init(
        pretrained_model_name_or_path,
        weight_dtype=quantization_config.weight_dtype,
        alg=quantization_config.scheme,
        group_size=quantization_config.group_size,
        scale_dtype=quantization_config.scale_dtype,
        compute_dtype=quantization_config.compute_dtype,
        use_ggml=quantization_config.use_ggml,
        not_quant=quantization_config.not_quant,
        use_cache=quantization_config.use_cache,
    )
    return model
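
From the caller's point of view nothing else changes: the returned NS Model exposes the same generate() entry point, so the Usage snippets above work unmodified. A minimal sketch of the dispatched object used directly (the init keywords mirror the quantization_config fields above; the values are illustrative):

from transformers import AutoTokenizer
from neural_speed import Model

model_name = "Intel/neural-chat-7b-v1-1"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time, there existed a little girl,", return_tensors="pt").input_ids

model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8", group_size=32)
outputs = model.generate(inputs, max_new_tokens=300)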

Summary

graph LR;
    Transformer-API --> ITREX;
    ITREX --> CPU;
    CPU --> |Weight-Only quantized LLM sub-models| NeuralSpeed;
    ITREX --> GPU;
    GPU --> IPEX;
    CPU --> |Other sub-models| IPEX;
    NeuralSpeed --> |CPU| Bestla;
    IPEX --> |CPU| Bestla;
    IPEX --> |GPU| XeTLA;