
@ravi9
Created April 26, 2024 18:22

Optimizing BAAI general embedding (BGE) with OpenVINO

This guide provides detailed instructions for optimizing BAAI General Embedding (BGE) models such as https://huggingface.co/BAAI/bge-large-zh-v1.5 with OpenVINO and Optimum Intel.

Environment Setup

To prepare your environment for model optimization and inference:

sudo apt update
sudo apt install git-lfs -y

python3 -m venv openvino-env
source openvino-env/bin/activate
pip install --upgrade pip

python -m pip install "optimum-intel[openvino]@git+https://github.com/huggingface/optimum-intel.git"
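
To confirm the installation succeeded, you can try importing the class used throughout this guide (a quick sanity check, not part of the official setup steps):

python -c "from optimum.intel import OVModelForFeatureExtraction; print('optimum-intel with OpenVINO is ready')"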

Sample BGE Pipeline with OpenVINO

Optimize your Hugging Face models for inference with the OpenVINO runtime by replacing the standard transformers model classes with the corresponding OpenVINO classes (see the Optimum Intel documentation).

For example, AutoModelForXxx becomes OVModelForXxx. For BGE, use OVModelForFeatureExtraction as shown below:

from transformers import AutoTokenizer, AutoModel
from optimum.intel import OVModelForFeatureExtraction
import torch
# Sentences we want sentence embeddings for ("样例数据-1/2" means "sample data 1/2")
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
# model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
# model.eval()
# Replace AutoModel with OVModelForFeatureExtraction; export=True converts the model to OpenVINO IR on the fly
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', export=True)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, prepend an instruction to each query (no instruction is needed for passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)

For 8-bit quantization during model loading, set load_in_8bit=True when calling from_pretrained():

model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', load_in_8bit=True, export=True)

NOTE: The load_in_8bit option is enabled by default for models with more than 1 billion parameters, which can be disabled with load_in_8bit=False.
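
Depending on your optimum-intel version, you may also be able to pass an explicit weight-quantization config instead of the load_in_8bit shorthand. The sketch below assumes OVWeightQuantizationConfig is available in your installed release and is supported for feature-extraction models:

from optimum.intel import OVModelForFeatureExtraction, OVWeightQuantizationConfig

# Assumed alternative to load_in_8bit=True: an explicit 8-bit weight-quantization config
quant_config = OVWeightQuantizationConfig(bits=8)
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', export=True, quantization_config=quant_config)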

Exporting Models with Weight Compression Using Optimum-CLI

Use the Optimum Intel CLI to export models from the Hugging Face Hub to OpenVINO IR with various levels of weight compression:

optimum-cli export openvino --model MODEL_ID --weight-format WEIGHT_FORMAT EXPORT_PATH

Replace the placeholders as follows (the --ratio flag is optional):

  • MODEL_ID: ID of the HuggingFace model.
  • WEIGHT_FORMAT: Desired weight format, options include {fp32,fp16,int8,int4,int4_sym_g128,int4_asym_g128,int4_sym_g64,int4_asym_g64}. Refer to the Optimum Intel documentation for more details.
  • EXPORT_PATH: Directory path for storing the exported OpenVINO model.
  • --ratio RATIO: (Default: 0.8) Compression ratio between primary and backup precision. In the case of INT4, NNCF evaluates layer sensitivity and keeps the most impactful layers in INT8 precision (by default 20% in INT8). This helps to achieve better accuracy after weight compression.

To see complete usage, execute:

optimum-cli export openvino -h

Example commands to export BAAI/bge-large-zh-v1.5 with different precision formats (FP16, INT8, and INT4):

optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format fp16 bge_ov_model_fp16
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int8 bge_ov_model_int8
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int4 bge_ov_model_int4
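
For INT4 exports you can also tune the compression ratio described above; the ratio value and output directory below are illustrative:

optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int4 --ratio 0.9 bge_ov_model_int4_r90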

NOTE: If you see unexpected results, please add --library sentence-transformers to the above export commands.

After conversion, pass the path of the converted model as model_id when calling from_pretrained(). You can also select the target device (CPU, GPU, or MULTI:CPU,GPU) via the device argument of that method.

from transformers import AutoTokenizer
from optimum.intel import OVModelForFeatureExtraction
import torch
# Sentences we want sentence embeddings for ("样例数据-1/2" means "sample data 1/2")
sentences = ["样例数据-1", "样例数据-2"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')

device = "CPU"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": "./ov_cache"}

model = OVModelForFeatureExtraction.from_pretrained(model_id='./bge_ov_model_int8', device=device, ov_config=ov_config, export=False)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, prepend an instruction to each query (no instruction is needed for passages)
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, cls pooling.
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)