This guide provides detailed instructions for optimizing BAAI General Embedding (BGE) models (https://huggingface.co/BAAI/bge-reranker-base) with OpenVINO and Optimum Intel.
To prepare your environment for model optimization and inference:
sudo apt update
sudo apt install git-lfs -y
python3 -m venv openvino-env
source openvino-env/bin/activate
pip install --upgrade pip
python -m pip install "optimum-intel[openvino]"@git+https://github.com/huggingface/optimum-intel.git
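To verify the setup, you can run a quick sanity check (a minimal sketch; the exact version string depends on the OpenVINO release pip resolved):
python -c "from openvino.runtime import get_version; print(get_version())"
python -c "from optimum.intel import OVModelForFeatureExtraction; print('Optimum Intel OK')"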
Optimize your Hugging Face models for inference with the OpenVINO runtime by replacing the standard transformer model classes with the corresponding OpenVINO classes; see the Optimum Intel documentation for details. For example, AutoModelForXxx becomes OVModelForXxx. For BGE, use OVModelForFeatureExtraction as shown below:
from transformers import AutoTokenizer, AutoModel
from optimum.intel import OVModelForFeatureExtraction
import torch
# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
# model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
# model.eval()
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', export=True)
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval, prepend an instruction to each query (do not add it to passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, cls pooling.
sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
For 8-bit quantization during model loading, set load_in_8bit=True when calling from_pretrained():
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', load_in_8bit=True, export=True)
NOTE: The load_in_8bit option is enabled by default for models with more than 1 billion parameters; it can be disabled with load_in_8bit=False.
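To avoid re-exporting and re-quantizing on every run, you can persist the converted model with save_pretrained() and reload it later without export=True. A minimal sketch (the output directory name is illustrative):
from optimum.intel import OVModelForFeatureExtraction
# One-time export with 8-bit weight compression, then save the OpenVINO IR locally
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', load_in_8bit=True, export=True)
model.save_pretrained('./bge_ov_model_int8')
# Subsequent runs load the saved IR directly, with no export step
model = OVModelForFeatureExtraction.from_pretrained('./bge_ov_model_int8')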
Use the Optimum Intel CLI to export models from the Hugging Face Hub to OpenVINO IR with various levels of weight compression:
optimum-cli export openvino --model MODEL_ID --weight-format WEIGHT_FORMAT EXPORT_PATH
Replace the placeholders appropriately:
- MODEL_ID: ID of the Hugging Face model.
- WEIGHT_FORMAT: Desired weight format; options include {fp32, fp16, int8, int4, int4_sym_g128, int4_asym_g128, int4_sym_g64, int4_asym_g64}. Refer to the Optimum Intel documentation for more details.
- EXPORT_PATH: Directory path for storing the exported OpenVINO model.
- --ratio RATIO: (Default: 0.8) Compression ratio between primary and backup precisions. For INT4, NNCF evaluates layer sensitivity and keeps the most impactful layers in INT8 precision (by default, 20% of layers stay in INT8), which helps preserve accuracy after weight compression; see the example right after this list.
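For instance, the following exports INT4 weights while keeping 10% of the most sensitive layers in INT8 (the ratio value and output directory name are illustrative):
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int4 --ratio 0.9 bge_ov_model_int4_r09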
To see complete usage, execute:
optimum-cli export openvino -h
Example commands to export BAAI/bge-large-zh-v1.5 with different precision formats (FP16, INT8, and INT4):
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format fp16 bge_ov_model_fp16
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int8 bge_ov_model_int8
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int4 bge_ov_model_int4
NOTE: If you see unexpected results, please add --library sentence-transformers to the above export commands.
After conversion, pass the converted model path as model_id when calling from_pretrained(). You can also select your target device (CPU, GPU, or MULTI:CPU,GPU) via the device argument of that method.
- In addition to MULTI, see the documentation for other supported device options: AUTO, HETERO, BATCH.
- ov_config: enables passing any OpenVINO configuration option as a dictionary. For details, refer to the OpenVINO Advanced Features and Performance Hints documentation.
from transformers import AutoTokenizer
from optimum.intel import OVModelForFeatureExtraction
import torch
# Sentences we want sentence embeddings for
sentences = ["样例数据-1", "样例数据-2"]
# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
device = "CPU"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": "./ov_cache"}
model = OVModelForFeatureExtraction.from_pretrained(model_id='./bge_ov_model_int8', device=device, ov_config=ov_config, export=False)
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval, prepend an instruction to each query (do not add it to passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
# Perform pooling. In this case, cls pooling.
sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)