# Optimizing BAAI general embedding (BGE) with OpenVINO

This guide provides detailed instructions for optimizing the [BAAI general embedding (BGE)](https://huggingface.co/BAAI/bge-reranker-base) model with OpenVINO and [Optimum Intel](https://huggingface.co/docs/optimum/intel/inference#optimum-inference-with-openvino).

## Environment Setup

To prepare your environment for model optimization and inference:

```bash
sudo apt update
sudo apt install git-lfs -y

python3 -m venv openvino-env
source openvino-env/bin/activate
pip install --upgrade pip
python -m pip install "optimum-intel[openvino]"@git+https://github.com/huggingface/optimum-intel.git
```

## Sample BGE Pipeline with OpenVINO

Optimize your Hugging Face models for inference with the OpenVINO runtime by replacing standard transformer model classes with the corresponding OpenVINO classes; see the [documentation](https://huggingface.co/docs/optimum/intel/inference#transformers-models). For example, `AutoModelForXxx` becomes `OVModelForXxx`. For BGE, use `OVModelForFeatureExtraction` as shown below:

```python
from transformers import AutoTokenizer, AutoModel
from optimum.intel import OVModelForFeatureExtraction
import torch

# Sentences we want sentence embeddings for ("样例数据" means "sample data")
sentences = ["样例数据-1", "样例数据-2"]

# Load the tokenizer from the Hugging Face Hub and export the model to OpenVINO IR on the fly
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
# model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
# model.eval()
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', export=True)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, add an instruction to each query
# (no instruction is needed for passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, CLS pooling.
    sentence_embeddings = model_output[0][:, 0]

# Normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```

For 8-bit quantization during model loading, set `load_in_8bit=True` when calling `from_pretrained()`:

```python
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', load_in_8bit=True, export=True)
```

**NOTE**: The `load_in_8bit` option is enabled by default for models with more than 1 billion parameters and can be disabled with `load_in_8bit=False`.

## Exporting Models with Weight Compression Using Optimum-CLI

Use the [Optimum Intel CLI](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#openvino) to export models from the Hugging Face Hub to OpenVINO IR with various levels of weight compression:

```bash
optimum-cli export openvino --model MODEL_ID --weight-format WEIGHT_FORMAT EXPORT_PATH
```

Replace the placeholders appropriately:

- `MODEL_ID`: ID of the Hugging Face model.
- `WEIGHT_FORMAT`: Desired weight format; options include `{fp32,fp16,int8,int4,int4_sym_g128,int4_asym_g128,int4_sym_g64,int4_asym_g64}`. Refer to the [Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/optimization_ov#weight-only-quantization) for more details.
- `EXPORT_PATH`: Directory path for storing the exported OpenVINO model.
- `--ratio RATIO`: (Default: 0.8) Compression ratio between the primary and backup precisions (see the example after this list).
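For instance, the following command sketches how `--ratio` combines with INT4 export; the output directory name `bge_ov_model_int4_r90` is only an illustrative choice, not from the original guide. A ratio of 0.9 keeps roughly 90% of the weights in INT4 and the remaining 10% in the INT8 backup precision:

```bash
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int4 --ratio 0.9 bge_ov_model_int4_r90
```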
In the case of INT4, NNCF evaluates layer sensitivity and keeps the most impactful layers in INT8 precision (20% of them by default, corresponding to the 0.8 ratio). This helps to achieve better accuracy after weight compression.

To see the complete usage, execute:

```bash
optimum-cli export openvino -h
```

Example commands to export `BAAI/bge-large-zh-v1.5` with different precision formats (FP16, INT8, and INT4):

```bash
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format fp16 bge_ov_model_fp16
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int8 bge_ov_model_int8
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int4 bge_ov_model_int4
```

**NOTE:** If you see unexpected results, add `--library sentence-transformers` to the export commands above.

After conversion, pass the path of the converted model as `model_id` when calling `from_pretrained()`. You can also select your target device (`CPU`, `GPU`, or `MULTI:CPU,GPU`) through the `device` argument of that method.

- In addition to `MULTI`, see the [documentation](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) for other supported device options: `AUTO`, `HETERO`, `BATCH`.
- `ov_config` enables passing any OpenVINO configuration option as a dictionary. For details, refer to the [OpenVINO Advanced Features](https://docs.openvino.ai/2024/get-started.html#openvino-advanced-features) and [Performance Hints](https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/high-level-performance-hints.html).

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForFeatureExtraction
import torch

# Sentences we want sentence embeddings for ("样例数据" means "sample data")
sentences = ["样例数据-1", "样例数据-2"]

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')

# Load the INT8 model exported above from the local directory and compile it for the target device
device = "CPU"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": "./ov_cache"}
model = OVModelForFeatureExtraction.from_pretrained(model_id='./bge_ov_model_int8', device=device, ov_config=ov_config, export=False)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For s2p (short query to long passage) retrieval tasks, add an instruction to each query
# (no instruction is needed for passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # Perform pooling. In this case, CLS pooling.
    sentence_embeddings = model_output[0][:, 0]

# Normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```
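Because the embeddings are L2-normalized, sentence similarity can be scored with a plain dot product. The following minimal sketch continues the example above; the similarity step is an addition for illustration, not part of the original pipeline:

```python
# Cosine similarity of L2-normalized embeddings reduces to a matrix product
similarity = sentence_embeddings @ sentence_embeddings.T
print("Similarity matrix:", similarity)
```

The same pattern applies to retrieval: compute normalized query and passage embeddings separately and score them with `query_embeddings @ passage_embeddings.T`.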