# Optimizing BAAI general embedding (BGE) with OpenVINO

This guide provides detailed instructions for optimizing the BAAI general embedding (BGE) model [BAAI/bge-large-zh-v1.5](https://huggingface.co/BAAI/bge-large-zh-v1.5) with OpenVINO and [Optimum Intel](https://huggingface.co/docs/optimum/intel/inference#optimum-inference-with-openvino).

## Environment Setup

To prepare your environment for model optimization and inference:

```bash
sudo apt update
sudo apt install git-lfs -y

python3 -m venv openvino-env
source openvino-env/bin/activate
pip install --upgrade pip

python -m pip install "optimum-intel[openvino]@git+https://github.com/huggingface/optimum-intel.git"
```
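
After installation, you can optionally run a quick sanity check. The snippet below is a minimal sketch (assuming a recent OpenVINO release that exposes `get_version()` at the top level); it confirms the imports work and lists the devices OpenVINO detects on your machine:

```python
# Optional sanity check: verify the install and list visible devices
import openvino as ov
from optimum.intel import OVModelForFeatureExtraction  # raises ImportError if the install failed

core = ov.Core()
print("OpenVINO version:", ov.get_version())
print("Available devices:", core.available_devices)  # e.g. ['CPU'] or ['CPU', 'GPU']
```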

## Sample BGE Pipeline with OpenVINO

Optimize your Hugging Face models for inference using the OpenVINO runtime by replacing standard transformer model classes with corresponding OpenVINO classes. See [docs](https://huggingface.co/docs/optimum/intel/inference#transformers-models).

For example, `AutoModelForXxx` becomes `OVModelForXxx`. For BGE, use `OVModelForFeatureExtraction` as shown below:

```python
from transformers import AutoTokenizer, AutoModel
from optimum.intel import OVModelForFeatureExtraction
import torch
# Sentences we want sentence embeddings for ("样例数据" means "sample data")
sentences = ["样例数据-1", "样例数据-2"]

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
# Original transformers usage, replaced by the OpenVINO class below:
# model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
# model.eval()
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', export=True)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For an s2p (short query to long passage) retrieval task, prepend an instruction to each query (not to passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # CLS pooling: take the embedding of the first token
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
```
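
Because the embeddings are L2-normalized in the last step, cosine similarity between sentences reduces to a dot product. A short follow-up sketch reusing the `sentence_embeddings` tensor from the snippet above:

```python
# Pairwise cosine similarity (dot product of L2-normalized embeddings)
similarity = sentence_embeddings @ sentence_embeddings.T
print("Similarity matrix:", similarity)
```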

For 8-bit quantization during model loading, set `load_in_8bit=True` when calling `from_pretrained()`:

```python
model = OVModelForFeatureExtraction.from_pretrained('BAAI/bge-large-zh-v1.5', load_in_8bit=True, export=True)
```

**NOTE**: The `load_in_8bit` option is enabled by default for models with more than 1 billion parameters; disable it by passing `load_in_8bit=False`.
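
To avoid re-running the export on every load, you can save the converted model and tokenizer to a local directory and reload them from there later. A minimal sketch (the `./bge_ov_model_int8` path is just an example, chosen to match the directory used in the next section):

```python
# Persist the exported (and quantized) OpenVINO model plus tokenizer for reuse
model.save_pretrained('./bge_ov_model_int8')
tokenizer.save_pretrained('./bge_ov_model_int8')

# Later runs can load the local copy directly, with no export step
model = OVModelForFeatureExtraction.from_pretrained('./bge_ov_model_int8')
```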

## Exporting Models with Weight Compression Using Optimum-CLI

Use the [Optimum Intel CLI](https://github.com/huggingface/optimum-intel?tab=readme-ov-file#openvino) to export models from the Hugging Face Hub to OpenVINO IR with various levels of weight compression:

```bash
optimum-cli export openvino --model MODEL_ID --weight-format WEIGHT_FORMAT EXPORT_PATH
```

Replace the placeholders appropriately:

- `MODEL_ID`: ID of the HuggingFace model.
- `WEIGHT_FORMAT`: Desired weight format, options include `{fp32,fp16,int8,int4,int4_sym_g128,int4_asym_g128,int4_sym_g64,int4_asym_g64}`. Refer to the [Optimum Intel documentation](https://huggingface.co/docs/optimum/intel/optimization_ov#weight-only-quantization) for more details.
- `EXPORT_PATH`: Directory path for storing the exported OpenVINO model.
- `--ratio RATIO`: (optional; default: 0.8) Compression ratio between primary and backup precisions. For INT4, NNCF evaluates layer sensitivity and keeps the most impactful layers in INT8 (by default, 20% of layers), which helps preserve accuracy after weight compression; see the Python sketch after this list.
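
The same weight compression can also be applied from Python. A sketch assuming the `OVWeightQuantizationConfig` API from recent optimum-intel releases (check your installed version, as this interface has evolved):

```python
from optimum.intel import OVModelForFeatureExtraction, OVWeightQuantizationConfig

# 4-bit weight compression; ratio=0.8 keeps the most sensitive 20% of layers
# in INT8, mirroring the CLI default described above
q_config = OVWeightQuantizationConfig(bits=4, ratio=0.8)
model = OVModelForFeatureExtraction.from_pretrained(
    'BAAI/bge-large-zh-v1.5', export=True, quantization_config=q_config
)
```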

To see complete usage, execute:

```bash
optimum-cli export openvino -h
```

Example commands to export `BAAI/bge-large-zh-v1.5` with different precision formats (FP16, INT8, and INT4):

```bash
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format fp16 bge_ov_model_fp16
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int8 bge_ov_model_int8
optimum-cli export openvino --model BAAI/bge-large-zh-v1.5 --weight-format int4 bge_ov_model_int4
```

**NOTE:** If you see unexpected results, add `--library sentence-transformers` to the export commands above.

After conversion, pass the path of the converted model as `model_id` when calling `from_pretrained()`. You can also select the target device (`CPU`, `GPU`, or `MULTI:CPU,GPU`) via the `device` argument of that method.

- In addition to `MULTI`, see the [documentation](https://docs.openvino.ai/2024/openvino-workflow/running-inference/inference-devices-and-modes.html) for the other supported device options: `AUTO`, `HETERO`, and `BATCH`.
- The `ov_config` argument lets you pass any OpenVINO configuration option as a dictionary. For details, refer to the [OpenVINO Advanced Features](https://docs.openvino.ai/2024/get-started.html#openvino-advanced-features) and [Performance Hints](https://docs.openvino.ai/2024/openvino-workflow/running-inference/optimize-inference/high-level-performance-hints.html) documentation.

```python
from transformers import AutoTokenizer
from optimum.intel import OVModelForFeatureExtraction
import torch
# Sentences we want sentence embeddings for ("样例数据" means "sample data")
sentences = ["样例数据-1", "样例数据-2"]

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')

device = "CPU"
ov_config = {"PERFORMANCE_HINT": "LATENCY", "CACHE_DIR": "./ov_cache"}

# Load the INT8 model exported above; export=False because it is already OpenVINO IR
model = OVModelForFeatureExtraction.from_pretrained(
    model_id='./bge_ov_model_int8', device=device, ov_config=ov_config, export=False
)

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# For an s2p (short query to long passage) retrieval task, prepend an instruction to each query (not to passages):
# encoded_input = tokenizer([instruction + q for q in queries], padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
    # CLS pooling: take the embedding of the first token
    sentence_embeddings = model_output[0][:, 0]
# normalize embeddings
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)

```
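
To see what the different precisions buy you, a rough timing loop like the one below can help. It is only a sketch: it assumes the three export directories created earlier exist, measures wall-clock latency on a fixed batch, and is not a rigorous benchmark:

```python
import time
from transformers import AutoTokenizer
from optimum.intel import OVModelForFeatureExtraction

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
encoded_input = tokenizer(["样例数据-1", "样例数据-2"], padding=True, truncation=True, return_tensors='pt')

for path in ['bge_ov_model_fp16', 'bge_ov_model_int8', 'bge_ov_model_int4']:
    model = OVModelForFeatureExtraction.from_pretrained(path, device="CPU")
    model(**encoded_input)  # warm-up run
    start = time.perf_counter()
    for _ in range(20):
        model(**encoded_input)
    print(f"{path}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms per batch")
```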