
@HAKSOAT
Created August 29, 2025 20:45
BGE-M3-Extended-Naija
I have a task for you: build a search interface with a search bar that takes in a Yoruba query and, on clicking Search, returns the top 10 results.
Each result is a verse from a Qur'an Surah.
I need you to parse verses 1 to 135 from files named like:
https://github.com/Niger-Volta-LTI/yoruba-text/blob/master/Quran_Mimo/Whole_Yoruba_Quran_0479.txt
The verses are in the files numbered from 0462 up to, but not including, 0479.
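The fetching step could be sketched as below. Note the assumptions: the raw-content URL is reconstructed from the blob URL above, and each non-empty line of a file is treated as one verse; the actual layout inside the files may differ.

```python
# Sketch: fetch the numbered verse files and collect their lines.
# ASSUMPTION: each non-empty line is a verse; adjust once you inspect a file.
from urllib.request import urlopen

RAW_BASE = ("https://raw.githubusercontent.com/Niger-Volta-LTI/"
            "yoruba-text/master/Quran_Mimo/")

def file_url(n: int) -> str:
    """Build the raw-content URL for file number n (zero-padded to 4 digits)."""
    return f"{RAW_BASE}Whole_Yoruba_Quran_{n:04d}.txt"

def load_verses(start: int = 462, stop: int = 479) -> list[str]:
    """Fetch files start..stop-1 and gather non-empty lines as verses."""
    verses = []
    for n in range(start, stop):
        text = urlopen(file_url(n)).read().decode("utf-8")
        verses.extend(line.strip() for line in text.splitlines() if line.strip())
    return verses
```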
You should create a vector index from those verses; you can simply keep the embeddings in memory. Create the embeddings with Hugging Face, using "from transformers import AutoTokenizer, AutoModel".
The model name is "[REDACTED]".
You would likely instantiate like:

tokenizer = AutoTokenizer.from_pretrained(model_spec['model_name'])
embed_model = AutoModel.from_pretrained(model_spec['model_name'], **model_spec['kwargs'])

where model_spec is something like:

{'model_name': "[REDACTED]", 'max_length': 8192, 'pooling_type': 'cls',
 'vector_type': 'multi-vector', 'normalize': True, 'batch_size': 8,
 'kwargs': {'device_map': 'cpu', 'torch_dtype': torch.float16}}
These helper functions might be useful:

import torch
import torch.nn.functional as F

def mean_pooling(model_output):
    # Average the token embeddings across the sequence dimension.
    return torch.mean(model_output["last_hidden_state"], dim=1)

def cls_pooling(model_output):
    # Embedding of the first ([CLS]) token.
    return model_output[0][:, 0]

def last_token_pooling(model_output):
    # Embedding of the final token (e.g. an appended EOS).
    return model_output[0][:, -1]

def get_sentence_embedding(text, tokenizer, embed_model, normalize, max_length, pooling_type='cls'):
    # Send inputs to wherever the model lives, rather than hard-coding "cuda"
    # (the model_spec above maps the model to cpu).
    device = next(embed_model.parameters()).device
    if pooling_type == "last_token":
        # Tokenize without padding, append EOS, then pad as a batch of one.
        encoded_input = tokenizer(text, max_length=max_length, return_attention_mask=False,
                                  padding=False, truncation=True)
        encoded_input['input_ids'] = encoded_input['input_ids'] + [tokenizer.eos_token_id]
        encoded_input = tokenizer.pad([encoded_input], padding=True,
                                      return_attention_mask=True, return_tensors='pt').to(device)
    else:
        encoded_input = tokenizer(text, return_tensors="pt", max_length=max_length,
                                  truncation=True).to(device)
    with torch.no_grad():
        model_output = embed_model(**encoded_input)
    if pooling_type == "cls":
        sentence_embeddings = cls_pooling(model_output)
    elif pooling_type == "mean":
        sentence_embeddings = mean_pooling(model_output)
    elif pooling_type == "last_token":
        sentence_embeddings = last_token_pooling(model_output)
    if normalize:
        sentence_embeddings = F.normalize(sentence_embeddings)
    return sentence_embeddings
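The in-memory index and top-10 lookup could then look like this sketch. It takes any embedding callable (for example, a small wrapper around a sentence-embedding function like the one above) so the ranking logic stands alone; because the vectors are L2-normalized, a plain dot product gives cosine similarity.

```python
# Sketch: in-memory vector index plus top-k retrieval.
# ASSUMPTION: embed_fn(text) returns a (1, dim) L2-normalized torch tensor.
import torch

def build_index(verses, embed_fn):
    """Embed every verse once; rows of the result are normalized vectors."""
    return torch.cat([embed_fn(v) for v in verses], dim=0)  # (num_verses, dim)

def search(query, index, verses, embed_fn, top_k=10):
    """Return the top_k (verse, score) pairs by cosine similarity."""
    q = embed_fn(query).squeeze(0)          # (dim,)
    scores = index @ q                      # dot product == cosine sim here
    top = torch.topk(scores, k=min(top_k, len(verses)))
    return [(verses[int(i)], float(s)) for i, s in zip(top.indices, top.values)]
```

A search-bar handler would then just call search(query, index, verses, embed_fn) and render the returned pairs.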