Created August 29, 2025 20:45
BGE-M3-Extended-Naija
I have a task for you: build a search interface with a search bar that takes in a Yoruba query and, on clicking search, returns the top 10 results.
Each result is a verse from a Qur'an Surah.
I need you to parse the verses 1 to 135 from the files:
https://github.com/Niger-Volta-LTI/yoruba-text/blob/master/Quran_Mimo/Whole_Yoruba_Quran_0479.txt
The verses are in the files numbered from 0462 up to but not including 0479.
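The parsing step could be sketched as below. The assumption that each Whole_Yoruba_Quran_NNNN.txt file holds one verse per non-empty line is mine, not confirmed by the repository, so treat this as a starting point; fetching the raw file contents (e.g. via the raw.githubusercontent.com URLs) is left out.

```python
def parse_verses(file_texts):
    """Collect verses from a list of raw file contents.

    Assumes one verse per non-empty line; the real Quran_Mimo
    layout may differ, so adjust the splitting logic if needed.
    """
    verses = []
    for text in file_texts:
        for line in text.splitlines():
            line = line.strip()
            if line:  # skip blank lines
                verses.append(line)
    return verses
```

The returned list preserves file and line order, so verse indices line up with the order of the input files.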
You should create a vector index from those verses. You can just keep the embeddings in memory; create them using Hugging Face's "from transformers import AutoTokenizer, AutoModel".
The model name is "[REDACTED]".
You would likely instantiate like:

tokenizer = AutoTokenizer.from_pretrained(model_spec['model_name'])
embed_model = AutoModel.from_pretrained(model_spec['model_name'], **model_spec['kwargs'])

Where model_spec is something like:

model_spec = {'model_name': "[REDACTED]", 'max_length': 8192, 'pooling_type': 'cls', 'vector_type': 'multi-vector',
              'normalize': True, 'batch_size': 8, 'kwargs': {'device_map': 'cpu', 'torch_dtype': torch.float16}}
These helper functions might be useful:

import torch
import torch.nn.functional as F

def mean_pooling(model_output):
    # Average over the sequence dimension of the last hidden state.
    return torch.mean(model_output["last_hidden_state"], dim=1)

def cls_pooling(model_output):
    # Embedding of the first ([CLS]) token.
    return model_output[0][:, 0]

def last_token_pooling(model_output):
    # Embedding of the final token in the sequence.
    return model_output[0][:, -1]

def get_sentence_embedding(text, tokenizer, embed_model, normalize, max_length, pooling_type='cls'):
    # Move inputs to the model's device (model_spec above puts it on CPU).
    device = embed_model.device
    if pooling_type == "last_token":
        # Append the EOS token so the last position summarizes the sequence.
        encoded_input = tokenizer(text, max_length=max_length, return_attention_mask=False, padding=False, truncation=True)
        encoded_input['input_ids'] = encoded_input['input_ids'] + [tokenizer.eos_token_id]
        encoded_input = tokenizer.pad([encoded_input], padding=True, return_attention_mask=True, return_tensors='pt').to(device)
    else:
        encoded_input = tokenizer(text, return_tensors="pt", max_length=max_length, truncation=True).to(device)
    with torch.no_grad():
        model_output = embed_model(**encoded_input)
    if pooling_type == "cls":
        sentence_embeddings = cls_pooling(model_output)
    elif pooling_type == "mean":
        sentence_embeddings = mean_pooling(model_output)
    elif pooling_type == "last_token":
        sentence_embeddings = last_token_pooling(model_output)
    if normalize:
        sentence_embeddings = F.normalize(sentence_embeddings)
    return sentence_embeddings