@tadejsv
Last active July 11, 2021 00:47
Jina 2.0 example

This script indexes ~800 poem verses from the Hugging Face poem_sentiment dataset: it embeds each verse with a sentence-transformer model and then performs a KNN search over the embeddings using FAISS.

Before running, install all the requirements with these 3 commands:

conda create -n jina-2.0 -c conda-forge -c huggingface faiss-cpu datasets
conda activate jina-2.0
pip install jina sentence-transformers --pre
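
As a quick sanity check (a minimal sketch, assuming the same model the script below uses), you can confirm the embedding dimension that the FAISS index is built with:

from sentence_transformers import SentenceTransformer

# paraphrase-MiniLM-L6-v2 produces 384-dimensional embeddings,
# which is why the script constructs faiss.IndexFlatL2(384)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2', device='cpu')
emb = model.encode(['a mourning man'])  # a list in, a 2D array out
print(emb.shape)  # (1, 384)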

Here's what the output looks like for the search phrase ("a mourning man"):

[0]: sat mournfully guarding their corpses there,
[1]: dearest, why should i mourn, whimper, and whine, i that have yet to live?
[2]: taught by the sorrows that his age had known
[3]: the love that lived through all the stormy past,
[4]: some moment, nailed on sorrow's cross,
[5]: ay, knelt and worshipped on, as love in beauty's bower,
[6]: the crown of sorrow on their heads, their loss
[7]: inexorable death; and claims his right.
[8]: and the words which he utters, are--worship, or die!
[9]: and so i should be loved and mourned to-night.
import faiss
from datasets import load_dataset
from jina import Document, DocumentArray, Executor, Flow, requests
from sentence_transformers import SentenceTransformer


class TransformerEmbed(Executor):  # Embed text using a sentence-transformer model
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = SentenceTransformer('paraphrase-MiniLM-L6-v2', device='cpu')

    @requests
    def embed(self, docs: DocumentArray, **kwargs):
        for d in docs:
            # encode a one-element list, as FAISS needs 2D arrays
            d.embedding = self.model.encode([d.text])


class FaissIndexer(Executor):  # Simple exact FAISS indexer
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._docs = DocumentArray()  # keeps the original documents for lookup
        self._index = faiss.IndexFlatL2(384)  # 384 = embedding dim of the model above

    @requests(on='/index')
    def index(self, docs: DocumentArray, **kwargs):
        self._docs.extend(docs)
        for d in docs:
            self._index.add(d.embedding)

    @requests(on='/search')
    def search(self, docs: DocumentArray, **kwargs):
        for doc in docs:
            dists, matches = self._index.search(doc.embedding, 10)  # top 10 matches
            for d, i in zip(dists[0], matches[0]):
                doc.matches.append(Document(self._docs[int(i)], copy=True, score=d))


def print_matches(req):  # print the top matches of the first query
    for idx, d in enumerate(req.docs[0].matches):
        print(f'[{idx}]: {d.text}')


f = Flow().add(uses=TransformerEmbed, parallel=2).add(uses=FaissIndexer)
with f:
    data = load_dataset('poem_sentiment', split='train')  # verses to index
    f.post('/index', (Document(text=item['verse_text']) for item in data))
    f.post('/search', Document(text='a mourning man'), on_done=print_matches)
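
Stripped of the Jina plumbing, the search step is a plain exact KNN query against faiss.IndexFlatL2. A minimal standalone sketch (with random vectors standing in for the real embeddings):

import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatL2(dim)  # exact (brute-force) L2 index
vectors = np.random.random((800, dim)).astype('float32')
index.add(vectors)  # FAISS expects 2D float32 arrays
dists, ids = index.search(vectors[:1], 10)  # 10 nearest neighbors of the first vector
print(ids[0], dists[0])  # the first hit is the query itself, at distance ~0

Results come back sorted by ascending L2 distance, which is why the verses above are listed from closest to farthest match.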