@tadejsv
Last active July 11, 2021 00:47
Jina 2.0 example

This script indexes ~800 poem verses from the Hugging Face poem_sentiment dataset: it embeds each verse with a sentence-transformer model and then performs a KNN search over the embeddings using FAISS.

Before running, install all the requirements with these 3 commands:

conda create -n jina-2.0 -c conda-forge -c huggingface faiss-cpu datasets
conda activate jina-2.0
pip install jina sentence-transformers --pre
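
As a quick sanity check (a minimal sketch, assuming the same model the script below uses), you can confirm the embedding dimension that the FAISS index is built with:

from sentence_transformers import SentenceTransformer

# paraphrase-MiniLM-L6-v2 produces 384-dimensional embeddings,
# which is why the script constructs faiss.IndexFlatL2(384)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2', device='cpu')
emb = model.encode(['a mourning man'])  # a list in, a 2D array out
print(emb.shape)  # (1, 384)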

Here's what the output looks like for the search phrase ("a mourning man"):

[0]: sat mournfully guarding their corpses there,
[1]: dearest, why should i mourn, whimper, and whine, i that have yet to live?
[2]: taught by the sorrows that his age had known
[3]: the love that lived through all the stormy past,
[4]: some moment, nailed on sorrow's cross,
[5]: ay, knelt and worshipped on, as love in beauty's bower,
[6]: the crown of sorrow on their heads, their loss
[7]: inexorable death; and claims his right.
[8]: and the words which he utters, are--worship, or die!
[9]: and so i should be loved and mourned to-night.
import faiss
from datasets import load_dataset
from jina import Document, DocumentArray, Executor, Flow, requests
from sentence_transformers import SentenceTransformer


class TransformerEmbed(Executor):  # Embed text using a sentence-transformer model
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.model = SentenceTransformer('paraphrase-MiniLM-L6-v2', device='cpu')

    @requests
    def embed(self, docs: DocumentArray, **kwargs):
        for d in docs:
            # encode a one-element list, as FAISS needs 2D arrays
            d.embedding = self.model.encode([d.text])


class FaissIndexer(Executor):  # Simple exact FAISS indexer
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self._docs = DocumentArray()  # keeps the original documents for lookup
        self._index = faiss.IndexFlatL2(384)  # 384 = embedding dim of the model above

    @requests(on='/index')
    def index(self, docs: DocumentArray, **kwargs):
        self._docs.extend(docs)
        for d in docs:
            self._index.add(d.embedding)

    @requests(on='/search')
    def search(self, docs: DocumentArray, **kwargs):
        for doc in docs:
            dists, matches = self._index.search(doc.embedding, 10)  # top 10 matches
            for d, i in zip(dists[0], matches[0]):
                doc.matches.append(Document(self._docs[int(i)], copy=True, score=d))


def print_matches(req):  # print the top matches of the first query
    for idx, d in enumerate(req.docs[0].matches):
        print(f'[{idx}]: {d.text}')


f = Flow().add(uses=TransformerEmbed, parallel=2).add(uses=FaissIndexer)
with f:
    data = load_dataset('poem_sentiment', split='train')  # verses to index
    f.post('/index', (Document(text=item['verse_text']) for item in data))
    f.post('/search', Document(text='a mourning man'), on_done=print_matches)
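
Stripped of the Jina plumbing, the search step is a plain exact KNN query against faiss.IndexFlatL2. A minimal standalone sketch (with random vectors standing in for the real embeddings):

import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatL2(dim)  # exact (brute-force) L2 index
vectors = np.random.random((800, dim)).astype('float32')
index.add(vectors)  # FAISS expects 2D float32 arrays
dists, ids = index.search(vectors[:1], 10)  # 10 nearest neighbors of the first vector
print(ids[0], dists[0])  # the first hit is the query itself, at distance ~0

Results come back sorted by ascending L2 distance, which is why the verses above are listed from closest to farthest match.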