This tutorial demonstrates how to manually set up a workflow for loading, embedding, and storing documents using GPT4All and Chroma, without relying on LangChain.
Ensure you have the following packages installed:
pip install gpt4all chromadb requests beautifulsoup4
We start by fetching a document from a web source and splitting it into smaller chunks for easier processing.
import requests
from bs4 import BeautifulSoup

def fetch_document(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.get_text()

def split_text(text, chunk_size=500):
    # Slice the text into fixed-size character chunks
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
# Example usage
url = "https://example.com/document"
document_text = fetch_document(url)
chunks = split_text(document_text)
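As a quick sanity check (repeating the splitter from above so it runs standalone), the splitter produces fixed-size character chunks, with the final chunk holding whatever remains:

```python
def split_text(text, chunk_size=500):
    # Slice the text into fixed-size character chunks
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

chunks = split_text("0123456789", chunk_size=4)
print(chunks)  # → ['0123', '4567', '89']
```

Note that the split is purely character-based, so it can cut words or sentences in half; splitting on sentence or paragraph boundaries is a common refinement.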
Now, we embed each text chunk using GPT4All.
# Embed each chunk locally with GPT4All's Embed4All model
from gpt4all import Embed4All

embedder = Embed4All()
embedded_chunks = [embedder.embed(chunk) for chunk in chunks]
We then store these embeddings in a Chroma DB instance for later retrieval.
import chromadb

# Persist the collection to disk so embeddings survive restarts
client = chromadb.PersistentClient(path="path_to_your_database")
db = client.get_or_create_collection("documents")

db.add(
    ids=[str(i) for i in range(len(embedded_chunks))],
    embeddings=embedded_chunks,
    documents=chunks,
)
Finally, implement a function for similarity search within the stored embeddings.
def search_similar_documents(query, embedder, db, top_n=5):
    query_embedding = embedder.embed(query)
    results = db.query(query_embeddings=[query_embedding], n_results=top_n)
    # query returns one result list per query embedding; we sent one query
    return results["documents"][0]

# Example usage
query = "Your search query"
similar_documents = search_similar_documents(query, embedder, db)
This approach gives you flexible, manual control over the entire document embedding and retrieval pipeline, making it well suited to custom implementations that do not depend on LangChain.
@Cdaprod Wow, this is very helpful. I have been trying to do the same thing, and now I understand how to do it.
I have also been working on LLM QA on top of this similar-documents result:
similar_documents = search_similar_documents(query, embedder, db)
I implemented it with gpt4all and without LangChain. It is working for now; the prompt is very basic, but it does give answers.
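Roughly, the QA step can be sketched like this. This is only a sketch: the `build_qa_prompt` helper and the model file name are illustrative, not from the tutorial, and gpt4all downloads the model on first use.

```python
def build_qa_prompt(question, context_chunks):
    # Stuff the retrieved chunks into a very basic QA prompt
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

def answer_question(question, context_chunks, model_name="orca-mini-3b-gguf2-q4_0.gguf"):
    # model_name is just an example; any local gpt4all model file works
    from gpt4all import GPT4All
    model = GPT4All(model_name)  # downloads the model on first use
    return model.generate(build_qa_prompt(question, context_chunks), max_tokens=256)
```

Here `context_chunks` would be the `similar_documents` list returned by the search above.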
We had been passing the website URL through LangChain's WebBaseLoader just for the splits and for storing them in the DB; now I see how to do it in plain Python.
The prompt is still basic, and we need to improve it to get results closer to LangChain's prompts.
The ChromaDB walkthrough, in both the LangChain and the plain implementations, helped a lot.
Thanks again.