@Cdaprod
Last active April 8, 2024 06:35

Tutorial: Implementing GPT4All Embeddings and Chroma DB without Langchain

This tutorial demonstrates how to manually set up a workflow for loading, embedding, and storing documents using GPT4All and Chroma DB, without the need for Langchain.

Prerequisites

Ensure you have the following packages installed:

pip install gpt4all chromadb requests beautifulsoup4
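
If you want to confirm the install before continuing, a quick import check works. This assumes a gpt4all version that ships the Embed4All class and a chromadb version (0.4+) that provides PersistentClient:

import chromadb
import requests
from bs4 import BeautifulSoup
from gpt4all import Embed4All

print("chromadb", chromadb.__version__)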

Step 1: Load and Split the Document

We start by fetching a document from a web source and splitting it into smaller chunks for easier processing.

import requests
from bs4 import BeautifulSoup

def fetch_document(url):
    # Download the page and strip the HTML, keeping only visible text.
    response = requests.get(url)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup.get_text()

def split_text(text, chunk_size=500):
    # Naive fixed-width split: chunks may cut words and sentences in half.
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

# Example usage
url = "https://example.com/document"
document_text = fetch_document(url)
chunks = split_text(document_text)
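
A fixed-width split can cut a sentence in half right where the meaning lives, which degrades embedding quality. A minimal overlapping variant is sketched below; split_text_overlapping and chunk_overlap are our own names, not from any library:

def split_text_overlapping(text, chunk_size=500, chunk_overlap=50):
    # Slide a chunk_size window in steps of chunk_size - chunk_overlap,
    # so consecutive chunks share some context at their boundary.
    step = chunk_size - chunk_overlap
    return [text[i:i+chunk_size] for i in range(0, len(text), step)]

chunks = split_text_overlapping(document_text)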

Step 2: Embedding Text Chunks

Now we embed each text chunk. In the gpt4all Python bindings this is done with the Embed4All class, which runs a local embedding model.

from gpt4all import Embed4All

embedder = Embed4All()  # downloads a small embedding model on first use

embedded_chunks = [embedder.embed(chunk) for chunk in chunks]
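
embed returns a plain list of floats. Before inserting anything into the vector store, it is worth confirming that every chunk produced a vector of the same dimensionality:

dims = {len(vec) for vec in embedded_chunks}
assert len(dims) == 1, f"inconsistent embedding sizes: {dims}"
print(f"{len(embedded_chunks)} chunks, {dims.pop()} dimensions each")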

Step 3: Storing in Chroma DB

We then store these embeddings in a Chroma collection for later retrieval. Chroma is accessed through a client object; PersistentClient writes the data to disk so it survives restarts. We also store the raw chunk text alongside each embedding so that search results can return the original passage.

import chromadb

client = chromadb.PersistentClient(path="path_to_your_database")
collection = client.get_or_create_collection(name="documents")

collection.add(
    ids=[str(i) for i in range(len(embedded_chunks))],
    embeddings=embedded_chunks,
    documents=chunks,
)
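
Because PersistentClient keeps the collection on disk, a later session can reopen it without re-embedding anything:

# In a new process, reopen the same store and collection.
client = chromadb.PersistentClient(path="path_to_your_database")
collection = client.get_collection(name="documents")
print(collection.count())  # number of stored chunks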

Step 4: Similarity Search

Finally, implement a function for similarity search over the stored embeddings. Chroma's query method takes a list of query embeddings and returns the nearest stored entries.

def search_similar_documents(query, embedder, collection, top_n=5):
    # Embed the query with the same model used for the chunks,
    # then ask Chroma for the top_n nearest neighbours.
    query_embedding = embedder.embed(query)
    results = collection.query(query_embeddings=[query_embedding], n_results=top_n)
    return results

# Example usage
query = "Your search query"
similar_documents = search_similar_documents(query, embedder, collection)
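
query returns a dict of parallel lists, one inner list per query embedding, so the matches for our single query sit at index 0 of each field:

for doc_id, text in zip(similar_documents["ids"][0],
                        similar_documents["documents"][0]):
    print(f"[{doc_id}] {text[:100]}...")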

This approach gives you flexible, manual control over the entire document embedding and retrieval pipeline, making it a good fit for custom implementations that don't rely on Langchain.

@niranjanadas

Can you please show a plain GPT4All embeddings and Chroma DB implementation, without any Langchain support? We just want to build a better intuition.

@Cdaprod (Author) commented Nov 29, 2023

Can you please show a plain GPT4All embeddings and Chroma DB implementation, without any Langchain support? We just want to build a better intuition.

Consider it done :)

I've outlined a hypothetical step-by-step and added it as a markdown file to the gist.

@niranjanadas

@Cdaprod Wow, this is very helpful. I have also been trying to do the same, and now I understand how to do it.

I have also been working on LLM QA on top of this similar-documents result:
similar_documents = search_similar_documents(query, embedder, db)

I have implemented gpt4all without Langchain for that. It's only just working for now and the prompt is very basic, but it does give answers. Take a look:

from gpt4all import GPT4All

llm = GPT4All("path_to_models/mistral-7b-openorca.Q8_0.gguf", allow_download=False)

# vectorstore = Chroma(persist_directory="./fight_club", embedding_function=GPT4AllEmbeddings())  # using langchain

def answer_question(question):
    # Fetch similar documents from the vector DB
    similar_documents = vectorstore.similarity_search(question, k=3)  # using langchain
    similar_documents = search_similar_documents(question, embedder, db)  # we can use your implementation also to fetch

    print(f"similar_documents : {similar_documents}")

    context = "Question: " + question + "\n\n"
    
    for idx, doc in enumerate(similar_documents, start=1):
        context += f"Document {idx}:\n"
        context += f"Source: {doc.metadata['source']}\n"
        context += f"Title: {doc.metadata['title']}\n"
        context += f"Content: {doc.page_content}\n\n"

    # Generate an answer based on the context
    response = llm.generate(context, streaming=True)
    for token in response:
        print(token, end='', flush=True)
    print("\n")

# Start the conversation loop
with llm.chat_session():
    while True:
        question = input("Ask Question to Docs: ")
        answer_question(question)

We are just passing a website URL through Langchain's WebBaseLoader for the splits and for storing in the DB; now I've got how to do it with the plain APIs.
The prompt is just basic; we need to improve it, like Langchain's prompts, for better results.
The ChromaDB implementations, both the Langchain one and the plain one, helped a lot.

Thanks again.

@Devanshu-17 commented Feb 6, 2024

@Cdaprod @niranjanadas

Hello, I am trying to use GPT4All and ChromaDB without Langchain, but I am getting an error. I tried to follow both of your instructions but am not able to resolve it. Can you please take a look at my code snippet?

Thanks in advance

import requests
from bs4 import BeautifulSoup
from gpt4all import GPT4All, Embed4All
import chromadb
import uuid

def fetch_document(url):
    try:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        return soup.get_text()
    except Exception as e:
        print(f"Error fetching document: {e}")
        return None

def split_text(text, chunk_size=500):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

def store_embeddings(embedded_chunks):
    try:
        client = chromadb.PersistentClient(path="/data/chroma.sqlite3")
        collection = client.get_or_create_collection(name="my_collection")
        
        for embedded_chunk in embedded_chunks:
            chunk_id = str(uuid.uuid4())
            collection.add(embeddings=embedded_chunk, ids=[chunk_id])
        
        return collection  # Return the collection object
    except Exception as e:
        print(f"Error storing embeddings: {e}")
        return None

def search_similar_documents(query, embedder, db, top_n=5):
    try:
        query_embedding = embedder.embed(query)
        similar_docs = db.query(query_embedding, top_n=top_n)
        return similar_docs
    except Exception as e:
        print(f"Error searching similar documents: {e}")
        return []

# Initialize the Chroma database
try:
    db = chromadb.PersistentClient(path="/data/chroma.sqlite3")
except Exception as e:
    print(f"Error initializing Chroma database: {e}")
    db = None

# Example usage
url = "https://docs.trychroma.com/"
document_text = fetch_document(url)
if document_text:
    chunks = split_text(document_text)
    embedder = Embed4All()
    embedded_chunks = [embedder.embed(chunk) for chunk in chunks]
    collection = store_embeddings(embedded_chunks)  # Store the collection object
else:
    print("Failed to fetch document.")

llm = GPT4All("/llama-2-7b-chat.Q4_0.gguf", allow_download=False)

def answer_question(question, embedder, db, collection):
    try:
        similar_documents = search_similar_documents(question, embedder, db)
        print(f"similar_documents : {similar_documents}")

        context = "Question: " + question + "\n\n"
        
        for idx, doc in enumerate(similar_documents, start=1):
            context += f"Document {idx}:\n"
            context += f"Source: {doc.metadata['source']}\n"
            context += f"Title: {doc.metadata['title']}\n"
            context += f"Content: {doc.page_content}\n\n"

        response = llm.generate(context, streaming=True)
        for token in response:
            print(token, end='', flush=True)
        print("\n")
    except Exception as e:
        print(f"Error answering question: {e}")

# Start the conversation loop
with llm.chat_session():
    while True:
        question = input("Ask Question to Docs: ")
        answer_question(question, embedder, db, collection)  # Pass the collection object to the function

Error:

bert_load_from_file: gguf version     = 2
bert_load_from_file: gguf alignment   = 32
bert_load_from_file: gguf data offset = 695552
bert_load_from_file: model name           = BERT
bert_load_from_file: model architecture   = bert
bert_load_from_file: model file type      = 1
bert_load_from_file: bert tokenizer vocab = 30522
Ask Question to Docs: What is Chroma?
Error searching similar documents: 'Client' object has no attribute 'query'
similar_documents : []
