Embedding Data from a Pandas DataFrame into a Chroma Vector Database using LangChain and Ollama
import pandas as pd
from langchain.schema import Document
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from tqdm import tqdm
# Initialize the embedding model
embedding_model = OllamaEmbeddings(model="nomic-embed-text", show_progress=False)
# Initialize the Chroma vector store once; documents are added later rather than via from_documents.
# Passing the embedding model keeps any later inserts and queries on the same embeddings.
vector_db = Chroma(collection_name="GP_Surgery_Reviews", embedding_function=embedding_model)
def embed_with_chroma(df, embedding_model):
    embeddings = []
    # Process each row in the DataFrame with a progress bar
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        # Create a Document with necessary fields
        document = Document(
            page_content=row['review'],  # Text content for embedding
            metadata={'pcn': row['pcn'], 'surgery': row['surgery']},  # Additional metadata
            id=str(row['index'])  # Unique identifier as string
        )
        # Generate embedding using the correct embedding method
        try:
            # 'embed_documents' expects a list of texts, so we pass a one-item list
            # and then take the first (and only) embedding from the returned list
            embedding = embedding_model.embed_documents([document.page_content])[0]
            embeddings.append((document, embedding))
        except Exception as e:
            print(f"Failed to embed document: {e}")
    return embeddings
# Example DataFrame
data = {
    'index': [1, 2, 3],
    'review': ['Great service!', 'Needs improvement.', 'Very satisfied.'],
    'pcn': ['PCN123', 'PCN456', 'PCN789'],
    'surgery': ['SurgeryA', 'SurgeryB', 'SurgeryC']
}
df = pd.DataFrame(data)
# Get (document, embedding) pairs; this call does not itself write them to Chroma
document_embeddings = embed_with_chroma(df, embedding_model)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

# Placeholder: the original leaves DATA_PATH undefined; point it at your Chroma persist directory
DATA_PATH = "chroma_db"
collection_name = "GP_Surgery_Reviews"

# Initialize your embedding model
embedding_model = OllamaEmbeddings(model="nomic-embed-text", show_progress=True)

# Initialize Chroma; the collection is chosen here, not at query time
chroma = Chroma(
    collection_name=collection_name,
    embedding_function=embedding_model,
    persist_directory=DATA_PATH,
)


def ensure_collection_exists(chroma, collection_name):
    """Return True if the collection holds at least one document.

    The Chroma wrapper creates a missing collection on initialization, so an
    empty result is the practical sign that nothing is stored under this name.
    """
    if chroma.get(limit=1)["ids"]:
        return True
    print(f"Collection {collection_name} does not exist or is empty.")
    return False


def search_similar_documents(chroma, query_text, k=5):
    """Return (document, score) pairs for the documents closest to the query text."""
    try:
        # similarity_search_with_score returns a score alongside each document
        return chroma.similarity_search_with_score(query_text, k=k)
    except Exception as e:
        print(f"An error occurred during the search: {e}")
        return []


# Example usage
if ensure_collection_exists(chroma, collection_name):
    query_text = "Appointment Availability"
    similar_documents = search_similar_documents(chroma, query_text)
    # Display the results (Chroma reports a distance, so lower means closer)
    for doc, score in similar_documents:
        print(f"Document: {doc.page_content}, Similarity Score: {score}")
@janduplessis883
Author

Ollama ChromaDB Pandas DF

This gist contains Python code that embeds text from a Pandas DataFrame using LangChain and Ollama, with a Chroma vector database as the intended store. The main objective is to simplify turning text data from a DataFrame into vector representations that can be stored in a vector database.

To achieve this, the code first creates a sample DataFrame df with four columns: index, review, pcn, and surgery. The review column contains text reviews, which are the input to the embedding process. The index column supplies each document's identifier, and the pcn and surgery columns are attached to each review as metadata.

The code then defines a function, embed_with_chroma, that embeds these reviews with an embedding model. It takes two inputs: the DataFrame and the embedding model. For each row it creates a document holding the review text together with its metadata.
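The row-to-document step can be sketched without any external libraries. This is a minimal illustration, not the gist's code: SimpleDocument is a hypothetical stand-in for LangChain's Document, rows_to_documents is an invented helper name, and the rows are written as plain dictionaries in the shape df.to_dict("records") would give for the sample DataFrame.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for LangChain's Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)
    id: str = ""


def rows_to_documents(rows):
    """Turn row dictionaries into documents: text, metadata, and a string id."""
    documents = []
    for row in rows:
        documents.append(
            SimpleDocument(
                page_content=row["review"],
                metadata={"pcn": row["pcn"], "surgery": row["surgery"]},
                id=str(row["index"]),
            )
        )
    return documents


# The same three rows as the sample DataFrame, written as plain dictionaries
rows = [
    {"index": 1, "review": "Great service!", "pcn": "PCN123", "surgery": "SurgeryA"},
    {"index": 2, "review": "Needs improvement.", "pcn": "PCN456", "surgery": "SurgeryB"},
    {"index": 3, "review": "Very satisfied.", "pcn": "PCN789", "surgery": "SurgeryC"},
]
documents = rows_to_documents(rows)
```

Keeping the text, the metadata, and the identifier separate at this stage means the same records can later be handed to whichever vector store is used.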

For each document, the code then calls the embedding model's embed_documents method. This method expects a list of texts as input and returns a list of embeddings, one for each text. Because the code passes a single-item list, it takes the first (and only) embedding from the result and appends it, paired with its document, to the embeddings list.
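Since embed_documents accepts a list, one possible variation is to pass every review in a single call instead of one call per row. The sketch below only shows the shape of that idea: FakeEmbedder is a hypothetical stand-in that mimics the embed_documents(texts) interface with toy numbers, and embed_in_one_call is an invented helper name, so no Ollama server is needed to run it.

```python
class FakeEmbedder:
    """Hypothetical stand-in exposing the same embed_documents(texts) shape."""

    def __init__(self):
        self.calls = 0

    def embed_documents(self, texts):
        self.calls += 1
        # Toy two-number "embedding": character count and word count
        return [[float(len(t)), float(len(t.split()))] for t in texts]


def embed_in_one_call(texts, embedding_model):
    """Embed all texts with a single embed_documents call and pair them back up."""
    vectors = embedding_model.embed_documents(list(texts))
    return list(zip(texts, vectors))


texts = ["Great service!", "Needs improvement.", "Very satisfied."]
embedder = FakeEmbedder()
pairs = embed_in_one_call(texts, embedder)
```

The trade-off is error handling: the per-row try/except in the gist skips only the row that fails, whereas a single batched call succeeds or fails as a whole.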

By simplifying the step from DataFrame text to embeddings, this code lets users apply vector databases to natural language processing tasks such as information retrieval, clustering, or classification. LangChain supplies the Document and vector store wrappers, and Ollama serves the nomic-embed-text embedding model.
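To make the information retrieval use case concrete, here is a small sketch of ranking stored vectors against a query vector by cosine similarity. It is an illustration under stated assumptions, not what Chroma runs internally: the vectors are made up rather than model output, cosine_similarity and top_k are invented helper names, and the metric a real Chroma collection uses depends on how that collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Rank (label, vector) pairs by cosine similarity to the query, best first."""
    scored = [(label, cosine_similarity(query_vector, vec)) for label, vec in stored]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up 3-dimensional vectors, not real model output
stored = [
    ("review A", [1.0, 0.0, 0.0]),
    ("review B", [0.0, 1.0, 0.0]),
    ("review C", [0.7, 0.7, 0.0]),
]
results = top_k([1.0, 0.1, 0.0], stored, k=2)
```

A vector database performs this kind of nearest-neighbour lookup over the stored embeddings, typically with an index so that it does not have to compare the query against every vector.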
