@dbasch
Created March 29, 2023 21:55
"""
Download your tweet archive from Twitter.
There will be a file called data/tweets.js. It will contain a single variable
assigned to an array of tweet objects.
Edit it so that only the array remains, and save it as tweets.json.
This requires having chromadb and InstructorEmbedding installed via pip.
"""
from chromadb.config import Settings
from chromadb.utils import embedding_functions
import chromadb
import json
import time
dirname = "mytweets"
#remove the device parameter below if you don't have a cuda-capable gpu
embeddings = embedding_functions.InstructorEmbeddingFunction(device='cuda')
alltweets = json.load(open("tweets.json"))
tweets = [t['tweet'] for t in alltweets if not t['tweet']['full_text'].startswith("RT")]
client = chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",
                                  persist_directory=dirname))
alltweets = json.load(open("tweets.json"))
tweets = [t['tweet'] for t in alltweets if not t['tweet']['full_text'].startswith("RT")]
total = len(tweets)
print(f"we have {total} tweets.")
coll = client.get_or_create_collection("tweets", embedding_function=embeddings)
if coll.count() != total:
    # Embed only when the collection is out of sync with the archive.
    i = 0
    batch_size = 20  # that's how much my gpu can do at a time
    toembed = [t["full_text"] for t in tweets]
    ids = [str(i) for i in range(total)]
    before = time.time()
    while i < len(toembed):
        coll.add(documents=toembed[i:i+batch_size], metadatas=None, ids=ids[i:i+batch_size])
        i += batch_size
        print(f"embedded: {i}")
    t = time.time() - before  # elapsed embedding time, in seconds
while True:
    query = input("query: ")
    try:
        response = coll.query(query_texts=query, n_results=10)
        for i, t in enumerate(response['documents'][0]):
            print(i, t)
    except Exception as e:
        print(e)
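
The manual step from the docstring (leaving only the array in data/tweets.js) can also be scripted. A minimal sketch, assuming the file holds a single assignment whose right-hand side is valid JSON; the helper name `tweets_js_to_json` and its default paths are hypothetical and not part of the original script:

```python
import json


def tweets_js_to_json(src="data/tweets.js", dst="tweets.json"):
    """Drop the leading variable assignment and save the bare array as JSON."""
    with open(src, encoding="utf-8") as f:
        text = f.read()
    # Assumption: everything before the first "[" is the variable assignment.
    start = text.index("[")
    # Assumption: the array ends at the last "]" (this skips a trailing ";").
    end = text.rindex("]") + 1
    tweets = json.loads(text[start:end])
    with open(dst, "w", encoding="utf-8") as f:
        json.dump(tweets, f, ensure_ascii=False)
    return len(tweets)
```

Calling `tweets_js_to_json()` once before running the script should produce the tweets.json file that `json.load` expects, provided the archive matches the assumptions above.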

mkmohangb commented Sep 13, 2023

  1. Lines 18 and 19 can be removed.
  2. I also had to call client.persist() after adding documents to the collection.
  3. This is with chromadb version 0.3.21.
  4. For the latest version (0.4.10), replace the client creation with this line: client = chromadb.PersistentClient(path=dirname)
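
The two client styles mentioned in this list could be folded into one helper that picks between them based on the installed version. This is only a sketch: the names `is_04_or_later` and `make_client` are hypothetical, the 0.4.0 cutoff is an assumption inferred from the two versions quoted above (0.3.21 and 0.4.10), and it assumes chromadb exposes `__version__`.

```python
def is_04_or_later(version):
    """True if a chromadb version string such as "0.4.10" is 0.4.0 or newer."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= (0, 4)


def make_client(dirname):
    """Create a persistent Chroma client matching the installed chromadb."""
    import chromadb  # imported lazily so the version helper works without it

    # Assumption: chromadb.__version__ holds a dotted version string.
    if is_04_or_later(chromadb.__version__):
        # Newer style quoted in point 4.
        return chromadb.PersistentClient(path=dirname)
    # Older style used by the gist; per point 2, call client.persist()
    # after adding documents.
    from chromadb.config import Settings
    return chromadb.Client(Settings(chroma_db_impl="duckdb+parquet",
                                    persist_directory=dirname))
```

In the script, `client = make_client(dirname)` would then stand in for the original client creation.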
