@do-me
Created May 16, 2024 15:23
semantic_text_splitter with pandarallel multiprocessing
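from pandarallel import pandarallel
from semantic_text_splitter import TextSplitter

pandarallel.initialize(progress_bar=True)  # configure pandarallel, show a progress bar

# Capacity is a (min, max) range in characters, not tokens; 1500-2000 characters
# is roughly the 512-token context window of an embedding model.
splitter = TextSplitter((1500, 2000))

def wrap_func(text):
    return splitter.chunks(text)

# df is assumed to be an existing pandas DataFrame with a string column "text".
df["chunks"] = df["text"].parallel_apply(wrap_func)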
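
To make the (1500, 2000) character range more concrete, here is a minimal standard-library sketch of range-bounded splitting. The function `naive_chunks` is a hypothetical stand-in written for illustration; it is not the algorithm semantic_text_splitter uses, which looks for semantic boundaries rather than plain spaces. It only shows the idea that each chunk is cut somewhere between the minimum and the maximum length when a break point exists, and never exceeds the maximum.

```python
def naive_chunks(text: str, min_chars: int, max_chars: int) -> list[str]:
    """Hypothetical stand-in for illustration, NOT the semantic_text_splitter algorithm.

    Greedily cuts `text` into pieces of at most `max_chars` characters,
    preferring to break at the last space found at or beyond `min_chars`.
    """
    if not 0 < min_chars <= max_chars:
        raise ValueError("expected 0 < min_chars <= max_chars")

    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # Look for the last space inside [start + min_chars, end).
            cut = text.rfind(" ", start + min_chars, end)
            if cut != -1:
                end = cut
        chunk = text[start:end].strip()
        if chunk:
            chunks.append(chunk)
        start = end
    return chunks


sample = "word " * 1000  # 5000 characters of dummy text
sample_chunks = naive_chunks(sample, 1500, 2000)
print([len(c) for c in sample_chunks])
```

As with the real splitter, the final chunk can be shorter than the minimum, because there may simply not be enough text left to fill it.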