SIMANTIKS API Input - transcript of video: Semantic Chunking - 3 Methods for Better RAG
Semantic Chunking - 3 Methods for Better RAG
Today, we are going to take a look at the different types of semantic chunkers
that we can use to chunk our data for applications like RAG (Retrieval-Augmented
Generation) in a more intelligent and effective way. We're going to focus on the
text modality, which is what RAG generally uses; the same ideas can be applied to
video and audio as well, but for now, let's stick with text.
I'm going to take you through three different types of semantic chunkers.
Everything we're working through today is available in the Semantic Chunkers
library, and we're going to use the Chunkers Intro Notebook, which I'll go ahead
and open in Colab.
Prerequisites
First, I'm going to install the prerequisites. You'll need Semantic Chunkers,
of course, and Hugging Face Datasets. We'll be pulling in some data to test
these different chunking methods and to see what difference each one makes,
especially in terms of latency and the quality of the results.
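The install step is a single pip command; these are the package names the intro
notebook uses (semantic-chunkers on PyPI, plus Hugging Face's datasets):

    !pip install -qU semantic-chunkers datasets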
Data Setup
Let's take a look at our dataset. Our dataset contains a set of AI arXiv
papers. We can see one of them here. This is the [paper name], and you can see
there are a few different sections already. We have the title, the authors,
their affiliations, and the abstract. You can either use the full content of
the paper or just selected sections; it's up to you.
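Loading the data looks roughly like this; the dataset name, field, and cap below
are illustrative, based on the intro notebook, so substitute whatever dataset
you're working with:

    from datasets import load_dataset

    # a dataset of AI arXiv papers (dataset name is illustrative)
    dataset = load_dataset("jamescalam/ai-arxiv2", split="train")

    # take the full text of one paper, capped to keep the slowest
    # chunker (discussed below) manageable; the cap is arbitrary
    content = dataset[3]["content"][:20_000]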
However, one of these chunkers can be pretty slow and resource-intensive, so
I've limited the amount of text we're using here. The other two chunkers are
pretty fast, so the limit is really only there for the slow one. We will also
need an embedding model to perform our semantic chunking: every version of
semantic chunking we show here relies on an embedding model, in one way or
another, to measure the semantic similarity between pieces of text.
In this example, we're going to use OpenAI's embedding model, specifically the
text-embedding-ada-002 model. You'll need an OpenAI API key for this; if you'd
prefer not to use an API key, you can use an open-source model instead. However,
I'm going to stick with OpenAI for this demonstration.
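Initializing the encoder might look like this; note the OpenAIEncoder class
comes from the semantic_router package, which Semantic Chunkers builds on, and a
HuggingFaceEncoder is available from the same module if you want to stay open
source:

    import os
    from semantic_router.encoders import OpenAIEncoder

    # the encoder reads the OpenAI API key from the environment
    os.environ["OPENAI_API_KEY"] = "sk-..."

    encoder = OpenAIEncoder(name="text-embedding-ada-002")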
1. Statistical Semantic Chunking
I've initialized my encoder, and now I’m going to demonstrate the statistical
chunking method. This is the chunker I recommend for most people to use right
out of the box. The reason for this is that it handles a lot of the parameter
adjustments for you. It's cost-effective and pretty fast as well, so this is
generally the one I recommend. But we’ll also take a look at the others.
The statistical chunker works by identifying a good similarity threshold for
you, based on how similarity varies throughout a document. The right threshold
can differ between documents, and even between parts of the same document, but
it's all calculated for you, so it tends to work very well.
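In code, it's just a couple of lines; the call pattern and the print helper here
follow the intro notebook:

    from semantic_chunkers import StatisticalChunker

    chunker = StatisticalChunker(encoder=encoder)

    # one list of chunks is returned per input document
    chunks = chunker(docs=[content])
    chunker.print(chunks[0])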
If we take a look here, we have a few chunks generated. We can see that it ran
very quickly. The first chunk includes our title, the authors, and the abstract,
which is kind of like the introduction to the paper. After that, we have what
appears to be the first paragraph of the paper, followed by the second section,
and so on. Generally speaking, these chunks look relatively good. Of course, you’ll
probably need to review them in a little more detail, but just from looking at the
start, it seems pretty reasonable.
2. Consecutive Semantic Chunking
Next is consecutive chunking, which is probably the second one I would recommend.
It's also cost-effective and relatively quick, but it requires a little more
tweaking or input from the user, primarily because of the score threshold.
Different encoders need different score thresholds. For example, the
text-embedding-ada-002 model typically works with a similarity threshold in the
range of 0.73 to 0.8, while the newer text-embedding models need something much
lower, like the 0.3 I've gone with here.
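As a sketch, with the score threshold set explicitly and the same call pattern
as before:

    from semantic_chunkers import ConsecutiveChunker

    # the threshold is encoder-dependent: roughly 0.73-0.8 for
    # text-embedding-ada-002, around 0.3 for the newer models
    chunker = ConsecutiveChunker(encoder=encoder, score_threshold=0.3)
    chunks = chunker(docs=[content])
    chunker.print(chunks[0])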
This chunker requires more user input, and in some cases it can perform better,
though it's often harder to get very good performance out of it. For example, I
noticed that it was splitting too frequently, so I adjusted the threshold down
to 0.2, which gave more reasonable results. You might need to go even lower, but
this looks better.
This consecutive chunker works by first splitting your text into sentences and then
merging them into larger chunks. It looks for a sudden drop in similarity between
sentences, which indicates a logical point to split the chunk. That’s how it defines
where to make the split.
3. Cumulative Semantic Chunking
Finally, we have the cumulative chunker. This method starts with the first sentence,
then adds the second sentence to create an embedding, then adds the third sentence
to create another embedding, and so on. It compares these embeddings to see if there
is a significant change in similarity. If not, it continues adding sentences and
creating embeddings.
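The call looks the same as the consecutive chunker; only the class changes
(again following the intro notebook's API):

    from semantic_chunkers import CumulativeChunker

    chunker = CumulativeChunker(encoder=encoder, score_threshold=0.3)

    # noticeably slower: it re-embeds the growing chunk after every added sentence
    chunks = chunker(docs=[content])
    chunker.print(chunks[0])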
The result is that this process takes much longer and is more expensive, because
you're creating many more embeddings. However, compared to the consecutive
chunker, it is more noise-resistant: it requires a more substantial change over
time to trigger a split. The results tend to be better than the consecutive
chunker's, but in many cases they are on par with, or slightly worse than, the
statistical chunker's. Nonetheless, it's worth trying to see what gives the best
performance for your particular use case.
We can see that this chunker definitely took longer to run. Let's take a look at
the chunks it generated. I probably should have adjusted the threshold here, and
the results do look slightly worse than the statistical chunker's. However, with
some threshold tweaking, you can generally get better performance than with the
consecutive chunker.
Multi-modal Chunking
It's also worth noting the differences in modalities that these chunkers can handle.
The statistical chunker, for now, can only handle text modality, which is great for
RAG but not so much if you're working with video. On the other hand, the consecutive
chunker is good at handling video, and we have an example of that which I will walk
through in the near future. The cumulative chunker is also more text-focused.
For now, that’s it on semantic chunkers. I hope this has been useful and
interesting. Thank you very much for watching, and I’ll see you again in the next one. Bye!