Created August 16, 2024 11:12
SIMANTIKS API Input - transcript of video: Semantic Chunking - 3 Methods for Better RAG
Semantic Chunking - 3 Methods for Better RAG
Today, we are going to take a look at the different types of semantic chunkers
that we can use to chunk our data for applications like RAG (Retrieval-Augmented
Generation) in a more intelligent and effective way. We can apply this to video
and audio as well, but for now we're going to focus on the text modality, which
is what's generally used for RAG.
I'm going to take you through three different types of semantic chunkers.
Everything we're working through today is available in the Semantic Chunkers
library, and we're going to use the Chunkers Intro Notebook, which I'll go ahead
and open in Colab.
Prerequisites
First, I'm going to install the prerequisites. You'll need Semantic Chunkers,
of course, and Hugging Face Datasets. We'll be pulling in some data to test
these different chunking methods and to see what difference each makes, especially
in terms of latency and the quality of the results.
Data Setup
Let's take a look at our dataset. It contains a set of AI ArXiv
papers, and we can see one of them here. This is the [paper name], and you can see
there are a few different sections already: the title, the authors,
their affiliations, and the abstract. You can either use the full content of
the paper or just selected sections; it's up to you.
However, one of these chunkers can be pretty slow and resource-intensive, so
I've limited the amount of text we're using here. The other two chunkers are
pretty fast, so the limitation mainly applies to the first one. We will also need
an embedding model to perform our semantic chunking: all of the semantic
chunking methods we show here rely on embedding models to measure the semantic
similarity between pieces of text in one way or another.
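To make "semantic similarity" concrete, here is a minimal sketch of the cosine similarity computation these chunkers rely on. The vectors below are made up for illustration; real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors: 1.0 means same
    direction (very similar meaning), 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two made-up "embeddings" pointing in similar directions.
print(cosine_similarity([1.0, 0.0, 0.5], [0.9, 0.1, 0.4]))  # close to 1.0
```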
In this example, we're going to use OpenAI's text-embedding-ada-002 model.
You'll need an OpenAI API key for this, but if you prefer not to use an API
key, you can swap in an open-source model here instead. However, I'm going to
stick with OpenAI for this demonstration.
1. Statistical Semantic Chunking
I've initialized my encoder, and now I'm going to demonstrate the statistical
chunking method. This is the chunker I recommend most people use right
out of the box, because it handles a lot of the parameter
adjustments for you. It's cost-effective and pretty fast as well, so this is
generally the one I recommend. But we'll also take a look at the others.
The statistical chunker works by identifying a good similarity threshold
for you based on how similarity varies throughout a document. The appropriate
threshold can differ between documents, and even between parts of the same
document, but it's all calculated for you, so it tends to work very well.
If we take a look here, we have a few chunks generated, and we can see that it
ran very quickly. The first chunk includes our title, the authors, and the
abstract, which is kind of like the introduction to the paper. After that, we
have what appears to be the first paragraph of the paper, followed by the second
section, and so on. Generally speaking, these chunks look relatively good. Of
course, you'll probably need to review them in a little more detail, but just
from looking at the start, it seems pretty reasonable.
2. Consecutive Semantic Chunking
Next is consecutive chunking, which is probably the second one I would recommend.
It's also cost-effective and relatively quick but requires a little more tweaking
or input from the user, primarily due to the score threshold. Different encoders
require different score thresholds. For example, the text-embedding-ada-002 model
typically requires a similarity threshold in the range of 0.73 to 0.8, while the
newer text-embedding models require something different, like 0.3 in this case,
which is why I've gone with that.
This chunker requires more user input, and in some cases performance can be
better. However, it's often harder to achieve very good performance with this
one. For example, I noticed that it was splitting too frequently, so I adjusted
the threshold to 0.2, which gave more reasonable results. You might need to go
even lower, but this looks better.
The consecutive chunker works by first splitting your text into sentences and
then merging them into larger chunks. It looks for a sudden drop in similarity
between sentences, which indicates a logical point to split; that's how it
defines where to make each split.
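The split-on-similarity-drop behavior described above can be sketched like this. Again this is an illustrative toy, not the library's code: sentences and their embeddings are hard-coded, and the threshold plays the role of the user-tuned score threshold discussed earlier.

```python
def consecutive_chunks(sentences, embeddings, threshold=0.5):
    """Merge sentences into chunks; start a new chunk whenever the
    similarity between consecutive sentence embeddings drops below
    `threshold`."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cos(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))  # similarity dropped: split here
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

sents = ["Cats purr.", "Kittens meow.", "Stocks rose.", "Markets rallied."]
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy embeddings
print(consecutive_chunks(sents, emb, threshold=0.5))
# ['Cats purr. Kittens meow.', 'Stocks rose. Markets rallied.']
```

Lowering the threshold, as in the transcript's move from 0.3 to 0.2, makes the condition harder to trigger, so the chunker splits less often and produces larger chunks.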
3. Cumulative Semantic Chunking
Finally, we have the cumulative chunker. This method starts with the first
sentence, then adds the second sentence to create an embedding, then adds the
third sentence to create another embedding, and so on. It compares these
embeddings to see if there is a significant change in similarity. If not, it
continues adding sentences and creating embeddings.
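The grow-and-compare loop just described can be sketched as follows. One big simplification: here a "chunk embedding" is just the mean of toy sentence vectors, whereas the real method re-embeds the accumulated text each time it grows, which is exactly what makes it slower and more expensive.

```python
def cumulative_chunks(sentences, embeddings, threshold=0.9):
    """Grow a chunk sentence by sentence, comparing the embedding of the
    chunk so far with the embedding of the chunk plus the next sentence;
    a large similarity drop triggers a split."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    def mean(vectors):  # stand-in for re-embedding the accumulated text
        return [sum(col) / len(col) for col in zip(*vectors)]

    chunks, start = [], 0
    for i in range(1, len(sentences)):
        so_far = mean(embeddings[start:i])
        extended = mean(embeddings[start:i + 1])
        if cos(so_far, extended) < threshold:  # sentence i shifts the meaning
            chunks.append(" ".join(sentences[start:i]))
            start = i
    chunks.append(" ".join(sentences[start:]))
    return chunks

sents = ["Cats purr.", "Kittens meow.", "Stocks rose.", "Markets rallied."]
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]  # toy embeddings
print(cumulative_chunks(sents, emb, threshold=0.9))
```

Because each comparison involves the whole accumulated chunk rather than a single sentence pair, one slightly off-topic sentence moves the chunk embedding only a little, which is the noise resistance mentioned below.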
The result is that this process takes much longer and is more expensive, because
you're creating many more embeddings. However, compared to the consecutive
chunker, it is more noise-resistant, meaning it requires a more substantial
change over time to trigger a split. The results tend to be better than the
consecutive chunker's, but usually on par with or slightly worse than the
statistical chunker's. Nonetheless, it's worth trying each to see what gives the
best performance for your particular use case.
We can see that this chunker definitely took longer to run. Let's take a look at
the chunks it generated. While I probably should have adjusted the threshold
here, it's clear that the performance might be slightly worse than the
statistical chunker. However, with some threshold tweaking, you can generally
get better performance than with the consecutive chunker.
Multi-modal Chunking
It's also worth noting the differences in modalities that these chunkers can
handle. The statistical chunker, for now, can only handle the text modality,
which is great for RAG but not so much if you're working with video. On the
other hand, the consecutive chunker is good at handling video, and we have an
example of that which I will walk through in the near future. The cumulative
chunker is also more text-focused.
For now, that’s it on semantic chunkers. I hope this has been useful and | |
interesting. Thank you very much for watching, and I’ll see you again in the next one. Bye! |
See more examples here: SIMANTIKS API - Semantic Chunking Examples