Created
August 16, 2024 11:19
SIMANTIKS API - Structured JSON from raw text
{
"index": 0,
"title": "Semantic Chunking - 3 Methods for Better RAG",
"name": "",
"content": "",
"type": "container",
"path": "000",
"children": [
{
"index": 0,
"title": "",
"name": "Preface: Introduction to Semantic Chunkers in RAG",
"content": "",
"type": "container",
"path": "000:000",
"children": [
{
"index": 0,
"title": "",
"name": "Introduction to Semantic Chunkers for Text Modality in Retrieval-Augmented Generation (RAG)",
"content": "Today, we are going to take a look at the different types of semantic chunkers that we can use to chunk our data for applications like RAG (Retrieval-Augmented Generation) in a more intelligent and effective way. For now, we're going to focus on the text modality, which is what RAG generally uses, although the same approach can be applied to video and audio as well.",
"type": "body",
"path": "000:000:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Introduction to Three Types of Semantic Chunkers",
"content": "I'm going to take you through three different types of semantic chunkers.",
"type": "body",
"path": "000:000:001",
"children": []
},
{
"index": 2,
"title": "",
"name": "Introduction to Semantic Chunkers Library and Usage of Chunkers Intro Notebook in Colab",
"content": "Everything we're working through today is available in the Semantic Chunkers library, and we're going to use the Chunkers Intro notebook. I'll go ahead and open this in Colab.",
"type": "body",
"path": "000:000:002",
"children": []
}
]
},
{
"index": 1,
"title": "Prerequisites",
"name": "",
"content": "",
"type": "container",
"path": "000:001",
"children": [
{
"index": 0,
"title": "",
"name": "Prerequisites Installation: Semantic Chunkers and Hugging Face Datasets",
"content": "First, I'm going to install the prerequisites. You'll need Semantic Chunkers, of course, and Hugging Face Datasets.",
"type": "body",
"path": "000:001:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Data Testing for Chunking Methods: Impact on Latency and Quality of Results",
"content": "We'll be pulling in some data to test these different methods for chunking and to see what difference it makes, especially in terms of latency and the quality of the results.",
"type": "body",
"path": "000:001:001",
"children": []
}
]
},
{
"index": 2,
"title": "Data Setup",
"name": "",
"content": "",
"type": "container",
"path": "000:002",
"children": [
{
"index": 0,
"title": "",
"name": "Introduction to Dataset and Structure of AI arXiv Papers",
"content": "Let's take a look at our dataset. Our dataset contains a set of AI arXiv papers. We can see one of them here. This is the [paper name], and you can see there are a few different sections already. We have the title, the authors, their affiliations, and the abstract. You can either use the full content of the paper or just selected sections; it's up to you.",
"type": "body",
"path": "000:002:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Limitation on Text Due to Resource-Intensive Chunker",
"content": "However, one of these chunkers can be pretty slow and resource-intensive, so I\u2019ve limited the amount of text we're using here. The other two chunkers are pretty fast, so the limitation is mainly there for the slow one.",
"type": "body",
"path": "000:002:001",
"children": []
},
{
"index": 2,
"title": "",
"name": "Requirement of Embedding Model for Semantic Chunking",
"content": "We will need an embedding model to perform our semantic chunking. The versions of semantic chunking we show here all rely on an embedding model to measure the semantic similarity between pieces of text in one way or another.",
"type": "body",
"path": "000:002:002",
"children": []
},
{
"index": 3,
"title": "",
"name": "Use of OpenAI's Text-Embedding-3-Small Model and API Key Requirements",
"content": "In this example, we're going to use OpenAI's Embedding model, specifically the text-embedding-3-small model. You'll need an OpenAI API key for this, but if you prefer not to use an API key, you can use an open-source model here instead. However, I\u2019m going to stick with OpenAI for this demonstration.",
"type": "body",
"path": "000:002:003",
"children": []
}
]
},
{
"index": 3,
"title": "1. Statistical Semantic Chunking",
"name": "",
"content": "",
"type": "container",
"path": "000:003",
"children": [
{
"index": 0,
"title": "",
"name": "Introduction to the Statistical Chunking Method and Its Advantages",
"content": "I've initialized my encoder, and now I\u2019m going to demonstrate the statistical chunking method. This is the chunker I recommend for most people to use right out of the box. The reason for this is that it handles a lot of the parameter adjustments for you. It's cost-effective and pretty fast as well, so this is generally the one I recommend. But we\u2019ll also take a look at the others.",
"type": "body",
"path": "000:003:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Explanation of Statistical Chunker Functionality and Similarity Threshold Calculation",
"content": "The way the statistical chunker works is by identifying a good similarity threshold value for you based on the varying similarity throughout a document. The threshold used for different documents, and even for different parts of the same document, may actually change, but it\u2019s all calculated for you, so it tends to work very well.",
"type": "body",
"path": "000:003:001",
"children": []
},
{
"index": 2,
"title": "",
"name": "Overview of Initial Document Chunking Results and Preliminary Assessment",
"content": "If we take a look here, we have a few chunks generated. We can see that it ran very quickly. The first chunk includes our title, the authors, and the abstract, which is kind of like the introduction to the paper. After that, we have what appears to be the first paragraph of the paper, followed by the second section, and so on. Generally speaking, these chunks look relatively good. Of course, you\u2019ll probably need to review them in a little more detail, but just from looking at the start, it seems pretty reasonable.",
"type": "body",
"path": "000:003:002",
"children": []
}
]
},
{
"index": 4,
"title": "2. Consecutive Semantic Chunking",
"name": "",
"content": "",
"type": "container",
"path": "000:004",
"children": [
{
"index": 0,
"title": "",
"name": "Recommendation Order for Consecutive Chunking Method",
"content": "Next is consecutive chunking, which is probably the second one I would recommend.",
"type": "body",
"path": "000:004:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Score Threshold Requirements for Various Text-Embedding Models",
"content": "It\u2019s also cost-effective and relatively quick but requires a little more tweaking or input from the user, primarily due to the score threshold. Different encoders require different score thresholds. For example, the text-embedding-ada-002 model typically requires a similarity threshold within the range of 0.73 to 0.8. The newer text-embedding models require something different, like 0.3 in this case, which is why I've gone with that.",
"type": "body",
"path": "000:004:001",
"children": []
},
{
"index": 2,
"title": "",
"name": "User Input and Performance Adjustment for Chunker Threshold",
"content": "This chunker requires more user input, and in some cases performance can be better, but it's often harder to get very good performance out of this one. For example, I noticed that it was splitting too frequently, so I lowered the threshold to 0.2, which gave more reasonable results. You might need to go even lower, but this looks better.",
"type": "body",
"path": "000:004:002",
"children": []
},
{
"index": 3,
"title": "",
"name": "Explanation of Consecutive Chunker Functionality",
"content": "This consecutive chunker works by first splitting your text into sentences and then merging them into larger chunks. It looks for a sudden drop in similarity between sentences, which indicates a logical point to split the chunk. That\u2019s how it defines where to make the split.",
"type": "body",
"path": "000:004:003",
"children": []
}
]
},
{
"index": 5,
"title": "3. Cumulative Semantic Chunking",
"name": "",
"content": "",
"type": "container",
"path": "000:005",
"children": [
{
"index": 0,
"title": "",
"name": "Cumulative Chunker Method: Step-by-Step Embedding Process and Similarity Comparison",
"content": "Finally, we have the cumulative chunker. This method starts with the first sentence, then adds the second sentence and creates an embedding of the combined text, then adds the third sentence and creates another embedding, and so on. It compares these embeddings to see if there is a significant change in similarity. If not, it continues adding sentences and creating embeddings.",
"type": "body",
"path": "000:005:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Higher Time and Cost Due to Increased Embeddings Creation",
"content": "The result is that this process takes much longer and is more expensive because you\u2019re creating many more embeddings.",
"type": "body",
"path": "000:005:001",
"children": []
},
{
"index": 2,
"title": "",
"name": "Comparison of Noise Resistance and Performance of Chunkers",
"content": "However, compared to the consecutive chunker, it is more noise-resistant, meaning it requires a more substantial change over time to trigger a split. The results tend to be better than the consecutive chunker's, but in many cases they are on par with or slightly worse than the statistical chunker's. Nonetheless, it's worth trying to see what gives the best performance for your particular use case.",
"type": "body",
"path": "000:005:002",
"children": []
},
{
"index": 3,
"title": "",
"name": "Performance Analysis and Threshold Adjustment of the Chunker",
"content": "We can see that this chunker definitely took longer to run. Let's take a look at the chunks it generated. I probably should have adjusted the threshold here, but as it stands the performance looks slightly worse than with the statistical chunker.",
"type": "body",
"path": "000:005:003",
"children": []
},
{
"index": 4,
"title": "",
"name": "Threshold Adjustment for Improved Performance Over Consecutive Chunker",
"content": "However, with some threshold tweaking, you can generally get better performance than with the consecutive chunker.",
"type": "body",
"path": "000:005:004",
"children": []
}
]
},
{
"index": 6,
"title": "Multi-modal Chunking",
"name": "",
"content": "",
"type": "container",
"path": "000:006",
"children": [
{
"index": 0,
"title": "",
"name": "Introduction to Modalities Handled by Different Chunkers",
"content": "It's also worth noting the differences in modalities that these chunkers can handle.",
"type": "body",
"path": "000:006:000",
"children": []
},
{
"index": 1,
"title": "",
"name": "Statistical Chunker Limitation to Text Modality",
"content": "The statistical chunker, for now, can only handle text modality, which is great for RAG but not so much if you're working with video.",
"type": "body",
"path": "000:006:001",
"children": []
},
{
"index": 2,
"title": "",
"name": "Capabilities and Future Demonstration of the Consecutive Chunker for Video Handling",
"content": "On the other hand, the consecutive chunker is good at handling video, and we have an example of that which I will walk through in the near future.",
"type": "body",
"path": "000:006:002",
"children": []
},
{
"index": 3,
"title": "",
"name": "Text-Focused Nature of the Cumulative Chunker",
"content": "The cumulative chunker is also more text-focused.",
"type": "body",
"path": "000:006:003",
"children": []
}
]
},
{
"index": 7,
"title": "",
"name": "Conclusion and Sign-off for Semantic Chunkers Presentation",
"content": "For now, that\u2019s it on semantic chunkers. I hope this has been useful and interesting. Thank you very much for watching, and I\u2019ll see you again in the next one. Bye!",
"type": "body",
"path": "000:007",
"children": []
}
]
}
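The consecutive and cumulative methods described in sections 2 and 3 of the transcript can be illustrated with a minimal sketch. This is not the Semantic Chunkers library's API or implementation: `toy_embed`, `cosine`, `consecutive_chunks`, and `cumulative_chunks` are hypothetical names, the bag-of-words "embedding" is only a stand-in for a real encoder such as an OpenAI embedding model, and the 0.3 threshold is illustrative. The cumulative variant shown is one plausible reading of the description, comparing the growing chunk against the next sentence.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for a real encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consecutive_chunks(sentences, threshold):
    """Start a new chunk when similarity between neighbouring sentences
    drops below the threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return chunks


def cumulative_chunks(sentences, threshold):
    """Embed the chunk built so far and compare it with the next sentence.
    The growing chunk is re-embedded at every step, which is why this
    approach needs many more (and longer) embedding calls."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for cur in sentences[1:]:
        running_text = " ".join(chunks[-1])
        if cosine(toy_embed(running_text), toy_embed(cur)) < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return chunks


if __name__ == "__main__":
    sentences = [
        "Cats chase mice.",
        "Cats chase birds too.",
        "Stock markets fell today.",
        "Markets fell sharply today.",
    ]
    print(consecutive_chunks(sentences, threshold=0.3))
    print(cumulative_chunks(sentences, threshold=0.3))
```

With a real encoder the threshold would need tuning per model, as the transcript notes. The statistical chunker is not sketched here because the transcript does not describe how it derives its threshold.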
See more examples here: SIMANTIKS API - Semantic Chunking Examples