
Context

I was explaining to a non-technical stakeholder, in the context of Artificial Intelligence, how the process of dividing a large text into smaller texts (chunks) works. We used a library called LangChain for this process, so the explanation was about the internals of how LangChain creates the chunks from the input text.

This investigation started because we wanted to understand (and, if possible, control) how this library was breaking the text down to make the chunks.

PS: this conversation happened in a Slack channel on October 26th, 2023, and the content here is simply a copy-and-paste of the conversation.

PS²: Pinecone, mentioned at the end of the conversation, is a vector database. We use it to store the text chunks, so we can query it with some piece of text and it will return the most similar chunks it has stored.
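For illustration, here is a minimal sketch of that kind of similarity query, assuming the Python LangChain + Pinecone client APIs as they existed around late 2023 (the index name, credentials, and query text are placeholders, not our real values):

```python
import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Placeholder credentials -- swap in real values.
pinecone.init(api_key="YOUR_PINECONE_KEY", environment="YOUR_ENV")

# Wrap an existing Pinecone index as a LangChain vector store;
# the embedding model turns the query text into a vector.
store = Pinecone.from_existing_index(
    index_name="text-chunks",     # hypothetical index name
    embedding=OpenAIEmbeddings(),
)

# Give it a piece of text; get back the most similar stored chunks.
for doc in store.similarity_search("Brazil is beautiful", k=4):
    print(doc.page_content)
```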

---- the conversation starts here ----

Stenio Wagner [1:20 AM]:

Ok, I cracked how the chunks are made.

Some considerations:

  • Currently, the functionality indicated by the LLM that we're using to break down generic texts into chunks is the RecursiveCharacterTextSplitter (a configuration sketch follows this list). It works using a strategy called "recursion", where we basically keep splitting up the same piece of text until we reach a certain condition.

  • This condition basically checks whether the chunk we currently generated has a certain number of characters (let's call it chunk_size). If we already reached that number, we can consider that we generated a valid chunk. If not, we'll keep taking the next character in the text until we reach a chunk with the required length.

  • The algorithm will split the text using the following characters (in the order presented) - let's call them divider_characters: \n\n, \n, ' ' (empty space), '' (the empty string).

  • Each \n represents a line break in the text.

  • The \n\n is only used once in order to group the paragraphs.
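To make the setup concrete, this is roughly how the splitter can be configured in Python LangChain (late-2023 API); the separators are the divider_characters above, and chunk_size=10 matches the walkthrough further down:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # the divider_characters, in order
    chunk_size=10,    # max characters per chunk
    chunk_overlap=0,  # no shared characters between chunks (for now)
)

chunks = splitter.split_text("Beloved Brazil\n\nRoses are red")
```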

The algorithm

This is a high-level explanation of the execution of this algorithm!

1 - Split the text using the \n\n

2 - For each group created after the split, do:

2.1 - Split up the group using the next character in divider_characters

2.2 - For each group created after this split, do:

2.2.1 - Check if the group has fewer characters than the number specified in chunk_size.

2.2.2 - If this group has fewer characters than chunk_size, we concatenate it with the next group (as long as the sum of their lengths is less than or equal to chunk_size).

2.2.3 - If not, and we still have some divider_character to try against the group, repeat this loop with the next divider_character.
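The sketch below is not LangChain's actual code; it's just a minimal Python reimplementation of the strategy described in the steps above, so you can follow the walkthrough in the next section against something concrete:

```python
def recursive_split(text, separators, chunk_size):
    """Sketch of the recursive splitting strategy described above."""
    sep, *rest = separators
    # Split on the current divider; the empty-string divider means
    # "split into individual characters".
    pieces = text.split(sep) if sep else list(text)

    chunks = []
    current = ""  # the chunk being assembled by merging small pieces
    for piece in pieces:
        if len(piece) > chunk_size and rest:
            # Piece is still too big: flush the chunk in progress and
            # recurse into the piece with the next divider.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
        else:
            # Try to merge this piece into the chunk in progress.
            candidate = current + sep + piece if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
            else:
                # Merging would exceed chunk_size: the chunk in
                # progress is done; start a new one with this piece.
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```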

Running

As an example, consider:

divider_characters = \n\n, \n, ' ' (empty space), '' (empty string)

chunk_size = 10

The text-content:

"Beloved Brazil

Roses are red
Violets are blue
Brazil is beautiful
How are you ?"

1 - Divide the text using \n\n

Result:

Group 1: Beloved Brazil

Group 2: Roses are red\nViolets are blue\nBrazil is beautiful\nHow are you ?

2.1 - Analyzing Group 1:

The length of this group is greater than chunk_size, so we have to break it again using the next character in divider_characters (\n). But since the group doesn't contain any \n, the split won't change anything, and we move on to the next divider_character valid for this group: ' ' (empty space).

2.1.1 - When we split up the group using the empty space, we have:

Group 1.1: Beloved

Group 1.2: Brazil

We can't merge both groups, since that would generate a chunk with more characters than chunk_size, so we grab the next divider_character, the empty string.

In this case, both groups already have fewer characters than chunk_size. So, since we're out of divider_characters, we can keep Group 1.1 and Group 1.2 as valid chunks.

Chunks so far: [Beloved, Brazil]

2.2 - Analyzing Group 2:

The length of this group is larger than chunk_size, so we can't consider it a valid chunk. Then, we split it with the next divider_character: \n

Group 1: Roses are red

Group 2: Violets are blue

Group 3: Brazil is beautiful

Group 4: How are you ?

Here, as we can see, all groups have more than chunk_size characters. So, we split them up again using our next divider_character, the empty space. Starting with Group 1 (Roses are red):

Group 1.1: Roses

Group 1.2: are

Group 1.3: red

Now, we can try to concatenate Group 1.1 and Group 1.2 - it works, since their combined length is less than or equal to chunk_size. But if we try to merge Group 1.3 as well, the length of the group will be greater than chunk_size. So, we can consider Roses are and red as valid chunks.

Chunks so far: [Beloved, Brazil, Roses are, red]

I don't want to make this too long, but I think you already get it. We just need to apply the same idea to the other groups, and our final result will be:

Chunks:

Beloved, Brazil, Roses are, red, Violets, are blue, Brazil is, beautiful, How are, you ?
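Running the recursive_split sketch from earlier on this exact input reproduces the hand trace (the real LangChain splitter also strips whitespace while merging, so its output can differ cosmetically):

```python
text = (
    "Beloved Brazil\n\n"
    "Roses are red\n"
    "Violets are blue\n"
    "Brazil is beautiful\n"
    "How are you ?"
)

print(recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=10))
# ['Beloved', 'Brazil', 'Roses are', 'red', 'Violets',
#  'are blue', 'Brazil is', 'beautiful', 'How are', 'you ?']
```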

With this in mind, we can generate chunks with different sizes. The default value for chunk_size used by the RecursiveCharacterTextSplitter is 1000.

It's also worth mentioning that we're limited by GPT and by the limit of tokens used in our question + its answer, so we can't have, for example, a huge chunk, e.g. the entire text. Also, we have a parameter called chunk_overlap, which basically overlaps the first n characters of the next chunk with the last n characters of the current chunk, where n is the chunk_overlap.
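For example (hypothetical values; the overlap here is measured in characters):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100,  # up to ~100 trailing chars of one chunk are
)                       # repeated at the start of the next chunk

chunks = splitter.split_text("some long document text... " * 100)
```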

With this, we can carry some of the context of the current chunk into the next chunk. Anyway, I'm still afraid of losing some context in the generated chunks. If a chunk is too small, it can definitely lose context. But we can't predict how big it should be, since every text will be unique.

Given that, I propose that we start trying to make the chunks the size of one page. This way, they won't be too long, because we're not considering the whole document, and not too small, since we'll have at least one page of content to analyze. My idea is to send this page-chunk to Pinecone, and to generate the learning-objectives from it in isolation, instead of having to analyze all chunks.
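One way to get page-sized chunks, assuming the source documents are PDFs: LangChain's PyPDFLoader (which requires the pypdf package) yields one Document per page, so each page can be embedded and sent to Pinecone on its own. The file path is a placeholder:

```python
from langchain.document_loaders import PyPDFLoader

# load() returns one Document per PDF page -- one "chunk" per page.
pages = PyPDFLoader("course-material.pdf").load()

for page in pages:
    print(page.metadata["page"], len(page.page_content))
```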

I think we can do that in Pinecone, but I haven't tried it yet, though. If it doesn't work, I'll come up with something else.

Thoughts?

The Stakeholder [6:12 AM]:

Nice work!

Yes, agreed. Let's do one topic per page, then distill those topics down to [user defined number], then write a learning objective for each topic

and have a little overlap between chunks to allow for context creep
