Donavan/segmentation-101_part-1.md

Segmentation 101, part 1: Why your strategy matters

I recently did some more exploring with a local LLM tool that imports your documents into a vector store. Given the promising initial results with a handful of docs, I wanted to see how it handled more, and different, data. I decided to copy over the text files of Expanse trivia questions and answers that I use as a regression suite to test my own "Q&A over documents" process. I wanted to see what types of questions it could answer from that content...
The Problem With Generic Segmentation

The tool used double newlines as its segmentation boundary. That strategy works well for many types of content, but it was a terrible choice for this content, because the text in the files is formatted as numbered questions followed by their answers, like this:
1. Long winded question with establishing context
 
Answer: Short answer
 
Long form answer with lots of context.

Second paragraph of long form answer
2. Long winded question with establishing context
 
Answer: Short answer
 
Long form answer with lots of context.
3. you get the pattern
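
To make the failure concrete, here is a minimal sketch of a blank-line splitter applied to text in this shape. It assumes the tool splits on literal double newlines; I have not seen its actual implementation, and `SAMPLE` and `split_on_blank_lines` are hypothetical names used only for illustration.

```python
# Hypothetical sample in the same shape as the trivia files.
SAMPLE = (
    "1. Long winded question with establishing context\n\n"
    "Answer: Short answer\n\n"
    "Long form answer with lots of context.\n\n"
    "Second paragraph of long form answer\n\n"
    "2. Long winded question with establishing context\n\n"
    "Answer: Short answer\n\n"
    "Long form answer with lots of context.\n"
)


def split_on_blank_lines(text: str) -> list[str]:
    """Naive segmentation: every blank line starts a new segment."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


segments = split_on_blank_lines(SAMPLE)
for index, segment in enumerate(segments):
    # Each question, short answer, and paragraph prints as its own segment.
    print(index, repr(segment))
```

Two questions turn into seven segments, and a segment such as "Answer: Short answer" no longer carries the question it belongs to.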

Asking an open-ended question

I asked a general question, "Tell me about Havelock", as I knew the documents contained several interesting bits of context about him. An open-ended question like this should trigger the most hits for relevant documents, as the retriever will be looking for any content about Havelock.
The context ended up being presented to the model like this:

Even after the initial riots on Ceres had calmed down, the station was tense. Havelock, the Earther, spent most of his days inside the police station processing paperwork, away from the Belter crowds outside. He also started job hunting, and ended up getting a position with Protogen's private security devision.
Havelock was from Earth, but his boss, partner, and the people he was charged to protect were all Belters. Shaddid did not care for anyone from the inner planets, and it did not help that Star Helix was an Earth-owned company. She gave Havelock the worst assignments, did not offer any opportunity for career advancement, and partnered him with the detective who nobody else wanted to work with. Havelock was a good, hard-working detective, but had never been systemically disenfranchised before.
Answer: Havelock escaped with Naomi.

As a result the model responded with this answer:

Havelock is a detective in the Protogen Security Division. He works hard to solve cases and bring criminals to justice. However, he has been assigned some difficult cases by his boss, Detective Shaddid, who does not seem to appreciate his efforts.
Naomi is a Belter who works as a mechanic at the Ceres Station. She is trying to make ends meet while her husband is away on a long-haul mission. When Havelock rescues her from a dangerous situation, she becomes determined to help him find the person who attacked her.

The answer contains a bit of information about Havelock, then hallucinates a story about Naomi because she was mentioned at the end without any context. Even without the hallucination, the answer isn't as good as it could be, because it lacks the additional context that a better segmentation strategy would have provided.
Rule #1: Preserve as much context as possible.

The more reasoning the model needs to do to provide an answer, the more important context becomes. If you asked "Why is project X late?", which answer would you rather receive? This strictly factual response:

Project X is late because the team has missed several important deadlines and fallen behind on delivering key aspects of the project

Or this one that was generated with the benefit of additional context?

Project X is late due to a combination of factors. While it's true that
the team has missed several deadlines, these delays were largely caused
by issues outside of their control. A crucial external vendor faced
complications that pushed back their delivery dates, which in turn has
delayed the overall project.

The difference between a strictly true answer and an actually true answer in this example comes from using larger segments that provide additional context.
Rule #2: Don't mix context

Whenever possible, you want each segment to stay focused on a specific topic. Extra context is great for reasoning; extra unrelated context is not. So it's helpful to use features of the text to determine segmentation boundaries. For example, when loading a Word document, you might start a new segment each time you encounter a heading, as headings tend to indicate new topics.
Note: That doesn't mean you cram all of the text under that heading into a single segment, rather that you don't want to include content from two different headings in the same segment.
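
As a rough sketch of heading-based boundaries, the snippet below starts a new segment whenever it sees a heading line. It assumes the document has already been converted to text with Markdown-style `#` headings; the `HEADING` pattern, `split_on_headings`, and the sample `DOC` are all hypothetical, and a real Word loader would detect headings differently.

```python
import re

# Assumption: headings are Markdown-style lines starting with 1-6 '#' characters.
HEADING = re.compile(r"^#{1,6}\s+\S")


def split_on_headings(text: str) -> list[str]:
    """Start a new segment at every heading so topics are never mixed."""
    segments: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        # A heading closes the segment collected so far.
        if HEADING.match(line) and current:
            segments.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        segments.append("\n".join(current).strip())
    return [segment for segment in segments if segment]


# Hypothetical two-topic document.
DOC = """# Billing
Invoices go out monthly.

# Support
Tickets are answered within a day.
"""

sections = split_on_headings(DOC)
for section in sections:
    print(repr(section))
```

A long section can still be split further by a token limit afterwards. The only thing this pass guarantees is that no segment spans two headings.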
A strategy for trivia questions

The strategy I used when I segmented and indexed this content myself was to keep each question and its answer together and pretty much ignore token limits on the segments. I knew most of the segments would be in the region of 200 tokens, that everything contained in a question and its answer was somehow related, and that this format was fairly information dense. Having an occasional outsized segment was a small price to pay to preserve context.
Using this strategy, the context supplied to the model would have been about 3x the size (626 tokens), yet still small enough to fit into the context window of 3.5-turbo-4k if we wanted. By comparison, all three segments provided to the model in the original Havelock test added up to roughly 200 tokens of context. 200 tokens is about the minimum size for a single segment if you want quality responses to questions that have nuance.
Note: Bigger isn't always better and no one size fits all content or all questions.
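
One way to keep each question with its answer is to cut only where a new numbered question begins. This is a sketch under assumptions, not the exact code I used: it assumes every question starts a line with a number, a period, and a space (as in the format shown earlier), and `QUESTION_START`, `split_trivia`, and `TRIVIA` are hypothetical names.

```python
import re

# Assumption: each question starts a line with a number, a period, and a space.
QUESTION_START = re.compile(r"^\d+\.\s", re.MULTILINE)


def split_trivia(text: str) -> list[str]:
    """Return one segment per numbered question, including its full answer."""
    starts = [match.start() for match in QUESTION_START.finditer(text)]
    if not starts:
        return [text.strip()] if text.strip() else []
    segments: list[str] = []
    # Keep any text before the first question as its own segment.
    preamble = text[: starts[0]].strip()
    if preamble:
        segments.append(preamble)
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        segments.append(text[begin:end].strip())
    return segments


# Hypothetical sample in the same shape as the trivia files.
TRIVIA = (
    "1. First question?\n\n"
    "Answer: Short answer\n\n"
    "Long form answer.\n\n"
    "Second paragraph.\n\n"
    "2. Second question?\n\n"
    "Answer: Another answer\n\n"
    "More context.\n"
)

items = split_trivia(TRIVIA)
for item in items:
    print(repr(item))
```

One caveat with this pattern: an answer paragraph that happens to begin with something like "3. " would be mistaken for a new question, so the boundary rule needs to match how your files are actually numbered.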

Improved context example

The information below represents what the model would have been given following my strategy:

7 Miller's partner, Havelock, was starting to feel unsafe on Ceres station. Where did he end up getting a new job?
Answer: Protogen
Even after the initial riots on Ceres had calmed down, the station was tense. Havelock, the Earther, spent most of his days inside the police station processing paperwork, away from the Belter crowds outside. He also started job hunting, and ended up getting a position with Protogen's private security devision.
It paid well and was really high tech, but also a bit intimidating. This move would end up being both beneficial and dangerous.
1 Josephus Miller and Dimitri Havelock were detectives on Ceres station. Why did Captain Shaddid, their boss, dislike Havelock?
Answer: Havelock was an Earther.
People who grew up in low gravity were often taller and slimmer than those who grew up "down the well". When a shorter, stockier Earther went to a place like Ceres station, they stood out, and often suffered both quiet and overt racism. A lot of Earthers were also racist towards Belters, but Belters could not easily travel down to Earth's surface.
Havelock was from Earth, but his boss, partner, and the people he was charged to protect were all Belters. Shaddid did not care for anyone from the inner planets, and it did not help that Star Helix was an Earth-owned company. She gave Havelock the worst assignments, did not offer any opportunity for career advancement, and partnered him with the detective who nobody else wanted to work with. Havelock was a good, hard-working detective, but had never been systemically disenfranchised before.
9 During the rescue attempt to get Naomi Nagata off the Edward Israel and back to the Rocinante, what happened to her jailer, Dimitri Havelock?
Answer: Havelock escaped with Naomi.
For most of the novel, Havelock trusted Murtry to prioritize the safety of RCE employees above everything else. His trust was slowly eroded as the situation on New Terra deteriorated. Murtry claimed to be interested in protecting RCE employees and property, but he was actually more interested in establishing RCE's claim to the planet over the original colonists. Murtry even explicitly asked for a shelter with RCE markings be shipped down after the storm hit, so when the next wave of humans showed up after the first ones had all died, the only structure on the planet would support the RCE claim.
When Basia Merton showed up to rescue Naomi, Havelock finally made his choice. He chose the side that wanted to try to save the humans on New Terra, rather than erect a headstone. He was Naomi's guard, so he was able to get her out past the militia before they even knew he had changed sides. He went with Naomi and Basia back to the Rocinante.

Pro-tip: Save your segment size in tokens as metadata for your segments

It's important to know how many segments you can provide to the model and still fit inside the context window. Even when you're using a token limit to determine segment size, that limit only defines the upper bound of a segment. If you're using a strategy where you start a new segment when encountering a heading, your segments will quite often be smaller than the limit. Likewise, with a strategy like the one I used for the trivia questions, some segments will be over the token limit.
By storing the actual size of each segment, you can do the math and determine how many of the relevant segments fit into the context. You may actually be able to fit more, or you may need to drop one in order to fit the window.
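
The math can be sketched like this. Everything here is illustrative: `count_tokens` is a stand-in that counts whitespace-separated words (an assumption, so swap in the tokenizer that matches your model), and `Segment`, `make_segment`, and `fit_to_budget` are hypothetical names rather than part of any library.

```python
from dataclasses import dataclass


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


@dataclass
class Segment:
    text: str
    token_count: int  # computed once at indexing time and stored as metadata


def make_segment(text: str) -> Segment:
    return Segment(text=text, token_count=count_tokens(text))


def fit_to_budget(ranked: list[Segment], budget: int) -> list[Segment]:
    """Keep segments in relevance order, stopping at the first that does not fit."""
    chosen: list[Segment] = []
    used = 0
    for segment in ranked:
        if used + segment.token_count > budget:
            break
        chosen.append(segment)
        used += segment.token_count
    return chosen


# Hypothetical retriever output, already sorted by relevance.
ranked = [make_segment("a b c"), make_segment("d e f g"), make_segment("h i")]
print(len(fit_to_budget(ranked, budget=7)))
print(len(fit_to_budget(ranked, budget=6)))
```

Stopping at the first segment that does not fit keeps the selection in strict relevance order. Skipping the oversized segment and continuing with smaller ones is a reasonable alternative if you would rather fill the window.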
Better answers with context

When provided a richer context the model can produce a response like this:

Dimitri Havelock was initially a detective on Ceres Station, where he was partnered with Josephus Miller. He was originally from Earth, a fact that made him stand out amongst the majority Belters on the station. His physical difference and origins led to him being subject to both overt and subtle racism. His boss, Captain Shaddid, harbored a dislike for anyone from the inner planets and thus, she would frequently give Havelock the worst assignments and block his career progression.
Despite the adverse conditions, Havelock was a hard-working and good detective who was not used to being systemically disenfranchised. Nonetheless, Havelock began feeling unsafe on Ceres station. Eventually, he ended up securing a position with Protogen's private security, a move that brought both benefits and danger.
His role would further change during the events of the rescue mission to free Naomi Nagata from the Edward Israel; Havelock held a position that allowed him to guard her. His trust in Murtry's intentions had wavered over time, and when Basia Merton arrived to rescue Naomi, Havelock decided to defect. He successfully got Naomi past the militia and accompanied her and Basia back to the Rocinante.

This answer, pulled from the same corpus, is much richer, provides far more detail, and allows the model to understand the interaction between Havelock and Naomi.
Parting thoughts

This post is FAR from all there is to say on the topic of segmentation. In future posts we'll talk more about the topic as we go over how "Q&A over documentation" works by leveraging a vector store.