I recently did some more exploring with a local LLM tool that imports your documents into a vector store. Given the promising initial results with a handful of docs, I wanted to see how it handled more and different data. I decided to copy over the text files containing Expanse trivia and answers that I use as a regression suite to test my own "Q&A over documents" process. I wanted to see what types of questions it could answer from that content...
This tool uses double newlines as its segmentation boundary. That strategy works well for many types of content, but for this content it was a terrible choice, because the text in these files is formatted as numbered questions followed by their answers, like this:
1. Long winded question with establishing context