@bjoerntx
Created February 23, 2024 13:38
// split a PDF document into chunks
public static List<string> Chunk(byte[] pdfDocument, int chunkSize, int overlap = 1)
{
    // create a new ServerTextControl instance
    using (TXTextControl.ServerTextControl tx = new TXTextControl.ServerTextControl())
    {
        tx.Create();

        var loadSettings = new TXTextControl.LoadSettings
        {
            PDFImportSettings = TXTextControl.PDFImportSettings.GenerateParagraphs
        };

        // load the PDF document
        tx.Load(pdfDocument, TXTextControl.BinaryStreamType.AdobePDF, loadSettings);

        // remove line breaks
        string pdfText = tx.Text.Replace("\r\n", " ");

        // call the extracted chunk creation method
        return CreateChunks(pdfText, chunkSize, overlap);
    }
}
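
// NOTE: CreateChunks is called above but is not part of this snippet.
// What follows is only a minimal sketch of what such a helper might look like.
// It is an assumption, not the author's implementation: it treats chunkSize
// and overlap as word counts, so each chunk holds up to chunkSize words and
// shares its last `overlap` words with the start of the next chunk.
// Assumes `using System;` and `using System.Collections.Generic;`.
private static List<string> CreateChunks(string text, int chunkSize, int overlap)
{
    if (chunkSize <= 0)
        throw new ArgumentOutOfRangeException(nameof(chunkSize));

    // the overlap must be smaller than the chunk, otherwise the window never advances
    if (overlap < 0 || overlap >= chunkSize)
        throw new ArgumentOutOfRangeException(nameof(overlap));

    // split on spaces and drop the empty entries left by consecutive spaces
    string[] words = text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    var chunks = new List<string>();
    int step = chunkSize - overlap;

    for (int start = 0; start < words.Length; start += step)
    {
        int count = Math.Min(chunkSize, words.Length - start);
        chunks.Add(string.Join(" ", words, start, count));

        // stop once the current chunk has reached the end of the text
        if (start + count >= words.Length)
            break;
    }

    return chunks;
}

// Under these assumptions, CreateChunks("a b c d e", 3, 1) would produce
// the chunks "a b c" and "c d e".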