This is a HexaPDF example for parsing content streams and working with the text parts.
HexaPDF provides the class HexaPDF::Content::Processor for processing the operators of content streams. By subclassing it, we can define custom behavior for each operator. This could, for instance, be used to render the contents of a page.
However, in this case we want to show how text can be handled. Since the text inside a content stream is encoded, we need to decode it before we can use it as a UTF-8 string. For this purpose HexaPDF provides two helper methods, #decode_text and #decode_text_with_positioning.
The first one just decodes the text and returns it as a string. This is useful when one just wants to get basic information out of a PDF. The second one, however, returns the text together with positioning information. This could be used, for example, to correctly show the text parts of a PDF page on the console or to convert a PDF into a text file with correct text runs.
The example uses the second method to draw red boxes around each character and green boxes around each consecutive run of characters. Note that since transformations could have been applied to the characters, the bounding box for a character is not necessarily a rectangle but a parallelogram. However, in most cases using a rectangle will suffice and is much faster since fewer PDF content operators need to be generated for the boxes.
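To see why a transformed character box becomes a parallelogram, the following plain Ruby sketch applies a PDF transformation matrix to the four corners of an upright box and then computes an enclosing axis-aligned rectangle. It does not use HexaPDF; the helper names and the sample matrix are hypothetical, and the enclosing rectangle is only one possible approximation, not necessarily the one the example script uses.

```ruby
# Applies a PDF transformation matrix [a b c d e f] to the point (x, y).
def transform(matrix, x, y)
  a, b, c, d, e, f = matrix
  [a * x + c * y + e, b * x + d * y + f]
end

# Transformed corners of an upright box of the given size, starting at
# the lower-left corner and going counter-clockwise.
def glyph_corners(matrix, width, height)
  [[0, 0], [width, 0], [width, height], [0, height]].map do |x, y|
    transform(matrix, x, y)
  end
end

# Smallest axis-aligned rectangle [x, y, width, height] that encloses
# all given corners.
def enclosing_rectangle(corners)
  xs, ys = corners.transpose
  [xs.min, ys.min, xs.max - xs.min, ys.max - ys.min]
end

# Assumed sample data: a box 6 units wide and 10 units high, slanted by
# a horizontal shear and moved to (100, 200).
shear = [1, 0, 0.5, 1, 100, 200]
corners = glyph_corners(shear, 6, 10)
rect = enclosing_rectangle(corners)

p corners # => [[100, 200], [106, 200], [111.0, 210], [105.0, 210]]
p rect    # => [100, 200, 11.0, 10]
```

Stroking the parallelogram needs one path segment per corner plus a close operation, whereas the rectangle can be written with a single rectangle operator, which is in line with the smaller output mentioned above.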
The test file used was the LibreOffice Manual downloaded from https://www.libreoffice.org/get-help/documentation/ with 388 pages and about 14.1MiB. It was used as the argument to the show_boxes.rb file to generate a PDF with annotated characters.
When using parallelograms for showing the boxes, parsing and creating the PDF takes about 38 seconds and the resulting file is about 24.8MiB in size. Using rectangles reduces the time to 28 seconds and the file size to about 18.7MiB.
A comparable Java solution to this problem takes about 7 seconds when using rectangles (in fact, the Java solution can't use anything else).