@gettalong
Last active June 28, 2016 06:05
HexaPDF show_boxes.rb example

This is a HexaPDF example for parsing content streams and working with the text parts.

HexaPDF provides the class HexaPDF::Content::Processor for processing the operators of content streams. By subclassing it, we can define custom behavior for each operator. This could, for instance, be used to render the contents of a page.

However, in this case we want to show how text can be handled. Since the text inside a content stream is encoded, we need to decode it before we can use it as a UTF-8 string. For this, HexaPDF provides two helper methods: #decode_text and #decode_text_with_positioning.

The first one just decodes and returns the text itself as string. This is useful when one just wants to get basic information out of a PDF. The second one, however, returns the text together with positioning information. This could be used, for example, to correctly show the text parts of a PDF page on the console or to convert a PDF into a text file with correct text runs.
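To illustrate why positioning information matters for producing correct text runs, here is a minimal sketch that groups positioned glyphs into lines. It does not use HexaPDF: the `Glyph` struct, the `glyphs_to_lines` method and the `tolerance` parameter are hypothetical names invented for this illustration, and the approach (grouping by similar baseline) is only one simple way to do it.

```ruby
# Hypothetical stand-in for a positioned glyph; HexaPDF's own box objects are
# deliberately not used so that the sketch runs without the library.
Glyph = Struct.new(:char, :x, :y)

# Groups glyphs into lines by their baseline (y coordinate) and orders each
# line from left to right. Glyphs whose baselines differ by at most
# +tolerance+ from the first glyph of a line are put on that line.
def glyphs_to_lines(glyphs, tolerance: 2.0)
  lines = []
  # PDF coordinates grow upwards, so a larger y means further up the page.
  glyphs.sort_by {|g| [-g.y, g.x]}.each do |glyph|
    line = lines.find {|l| (l.first.y - glyph.y).abs <= tolerance}
    if line
      line << glyph
    else
      lines << [glyph]
    end
  end
  lines.map {|l| l.sort_by(&:x).map(&:char).join}
end

# The glyphs arrive in content stream order, which need not be reading order.
glyphs = [
  Glyph.new('i', 16, 700), Glyph.new('H', 10, 700),
  Glyph.new('k', 16, 685.5), Glyph.new('O', 10, 686),
]
puts glyphs_to_lines(glyphs)
```

With only the decoded strings, the order in which text appears in the content stream would be all there is to go on; the coordinates are what make it possible to restore the reading order.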

The example uses the second method to draw red boxes around each character and green boxes around each consecutive run of characters. Note that since transformations could have been applied to the characters, the bounding box for a character is not a rectangle but a parallelogram. However, in most cases using a rectangle will suffice and is much faster since fewer PDF content operators need to be generated for the boxes.
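The relation between the parallelogram and a rectangle can be sketched in plain Ruby. The `bounding_rectangle` method below is a hypothetical helper, not part of HexaPDF; it computes the smallest axis-aligned rectangle around four corner points. In terms of PDF path operators, a rectangle needs a single `re` operator, whereas a closed four-point path needs one `m`, three `l` and one `h`.

```ruby
# Returns [x, y, width, height] of the smallest axis-aligned rectangle that
# contains all given corner points (each point is an [x, y] pair).
def bounding_rectangle(points)
  xs, ys = points.transpose
  [xs.min, ys.min, xs.max - xs.min, ys.max - ys.min]
end

# An untransformed glyph box: the parallelogram already is a rectangle.
upright = [[10, 20], [16, 20], [16, 30], [10, 30]]
# A sheared glyph box: the top edge is shifted 2 units to the right.
sheared = [[10, 20], [16, 20], [18, 30], [12, 30]]

p bounding_rectangle(upright)  # => [10, 20, 6, 10]
p bounding_rectangle(sheared)  # => [10, 20, 8, 10]
```

For the upright box nothing is lost by drawing a rectangle. For the sheared box the rectangle is wider than the glyph's actual outline, which is the loss in precision that is traded for speed and file size.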

The test file used was the LibreOffice Manual downloaded from https://www.libreoffice.org/get-help/documentation/, which has 388 pages and is about 14.1MiB in size. It was passed as the argument to the show_boxes.rb file to generate a PDF with annotated characters.

When using parallelograms for showing the boxes, parsing and creating the PDF takes about 38 seconds and the resulting file is about 24.8MiB in size. Using rectangles reduces the time to 28 seconds and the file size to about 18.7MiB.
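Expressed as relative savings, the figures above work out as follows; `percent_saved` is just a throwaway helper for the arithmetic.

```ruby
# Relative reduction from +before+ to +after+ in percent, rounded to one
# decimal place.
def percent_saved(before, after)
  ((before - after) / before.to_f * 100).round(1)
end

puts percent_saved(38, 28)      # time in seconds      => 26.3
puts percent_saved(24.8, 18.7)  # file size in MiB     => 24.6
```

So switching from parallelograms to rectangles saves roughly a quarter of both the run time and the output size for this test file.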

A comparable Java solution to this problem takes about 7 seconds when using rectangles (in fact, the Java solution can't use anything else).

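require 'hexapdf/document'

# Draws boxes around the text shown on a page by overriding the text showing
# callbacks of the content stream processor.
class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page)
    super()
    # Draw on an overlay so that the boxes appear on top of the page contents.
    @canvas = page.canvas(type: :overlay)
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?

    # Red box around each character.
    @canvas.line_width = 1
    @canvas.stroke_color(224, 0, 0)
    # Parallelogram variant (exact, but slower and produces a larger file):
    #boxes.each {|box| @canvas.polyline(*box.points).close_subpath.stroke}
    boxes.each do |box|
      x, y = *box.lower_left
      tx, ty = *box.upper_right
      @canvas.rectangle(x, y, tx - x, ty - y).stroke
    end

    # Green box around the whole run of characters.
    @canvas.line_width = 0.5
    @canvas.stroke_color(0, 224, 0)
    @canvas.polyline(*boxes.lower_left, *boxes.lower_right,
                     *boxes.upper_right, *boxes.upper_left).close_subpath.stroke
  end
  # Text shown with individual glyph positioning is handled in the same way.
  alias :show_text_with_positioning :show_text

end

# The input PDF is given as the first command line argument.
doc = HexaPDF::Document.open(ARGV.shift)
doc.pages.each_page.with_index do |page, index|
  puts "Processing page #{index + 1}"
  processor = ShowTextProcessor.new(page)
  page.process_contents(processor)
end
doc.write('test-hexa.pdf')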