gettalong/README.md

## README.md

      
This is a HexaPDF example that parses content streams and works with the text they contain.
HexaPDF provides the class HexaPDF::Content::Processor for processing the operators of content streams. By subclassing it, we can define custom behavior for each operator. This could, for instance, be used to render the contents of a page.
However, in this case we want to show how text can be handled. Since the text inside a content stream is encoded, we need to decode it before we can use it as a UTF-8 string. For this, HexaPDF provides two helper methods, #decode_text and #decode_text_with_positioning.
The first one just decodes the text and returns it as a string. This is useful when one only wants to get basic information out of a PDF. The second one returns the text together with positioning information. This could be used, for example, to correctly show the text parts of a PDF page on the console or to convert a PDF into a text file with correct text runs.
The example uses the second method to draw a red box around each character and a green box around each consecutive run of characters. Note that since transformations could have been applied to the characters, the bounding box of a character is in general a parallelogram rather than a rectangle. However, in most cases using a rectangle will suffice, and it is much faster since fewer PDF content operators need to be generated for the boxes.
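
The difference between the two shapes can be sketched in plain Ruby, without HexaPDF. The helper names and the 6 by 10 glyph box used below are illustrative only. The operator lists assume that a rectangle is written with the single PDF path operator re, while a closed four-point polyline needs m, three l operators and h before the stroke operator S.

```ruby
# Apply a PDF transformation matrix [a b c d e f] to the point (x, y).
def transform(matrix, x, y)
  a, b, c, d, e, f = matrix
  [a * x + c * y + e, b * x + d * y + f]
end

# Corners of a width x height glyph box after transformation, in the order
# lower-left, lower-right, upper-right, upper-left.
def corners(matrix, width, height)
  [[0, 0], [width, 0], [width, height], [0, height]].map do |x, y|
    transform(matrix, x, y)
  end
end

# Rectangle as the script draws it: spanned by the lower-left and upper-right
# corners only. Returns [x, y, width, height].
def rectangle_from_diagonal(points)
  (x, y), _, (tx, ty), _ = points
  [x, y, tx - x, ty - y]
end

# Path operators needed to stroke one box of each kind (assumed mapping).
RECTANGLE_OPERATORS = %w[re S].freeze
PARALLELOGRAM_OPERATORS = %w[m l l l h S].freeze

# A purely translated glyph: rectangle and parallelogram coincide.
upright = corners([1, 0, 0, 1, 10, 20], 6, 10)
p upright, rectangle_from_diagonal(upright)

# A glyph skewed to the right: the rectangle is only an approximation.
skewed = corners([1, 0, 0.5, 1, 0, 0], 6, 10)
p skewed, rectangle_from_diagonal(skewed)
```

For the skewed glyph the rectangle ends up 11 units wide although the glyph box itself is 6 units wide, which is the inaccuracy accepted in exchange for writing two operators per box instead of six.
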
The test file was the LibreOffice Manual (388 pages, about 14.1MiB), downloaded from https://www.libreoffice.org/get-help/documentation/. It was passed as the argument to show_boxes.rb to generate a PDF with annotated characters.
When using parallelograms for showing the boxes, parsing and creating the PDF takes about 38 seconds and the resulting file is about 24.8MiB in size. Using rectangles reduces the time to 28 seconds and the file size to about 18.7MiB.
A comparable Java solution to this problem takes about 7 seconds when using rectangles (in fact, the Java solution can't use anything else).
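
To put these measurements in relation to each other, the following small calculation derives the relative savings and the page throughput from the figures quoted above. The constant and method names are illustrative; the numbers are the ones reported in this README.

```ruby
PAGES = 388

# Wall-clock seconds and output size in MiB as reported above.
MEASUREMENTS = {
  parallelogram: { seconds: 38, mib: 24.8 },
  rectangle: { seconds: 28, mib: 18.7 },
}.freeze
JAVA_SECONDS = 7

# Relative reduction from before to after, in percent with one decimal.
def percent_saved(before, after)
  ((before - after) / before.to_f * 100).round(1)
end

# Pages processed per second, with one decimal.
def pages_per_second(seconds)
  (PAGES / seconds.to_f).round(1)
end

para, rect = MEASUREMENTS.values_at(:parallelogram, :rectangle)
puts "Time saved by rectangles: #{percent_saved(para[:seconds], rect[:seconds])}%"
puts "Size saved by rectangles: #{percent_saved(para[:mib], rect[:mib])}%"
puts "Pages/s (parallelograms): #{pages_per_second(para[:seconds])}"
puts "Pages/s (rectangles):     #{pages_per_second(rect[:seconds])}"
puts "Pages/s (Java):           #{pages_per_second(JAVA_SECONDS)}"
```

Rectangles thus save roughly a quarter of both the run time and the output size, and the Java solution is about four times as fast as the rectangle variant.
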

  
## show_boxes.rb
require 'hexapdf/document'

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page)
    super()
    @canvas = page.canvas(type: :overlay)
  end

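  # Called for the Tj operator and, through the alias below, for TJ.
  def show_text(str)
    # Decode the text into glyph boxes that carry positioning information.
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?

    # Red box around each character.
    @canvas.line_width = 1
    @canvas.stroke_color(224, 0, 0)
    # Exact parallelogram variant (slower, produces a larger file):
    #boxes.each {|box| @canvas.polyline(*box.points).close_subpath.stroke}
    boxes.each do |box|
      x, y = *box.lower_left
      tx, ty = *box.upper_right
      @canvas.rectangle(x, y, tx-x, ty-y).stroke
    end

    # Green box around the whole run of characters.
    @canvas.line_width = 0.5
    @canvas.stroke_color(0, 224, 0)
    @canvas.polyline(*boxes.lower_left, *boxes.lower_right, *boxes.upper_right, *boxes.upper_left).close_subpath.stroke
  end
  alias :show_text_with_positioning :show_text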

end

doc = HexaPDF::Document.open(ARGV.shift)
doc.pages.each_page.with_index do |page, index|
  puts "Processing page #{index + 1}"
  processor = ShowTextProcessor.new(page)
  page.process_contents(processor)
end
doc.write('test-hexa.pdf')

## show_boxes_rectangle.pdf

      
(Binary PDF file, not displayed here.)