@gettalong
Last active December 31, 2020 12:39
HexaPDF Performance Comparison

A short and very unscientific comparison of the performance of HexaPDF with other PDF utilities when reading, optionally optimizing and then writing a file.

When available, multiple compression modes are compared:

  • No indicator - no compression done
  • C - Compacting by removing unused and deleted objects
  • S - Usage of object and cross-reference streams
  • P - Recompression of page content streams

For the HexaPDF tests, the hexapdf binary was used with different options for the optimization command:

hexapdf -f optimize --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts 
hexapdf -f optimize --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts 
hexapdf -f optimize ${OUT_FILE}
hexapdf -f optimize --compress-pages ${OUT_FILE}

The Ruby origami PDF library was used like this:

require 'origami'
Origami::OPTIONS[:enable_type_propagation] = false
pdf = Origami::PDF.read(ARGV.shift)
pdf.save(ARGV.shift, :noindent => true, :use_xrefstm => true, :use_xreftable => false, :obfuscate => false)

The Ruby combine_pdf library was used like this (no file optimization is done by this library):

require 'combine_pdf'
CombinePDF.load(ARGV.shift).save(ARGV.shift)

The other tools (all compiled binaries) were run like this:

  • pdftk IN.pdf output OUT.pdf (note that no object stream generation is performed)
  • qpdf IN.pdf OUT.pdf
  • qpdf --object-streams=generate IN.pdf OUT.pdf
  • smpdf IN.pdf -o OUT.pdf (this binary is purpose-built for PDF file size compression)

Six different input files were used:

  • a.pdf: 53,129 bytes, 36 indirect objects, 4 pages, created by Prawn
  • b.pdf: 11,520,218 bytes, 4,161 indirect objects, 439 pages, many non-stream objects
  • c.pdf: 14,399,980 bytes, 5,263 indirect objects, 620 pages, linearized, many streams
  • d.pdf: 8,107,348 bytes, 34,513 indirect objects, 20 pages
  • e.pdf: 21,788,087 bytes, 2,296 indirect objects, 52 pages, huge content streams, many pictures, object streams, encrypted with default password
  • f.pdf: 154,752,614 bytes, 287,977 indirect objects, 28,365 pages, very big file

Testing was done with the attached script.sh.
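
script.sh measures wall-clock time with date and peak memory with GNU time. As a rough Ruby counterpart for the timing part only, the sketch below runs a command, reports the elapsed milliseconds and gives up after a time limit. The timed_run name and the 120-second default are assumptions made for illustration; the original script has no built-in limit and does not measure time this way.

```ruby
require 'rbconfig'
require 'timeout'

# Hypothetical helper, not part of script.sh: runs +command+ and returns the
# elapsed wall-clock time in milliseconds, or nil if the command exits with a
# non-zero status or runs longer than +limit+ seconds.
def timed_run(*command, limit: 120)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  pid = Process.spawn(*command, out: File::NULL, err: File::NULL)
  begin
    status = Timeout.timeout(limit) { Process.wait2(pid).last }
  rescue Timeout::Error
    Process.kill('KILL', pid) # give up on the run
    Process.wait(pid)         # reap the killed child
    return nil
  end
  return nil unless status.success?
  ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000).round
end

# Example: time a trivial Ruby child process. A real benchmark run would pass
# something like 'qpdf', 'IN.pdf', 'OUT.pdf' instead.
elapsed = timed_run(RbConfig.ruby, '-e', 'sleep 0.1')
puts "elapsed: #{elapsed.inspect} ms"
```

Returning nil for both failures and timeouts mirrors how the tables report such runs as ERR rows without measurements.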

Individual results (2017-03-19, with the repository version of HexaPDF and Ruby 2.4.0p0):

|------------------------------------------------------------|
| a.pdf (53,056)       |     Time |     Memory |   File size |
|------------------------------------------------------------|
| hexapdf              |    198ms |  13,824KiB |      52,338 |
| hexapdf C            |    195ms |  14,040KiB |      52,315 |
| hexapdf CS           |    141ms |  14,352KiB |      49,180 |
| hexapdf CSP          |    196ms |  14,252KiB |      48,250 |
| origami              |    331ms |  24,056KiB |      52,312 |
| combine_pdf          |    126ms |  16,284KiB |      53,695 |
| pdftk C?             |    175ms |  53,892KiB |      53,144 |
| qpdf C               |     20ms |   4,640KiB |      53,179 |
| qpdf CS              |     17ms |   4,648KiB |      49,287 |
| smpdf CSP            |     36ms |   9,448KiB |      48,329 |
|------------------------------------------------------------|

|------------------------------------------------------------|
| b.pdf (11,520,218)   |     Time |     Memory |   File size |
|------------------------------------------------------------|
| hexapdf              |    952ms |  31,924KiB |  11,464,892 |
| hexapdf C            |  1,022ms |  30,748KiB |  11,414,828 |
| hexapdf CS           |  1,084ms |  30,920KiB |  11,053,330 |
| hexapdf CSP          |  8,176ms |  46,508KiB |  11,037,045 |
| origami              |  2,231ms |  84,284KiB |  11,479,697 |
| combine_pdf          | 17,444ms | 126,340KiB |  11,496,848 |
| pdftk C?             |    531ms |  69,200KiB |  11,501,669 |
| qpdf C               |    585ms |  11,796KiB |  11,500,308 |
| qpdf CS              |    701ms |  12,128KiB |  11,124,779 |
| smpdf CSP            |  3,411ms |  55,272KiB |  11,092,428 |
|------------------------------------------------------------|

|------------------------------------------------------------|
| c.pdf (14,399,980)   |     Time |     Memory |   File size |
|------------------------------------------------------------|
| hexapdf              |  1,890ms |  37,400KiB |  14,384,812 |
| hexapdf C            |  1,948ms |  37,548KiB |  14,349,167 |
| hexapdf CS           |  2,138ms |  43,516KiB |  13,182,415 |
| hexapdf CSP          |  9,142ms |  64,620KiB |  13,108,335 |
| origami              |  4,451ms | 144,032KiB |  14,338,614 |
| combine_pdf          |  3,355ms | 123,100KiB |  14,147,976 |
| pdftk C?             |  1,710ms | 104,716KiB |  14,439,611 |
| qpdf C               |  1,651ms |  35,132KiB |  14,432,647 |
| qpdf CS              |  1,972ms |  35,308KiB |  13,228,102 |
| smpdf CSP            |  3,099ms |  84,184KiB |  13,076,598 |
|------------------------------------------------------------|

|------------------------------------------------------------|
| d.pdf (8,107,348)    |     Time |     Memory |   File size |
|------------------------------------------------------------|
| hexapdf              |  5,122ms |  63,148KiB |   7,774,816 |
| hexapdf C            |  5,169ms |  59,568KiB |   7,036,578 |
| hexapdf CS           |  5,737ms |  60,628KiB |   6,530,348 |
| hexapdf CSP          |  6,040ms |  85,796KiB |   5,588,672 |
| origami              | 10,254ms | 132,104KiB |   7,499,298 |
| combine_pdf          |  5,291ms | 160,212KiB |   7,243,117 |
| pdftk C?             |  2,383ms | 105,048KiB |   7,279,035 |
| qpdf C               |  3,125ms |  40,700KiB |   7,209,305 |
| qpdf CS              |  3,223ms |  40,668KiB |   6,703,374 |
| smpdf CSP            |  3,234ms |  81,036KiB |   5,528,352 |
|------------------------------------------------------------|

|------------------------------------------------------------|
| e.pdf (21,788,087)   |     Time |     Memory |   File size |
|------------------------------------------------------------|
| hexapdf              |    905ms |  44,604KiB |  21,784,709 |
| hexapdf C            |  1,075ms |  96,776KiB |  21,850,676 |
| hexapdf CS           |  1,102ms | 100,716KiB |  21,768,926 |
| hexapdf CSP          | 36,121ms | 192,000KiB |  21,204,822 |
| origami              |  1,866ms | 143,000KiB |  21,800,148 |
| ERR combine_pdf      |      0ms |       0KiB |           0 |
| pdftk C?             |    811ms | 123,356KiB |  21,874,883 |
| qpdf C               |  1,062ms |  64,420KiB |  21,802,439 |
| qpdf CS              |  1,069ms |  64,716KiB |  21,787,558 |
| smpdf CSP            | 38,258ms | 653,760KiB |  21,188,516 |
|------------------------------------------------------------|

|------------------------------------------------------------|
| f.pdf (154,752,614)  |     Time |     Memory |   File size |
|------------------------------------------------------------|
| hexapdf              | 53,649ms | 482,320KiB | 154,077,468 |
| hexapdf C            | 57,479ms | 507,236KiB | 153,949,744 |
| hexapdf CS           | 62,726ms | 585,596KiB | 117,647,855 |
| ERR hexapdf CSP      |      0ms |       0KiB |           0 |
| ERR origami          |      0ms |       0KiB |           0 |
| ERR combine_pdf      |      0ms |       0KiB |           0 |
| pdftk C?             | 33,356ms | 673,620KiB | 157,850,354 |
| qpdf C               | 36,532ms | 485,120KiB | 157,723,936 |
| qpdf CS              | 40,771ms | 487,520KiB | 118,114,521 |
| ERR smpdf CSP        |      0ms |       0KiB |           0 |
|------------------------------------------------------------|
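
Because script.sh prints every table in the same fixed layout, the rows can be parsed mechanically when comparing tools. The following Ruby sketch ranks the successful runs of one table by output size, using a few rows of the b.pdf table as input. The Row struct and the parse_rows helper are illustrative names and belong neither to HexaPDF nor to the benchmark script.

```ruby
# Illustrative helper (not part of the benchmark): parse one result table
# as printed by script.sh and return the successful runs as Row structs.
Row = Struct.new(:tool, :time_ms, :memory_kib, :size_bytes)

def parse_rows(table)
  table.each_line.filter_map do |line|
    cells = line.split('|').map(&:strip).reject(&:empty?)
    # Keep only data rows: four cells whose time cell looks like "1,022ms".
    next unless cells.size == 4 && cells[1] =~ /\A[\d,]+ms\z/
    next if cells[0].start_with?('ERR') # failed or killed runs carry no data
    numbers = cells[1..3].map { |cell| cell.delete('^0-9').to_i }
    Row.new(cells[0], *numbers)
  end
end

# A few rows copied from the b.pdf table above.
B_PDF = <<~TABLE
  | b.pdf (11,520,218)   |     Time |     Memory |   File size |
  | hexapdf CS           |  1,084ms |  30,920KiB |  11,053,330 |
  | hexapdf CSP          |  8,176ms |  46,508KiB |  11,037,045 |
  | qpdf CS              |    701ms |  12,128KiB |  11,124,779 |
  | smpdf CSP            |  3,411ms |  55,272KiB |  11,092,428 |
TABLE

# Print the runs from smallest to largest output file.
parse_rows(B_PDF).sort_by(&:size_bytes).each do |row|
  puts format('%-12s %10d bytes %6d ms', row.tool, row.size_bytes, row.time_ms)
end
```

Sorting by :time_ms or :memory_kib instead gives the corresponding rankings for speed and memory use.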

Result summary:

  • HexaPDF produced the smallest PDF in three cases (a.pdf, b.pdf and f.pdf) and was second in the other three, where smpdf was the best compressor. HexaPDF in CSP mode and smpdf can be considered equal because the differences in file size are marginal.
  • When page compression is activated, HexaPDF is much slower when processing big content streams but this is expected.
  • Without page compression, HexaPDF is at most about 2.5x slower than pdftk, which was the fastest tool for most files (qpdf was faster for a.pdf and c.pdf). This is rather good considering that HexaPDF is written in Ruby while pdftk is a compiled binary.
  • Memory usage has been significantly reduced by not applying stream filters where possible. In the best possible case HexaPDF now just copies the data from the input IO directly to the output IO. This can be seen in the case of the non-compacting hexapdf script for file e.pdf, going from originally ~140MB to ~40MB!
  • HexaPDF uses less memory than the other Ruby-based solutions, often much less. The only exception is CSP mode on e.pdf, where it needs more than origami.
  • The a.pdf test case uses a very small file, so the initial startup time for the Ruby VM together with loading Rubygems is a big part of the overall runtime.
  • The f.pdf test case uses a very big file. Therefore HexaPDF in CSP mode, origami, combine_pdf and smpdf were killed after about 2 minutes since they would have taken a long time to finish.
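
The slowdown relative to pdftk can be checked against the result tables. The Ruby sketch below divides the slowest HexaPDF run without page compression by the pdftk time for each file. The times are copied from the tables; the constant names are only illustrative.

```ruby
# Times in milliseconds, copied from the result tables above.
# For HexaPDF the slowest of the three runs without page compression is used.
TIMES_MS = {
  'a.pdf' => { hexapdf: 198,    pdftk: 175 },
  'b.pdf' => { hexapdf: 1_084,  pdftk: 531 },
  'c.pdf' => { hexapdf: 2_138,  pdftk: 1_710 },
  'd.pdf' => { hexapdf: 5_737,  pdftk: 2_383 },
  'e.pdf' => { hexapdf: 1_102,  pdftk: 811 },
  'f.pdf' => { hexapdf: 62_726, pdftk: 33_356 }
}.freeze

# Slowdown factor of HexaPDF relative to pdftk, per input file.
SLOWDOWN = TIMES_MS.transform_values do |t|
  (t[:hexapdf].to_f / t[:pdftk]).round(2)
end

SLOWDOWN.each { |file, ratio| puts format('%s: %.2fx', file, ratio) }
worst_file, worst_ratio = SLOWDOWN.max_by { |_, ratio| ratio }
puts format('worst case: %s at %.2fx', worst_file, worst_ratio)
```

The largest factor is the one for d.pdf, at roughly 2.4x.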

Overall HexaPDF fares quite well in the benchmark in terms of speed and memory use!

You might also want to visit https://gettalong.org/blog/2016/hexapdf-performance-benchmark.html and https://gettalong.org/blog/2016/ruby24-performance-looking-good.html where I have written a bit more about this benchmark.

#!/bin/bash
# Benchmarks several PDF tools: wall-clock time, peak memory and output size.

OUT_FILE=/tmp/bench-result.pdf

# Abort the whole benchmark on Ctrl-C (SIGINT).
trap exit 2

# Usage: bench_file LABEL COMMAND [ARGS...]
function bench_file() {
  cmdname=$1
  FORMAT="| %-20s | %'6ims | %'7iKiB | %'11i |\n"
  shift

  time=$(date +%s%N)
  # GNU time writes the peak resident set size (in KiB) to /tmp/bench-times.
  /usr/bin/time -f '%M' -o /tmp/bench-times "$@" &>/dev/null
  if [ $? -ne 0 ]; then
    cmdname="ERR ${cmdname}"
    time=0
    mem_usage=0
    file_size=0
  else
    time=$(( ($(date +%s%N)-time)/1000000 ))  # nanoseconds -> milliseconds
    mem_usage=$(cat /tmp/bench-times)
    file_size=$(stat -c '%s' "$OUT_FILE")
  fi
  printf "$FORMAT" "$cmdname" "$time" "$mem_usage" "$file_size"
}

cd "$(dirname "$0")"

# Benchmark all PDFs next to this script unless files are given as arguments.
FILES=(*.pdf)
if [ $# -ne 0 ]; then FILES=("$@"); fi

for file in "${FILES[@]}"; do
  file_size=$(printf "%'i" "$(stat -c '%s' "$file")")
  echo "|------------------------------------------------------------|"
  printf "| %-20s |     Time |     Memory |   File size |\n" "$file ($file_size)"
  echo "|------------------------------------------------------------|"
  bench_file "hexapdf" ruby -I../../lib ../../bin/hexapdf -f optimize "${file}" --no-compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts "${OUT_FILE}"
  bench_file "hexapdf C" ruby -I../../lib ../../bin/hexapdf -f optimize "${file}" --compact --object-streams=preserve --xref-streams=preserve --streams=preserve --no-optimize-fonts "${OUT_FILE}"
  bench_file "hexapdf CS" ruby -I../../lib ../../bin/hexapdf -f optimize "${file}" "${OUT_FILE}"
  bench_file "hexapdf CSP" ruby -I../../lib ../../bin/hexapdf -f optimize "${file}" --compress-pages "${OUT_FILE}"
  bench_file origami ruby origami.rb "${file}" "${OUT_FILE}"
  bench_file combine_pdf ruby combine_pdf.rb "${file}" "${OUT_FILE}"
  bench_file "pdftk C?" pdftk "${file}" output "${OUT_FILE}"
  bench_file "qpdf C" qpdf "${file}" "${OUT_FILE}"
  bench_file "qpdf CS" qpdf "${file}" --object-streams=generate "${OUT_FILE}"
  bench_file "smpdf CSP" smpdf "${file}" -o "${OUT_FILE}"
  echo "|------------------------------------------------------------|"
  echo
done