@junkblocker
Forked from mbafford/README.md
Created July 4, 2024 15:57
Compare two PDFs using ImageMagick - provides a visual comparison and a perceptual hash comparison (numerical)

PDF tools for comparing PDFs visually (overlaying two PDFs to see changed areas) and using a perceptual hash (numerical value indicating visual difference between the two files).

Useful for command line review of PDFs and de-duplication. Configure git to use these tools for better PDF history / comparison in git.
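For de-duplication, the numeric output has to be turned into a yes/no decision. The following is only a sketch of one way to do that: the function name `scores_within_threshold` and the cutoff `0.01` are illustrative assumptions, not part of these scripts, and it assumes the comparison script prints one number per line (possibly one line per page).

```shell
#!/bin/bash
# Sketch: decide whether two PDFs are near-duplicates from the score(s)
# printed by a comparison script such as ~/bin/pdf-compare-phash.

# THRESHOLD is an assumed, illustrative cutoff; tune it on your own files.
THRESHOLD=${THRESHOLD:-0.01}

# Reads one score per line on stdin. Succeeds only if at least one score was
# read and every score is at or below the threshold given as $1.
scores_within_threshold() {
    awk -v t="$1" '
        NF { n++; if ($1 + 0 > t + 0) bad = 1 }
        END { exit (n > 0 && !bad) ? 0 : 1 }
    '
}

# Hypothetical usage (a.pdf and b.pdf are placeholder paths):
#   if ~/bin/pdf-compare-phash a.pdf b.pdf | scores_within_threshold "$THRESHOLD"; then
#       echo "a.pdf and b.pdf look like duplicates"
#   fi
```

Requiring at least one score means a failed comparison (no output) is never reported as a duplicate.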

These scripts require ImageMagick and poppler, both of which can be installed from Homebrew.
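A quick way to check the prerequisites before running anything is sketched below. The helper name `missing_tools` is made up for illustration; `magick` is the ImageMagick 7 entry point and `pdftotext` is one of the tools that ships with poppler.

```shell
#!/bin/bash
# Sketch: print the name of every command in "$@" that is not on PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Check for the ImageMagick and poppler command line tools.
missing=$(missing_tools magick pdftotext)
if [[ -n "$missing" ]]; then
    echo "Missing tools:" $missing
    echo "Try: brew install imagemagick poppler"
fi
```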


Set up git to use a custom diff driver for PDFs:

.gitattributes:

*.pdf binary diff=pdf

.gitconfig:

[diff "pdf"]
    ; textconv = ~/bin/pdf2layout
    command = ~/bin/git-diff-pdf
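
The same two settings can be applied and verified from the command line. This is a sketch that uses a throwaway repository (the `repo` variable and `sample.pdf` are placeholders) so that no real configuration is touched; `git check-attr` reports which diff driver git would select for a path.

```shell
#!/bin/bash
# Sketch: configure the "pdf" diff driver in a temporary repository.
repo=$(mktemp -d)
git init --quiet "$repo"

# Same line as the .gitattributes entry above.
echo '*.pdf binary diff=pdf' > "$repo/.gitattributes"

# Same setting as the [diff "pdf"] section, written to the repo-local config.
# The tilde is stored literally, exactly as in the .gitconfig snippet.
git -C "$repo" config diff.pdf.command '~/bin/git-diff-pdf'

# Show the driver git would use for a PDF path, and the configured command.
git -C "$repo" check-attr diff -- sample.pdf
git -C "$repo" config --get diff.pdf.command
```

In a real repository, drop the temporary directory and run the `git config` command with `--global` to get the `.gitconfig` form shown above.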
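
#!/bin/bash
# Installed as ~/bin/git-diff-pdf (see the .gitconfig entry above).
# Accepts two PDFs, or the seven arguments git passes to an external diff
# command: path old-file old-hex old-mode new-file new-hex new-mode.
if [[ $# -eq 7 ]]; then
    OLD=$2; NEW=$5
else
    OLD=$1; NEW=$2
fi
if [[ -z "$OLD" || -z "$NEW" ]]; then
    echo "Usage: $0 <pdf1> <pdf2>"
    exit 1
fi
echo "comparing [$OLD] and [$NEW]"
# pdf2layout is not included here; it is assumed to be a small wrapper
# around poppler's pdftotext (brew install poppler)
echo "*** text content"
diff <(~/bin/pdf2layout "$OLD") <(~/bin/pdf2layout "$NEW")
echo "*** image perceptual hash"
~/bin/pdf-compare-phash "$OLD" "$NEW"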
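
#!/bin/bash
# Installed as ~/bin/pdf-compare-phash (called by the other two scripts).
# Prints a numeric difference between two PDFs or images; 0 means identical.
if [[ -z "$1" || -z "$2" ]]; then
    echo "Usage: $0 <pdf1> <pdf2>"
    exit 1
fi
# Overlay "$2" on "$1" with the Difference operator, then print the mean
# intensity of the result.
convert -metric phash "$1" null: "$2" -compose Difference -layers composite -format '%[fx:mean]\n' info: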

#!/bin/bash
# Visual comparison: renders both PDFs, writes a difference image, and
# reports the numeric difference from pdf-compare-phash.
if [[ -z "$1" || -z "$2" ]]; then
    echo "Usage: $0 <pdf1> <pdf2>"
    exit 1
fi

# --suffix is a GNU coreutils option; a stock BSD mktemp may not accept it.
TMP=$(mktemp --suffix=.png)

echo "Comparing [$1] to [$2]"
echo "Saving difference in $TMP"
echo

DENSITY=100

# This supports both plain file names and page-indexed file names such as
# file[0] or file[1]: identify prints one line per page, or a single line
# if a single page is specified.
PAGES1=$(magick identify "$1" | wc -l)
PAGES2=$(magick identify "$2" | wc -l)

if (( PAGES1 != PAGES2 )); then
    echo "Number of pages between documents does not match: $PAGES1 != $PAGES2"
    echo "Only comparing the first page."
    magick compare -density "$DENSITY" -background white "$1[0]" "$2[0]" "$TMP"
    PHASH_DIFF=$(~/bin/pdf-compare-phash "$1[0]" "$2[0]")
elif (( PAGES1 > 5 )); then
    echo "Too many pages ($PAGES1 > 5) to create hyper-image with all pages."
    echo "Only comparing the first page."
    magick compare -density "$DENSITY" -background white "$1[0]" "$2[0]" "$TMP"
    PHASH_DIFF=$(~/bin/pdf-compare-phash "$1[0]" "$2[0]")
else
    # Convert each PDF into a single image with the pages stacked vertically.
    ALL1=$(mktemp --suffix=.png)
    magick -density "$DENSITY" "$1" -append "$ALL1"
    ALL2=$(mktemp --suffix=.png)
    magick -density "$DENSITY" "$2" -append "$ALL2"
    magick compare -density "$DENSITY" -background white "$ALL1" "$ALL2" "$TMP"
    PHASH_DIFF=$(~/bin/pdf-compare-phash "$ALL1" "$ALL2")
fi

# Show the difference image inline in iTerm, otherwise open it in the
# default viewer (macOS "open").
if [ "$TERM_PROGRAM" = "iTerm.app" ]; then
    echo "Visual difference between images:"
    echo "--------------------------------"
    imgcat-small "$TMP"
    echo "--------------------------------"
else
    open "$TMP"
fi

echo "Perceptual hash difference (0 is exactly the same): $PHASH_DIFF"