@bartoszek
Created December 28, 2020 16:53
Unscramble a PDF protected with a custom font.
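#!/bin/bash
# Rebuild a PDF with a readable text layer: rasterize each page, OCR it, merge the results.
# depends: pdftoppm and pdfunite (poppler), tesseract
[[ $# != 1 ]] && { echo "usage: $(basename "$0") pdf_file" >&2; exit 10; }
for dep in pdftoppm tesseract pdfunite; do
  hash "$dep" || { echo "requires: $dep" >&2; exit 11; }
done
# tmp: scratch directory, removed when the script exits
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
# strip the directory and the .pdf suffix so the page prefix stays inside $tmp
name=$(basename "${1%.pdf}")
# pdf->png: one image per page
echo "Rasterizing ..." >&2
pdftoppm -png "$1" "$tmp/$name" 2>&1
echo "OCRing ..." >&2
# png->pdf: one searchable PDF per page image
imgs=("$tmp/$name"*.png)
i=0
for img in "${imgs[@]}"; do
  echo -en "Page: $((++i))/${#imgs[@]}\r" >&2
  tesseract -l pol --psm 1 --oem 1 "$img" "${img%.png}" pdf 2>&1
done
# concat pdfs: the result is written next to the input as <name>.copy.pdf
echo "Concatenating ..." >&2
pdfunite "$tmp/$name"*.pdf "${1%.pdf}".copy.pdf 2>&1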

bartoszek commented Dec 28, 2020

When you grab a PDF that uses a custom font to prevent text copying, you can quickly unscramble it with Tesseract OCR.
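
To see whether the OCR pass helped, one option is to compare the text layer of the original with that of the copy the script writes. The sketch below is only an illustration: the helper names `copy_name` and `peek_text` and the file names `unscramble.sh` and `book.pdf` are placeholders that do not appear in the gist, and `peek_text` assumes `pdftotext` from poppler is installed.

```bash
# copy_name: mirror the script's "${1%.pdf}".copy.pdf expansion
# (hypothetical helper, shown only to make the output name explicit).
copy_name() {
  printf '%s.copy.pdf\n' "${1%.pdf}"
}

# peek_text: print the first few lines of page 1's text layer.
# Assumes pdftotext (poppler) is available; "-" sends the text to stdout.
peek_text() {
  pdftotext -f 1 -l 1 "$1" - | head -n 5
}

# Usage, with unscramble.sh and book.pdf as placeholder names:
#   ./unscramble.sh book.pdf
#   peek_text book.pdf                  # original: likely scrambled
#   peek_text "$(copy_name book.pdf)"   # OCR copy: should be readable
```

If the second call prints readable text while the first does not, the copy is the one to search and copy from.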
Please note that -l pol forces the Tesseract language to Polish, which requires the matching Tesseract language pack to be installed. Replace pol with the language you need.
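
A possible way to avoid editing the script for every language is to read the code from the environment and check that the pack is present before OCR starts. This is a sketch under assumptions: `has_lang` and the `OCR_LANG` variable are hypothetical names, not part of the gist, and the `2>&1` is there because some Tesseract versions may print the language list on stderr.

```bash
# has_lang: succeed if language code $1 matches a whole line on stdin.
# Hypothetical helper; feed it the output of `tesseract --list-langs`.
has_lang() {
  grep -qxF -- "$1"
}

# Usage (OCR_LANG is an assumed variable, defaulting to the gist's "pol"):
#   lang=${OCR_LANG:-pol}
#   tesseract --list-langs 2>&1 | has_lang "$lang" \
#     || { echo "missing tesseract language pack: $lang" >&2; exit 12; }
#   tesseract -l "$lang" --psm 1 --oem 1 "$img" "${img%.png}" pdf
```

The check matches a single code only; a combined value such as pol+eng would need to be split on + and checked piece by piece.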
