@stefanschmidt
Created July 8, 2015 21:43
Find raw line numbers of text snippets in PDFs with manual kerning
# qpdf --qdf rewrites the PDF in a readable form with uncompressed content streams
# depends on GNU sed and GNU grep (available via Homebrew)
# GNU sed because BSD sed throws an "illegal byte sequence" error
# GNU grep because BSD grep omits lines
# iconv to properly display non-UTF8 encodings
# doesn't find hyphenated words or words with ligatures
qpdf --qdf doc.pdf - | iconv -f ISO-8859-1 -t UTF-8 | gsed -E 's/\)-?[0-9]+(\.[0-9]+)?\(//g' | ggrep -ian foobar
# usage example
#
# $ curl -o doc.pdf http://www.diejungeakademie.de/fileadmin/user_upload/Bilder/Ueber_uns/geschaeftsstelle/JA_Magazin_18_RZ_Ansicht.pdf
# $ qpdf --qdf doc.pdf - | iconv -f ISO-8859-1 -t UTF-8 | ggrep -ian erschließung
# 19679:[(K)47(onkret)15(e )50(Themen waren die Erschließung unbekannt)15(er Räume )]TJ
# $ qpdf --qdf doc.pdf - | iconv -f ISO-8859-1 -t UTF-8 | ggrep -ian 'erschließung unbekannter'
# (no match: the kerning value 15 splits "unbekannt" and "er" in the raw line)
# $ qpdf --qdf doc.pdf - | iconv -f ISO-8859-1 -t UTF-8 | gsed -E 's/\)-?[0-9]+(\.[0-9]+)?\(//g' | ggrep -ian 'erschließung unbekannter'
# 19679:[(Konkrete Themen waren die Erschließung unbekannter Räume )]TJ
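#
# A minimal sketch of what the sed expression does, run on the sample TJ line
# from the usage example above instead of a real PDF. The variable names `line`
# and `joined` are illustrative, and plain `sed -E` is assumed here in place of
# gsed so the snippet also runs where GNU sed is installed as `sed`.

```shell
# Sample TJ array copied from the usage example: text fragments in
# parentheses, separated by kerning adjustments such as 47, 15 and 50.
line='[(K)47(onkret)15(e )50(Themen waren die Erschließung unbekannt)15(er Räume )]TJ'

# Delete every ")<number>(" boundary (optional minus sign, optional decimal
# part) so that adjacent fragments are joined into one searchable string.
joined=$(printf '%s\n' "$line" | sed -E 's/\)-?[0-9]+(\.[0-9]+)?\(//g')

printf '%s\n' "$joined"
```

# The joined line can then be matched with a multi-word pattern, which is the
# reason the substitution sits in front of grep in the pipeline.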