@phiresky
Created May 29, 2020 19:12
ripgrep pdf text extractor with caching that is much faster than pdfgrep
#!/bin/bash
# usage: `rg --no-line-number --sort-files --pre pdfextract "$@"`
# better and much faster solution: https://github.com/phiresky/ripgrep-all
fname="$1"
cachedir=/tmp/pdfextract
mkdir -p "$cachedir"
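mtime="$(stat -c %Y "$fname")"
# cache key: hash of path + mtime, so a modified PDF gets re-extracted
hash="$(printf '%s.%s\n' "$fname" "$mtime" | sha256sum | cut -c1-64)"
# debug output goes to stderr so it never ends up in the text handed to rg
echo "$hash $fname $mtime" >&2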
cachefname="$cachedir/$hash.txt"
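if [[ ! -f "$cachefname" ]]; then
    pdftotext -layout "$fname" - |
        # add "Page X: " prefix to each line; pdftotext ends every page with a form feed (\f),
        # so count all of them on a line (an empty page yields several in a row)
        awk 'BEGIN { page = 1 } { page += gsub(/\f/, ""); print "Page " page ":", $0 }' > "$cachefname"
fi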
exec cat "$cachefname"
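
The two core steps of the script, the cache key and the page-prefix filter, can be tried without a PDF at hand. The following is a minimal sketch, not part of the original gist: the function names `cache_key` and `prefix_pages` are hypothetical, the input is synthetic text with form feeds standing in for `pdftotext` output, and the awk filter is a variant that counts every form feed on a line. It assumes GNU coreutils (`sha256sum`) is installed.

```bash
# Hypothetical helper: derive a cache key from a path and its mtime.
# Any change to either value yields a different 64-character hex key.
cache_key() {
    printf '%s.%s\n' "$1" "$2" | sha256sum | cut -c1-64
}

# Hypothetical helper: prefix each input line with its page number.
# A form feed (\f) marks a page break; gsub() removes every form feed
# on the line and returns how many it removed, so the counter also
# advances correctly past empty pages.
prefix_pages() {
    awk 'BEGIN { page = 1 } { page += gsub(/\f/, ""); print "Page " page ":", $0 }'
}

# Synthetic two-page input: the form feed precedes the first line of page 2.
printf 'first page\nstill first\n\fsecond page\n' | prefix_pages
# prints:
#   Page 1: first page
#   Page 1: still first
#   Page 2: second page
```

Because the key depends on the modification time, editing a PDF simply produces a new cache file; stale entries are never read again, but they are also never deleted, which is tolerable only because the cache lives under /tmp.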