@larryxiao
Last active October 18, 2017 08:56
Extract text from PDF, then remove unnecessary characters: change '\n' into '||' and '\f' into ' '.
libreoffice --convert-to pdf *.ppt
libreoffice --headless --convert-to pdf *.ppt
20130607
CONVERT
EXTRACT
CLEANUP
libreoffice --convert-to pdf *.ppt
pdf2txt - extracts text contents of PDF files
pdftk
pdftk 1.pdf 2.pdf 3.pdf cat output merged.pdf
in alphabetical order: pdftk *.pdf cat output merged.pdf
#!/bin/bash
for f in *.txt
do
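echo "Processing $f file..."
# make sure the output directory exists, otherwise mv/redirects would create a plain file named "out"
mkdir -p ./out
# tr maps one character to one character, so tr '\n' '||' would only emit a single '|';
# awk's output record separator writes the two-character "||" instead
tr '\f' ' ' < "$f" | awk 'BEGIN { ORS = "||" } { print }' > "./out/$f.out"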
done
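
Note that `tr` translates character for character, so a two-character replacement such as `||` needs a different tool. Below is a minimal sketch of the cleanup step as a stdin-to-stdout filter, assuming `awk` is available. The function name `cleanup_text` is hypothetical and not part of the original gist.

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): read text on stdin,
# turn form feeds into spaces and every newline into "||".
cleanup_text() {
    # tr handles the one-to-one swap (\f -> space);
    # awk sets the output record separator to the two-character "||".
    tr '\f' ' ' | awk 'BEGIN { ORS = "||" } { print }'
}

# Usage: cleanup_text < input.txt > input.cleaned
```

Keeping the filter separate from the file loop makes it easy to try on a single file before running it over a whole directory.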
#!/bin/bash
#FILES=./*.pdf
#Processing ./20130604202323560.pdf file... "output./20130604202323560.pdf
#for f in $FILES
#Processing 20130604202323560.pdf file... "output20130604202323560.pdf
for f in *.pdf
do
echo "Processing $f file... \"output$f.txt\""
# quote "$f" so file names containing spaces are passed as a single argument
pdf2txt -o "output$f.txt" "$f"
done
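
The commented-out notes in the script show that a glob like `./*.pdf` leaks the `./` prefix into the output name. Here is a small sketch of one way to derive a clean name with parameter expansion, assuming a `.txt` file in the current directory is wanted. `txt_name_for` is a hypothetical helper, not something from the original gist.

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): map a PDF path to a
# text file name, dropping any directory prefix and the .pdf extension.
txt_name_for() {
    local base
    base=${1##*/}        # strip leading directories: "./a.pdf" -> "a.pdf"
    base=${base%.pdf}    # strip the extension:       "a.pdf"   -> "a"
    printf '%s.txt\n' "$base"
}

# Usage inside the extraction loop:
#   pdf2txt -o "$(txt_name_for "$f")" "$f"
```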
@larryxiao
Author

merge
pdftk *.pdf cat output merged.pdf
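
Because the shell expands `*.pdf` alphabetically, `10.pdf` sorts before `2.pdf`. If the files are numbered without zero padding, a sketch like the following may help. It assumes GNU `sort -V` (version sort) and file names without newlines, and `list_pdfs_naturally` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical helper (not in the original gist): print the given file
# names in natural (version) order, one per line.
# Assumes GNU sort with -V and no newlines in file names.
list_pdfs_naturally() {
    printf '%s\n' "$@" | sort -V
}

# Usage (bash 4+ for mapfile, pdftk installed):
#   mapfile -t files < <(list_pdfs_naturally *.pdf)
#   pdftk "${files[@]}" cat output merged.pdf
```

When re-running the merge in the same directory, move or rename the earlier `merged.pdf` first, since it also matches `*.pdf`.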

@larryxiao
Author

Newer-version PDF creator: remove optimization before processing.

Uncompress PDF page streams for editing the PDF in a text editor (e.g., vim, emacs)
pdftk doc.pdf output doc.unc.pdf uncompress

http://www.pdflabs.com/docs/pdftk-cli-examples

This was not the solution.

@larryxiao
Author

http://stackoverflow.com/questions/10772686/pdftk-and-qpdf-to-reset-pdf-commenting-security

The command qpdf --decrypt input.pdf output.pdf removes the 'owner' password. But it only works if there is no 'user' password set.

Once the owner password is removed, output.pdf should already have all security protection unset and commenting allowed. There is no need to run your extra pdftk ... command then. BTW, the allow parameter in your pdftk call will not work the way you quoted your command. The allow permissions will only be applied if you also...

...either specify an encryption strength
...or give a user or an owner password
Try the following to find out the detailed security settings of the file(s):

qpdf --show-encryption input.pdf
qpdf --show-encryption output.pdf
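
The steps from the quoted answer can be sketched as a small wrapper. It uses only the two `qpdf` invocations quoted above. The function name `remove_owner_password` is hypothetical, and the sketch assumes the input has an owner password only (no user password).

```shell
#!/bin/bash
# Hypothetical wrapper (not in the original gist) around the quoted qpdf calls.
# Assumes qpdf is installed and the input has no user password set.
remove_owner_password() {
    local src=$1 dst=$2
    if [ ! -f "$src" ]; then
        echo "input not found: $src" >&2
        return 1
    fi
    qpdf --show-encryption "$src"          # inspect the current security settings
    qpdf --decrypt "$src" "$dst" || return 1
    qpdf --show-encryption "$dst"          # compare the settings after decryption
}

# Usage: remove_owner_password input.pdf output.pdf
```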

@larryxiao
Author

pdf2txt -o "merged.txt" merged.pdf

@larryxiao
Author

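PDF to image (jpg/png), multiple pages
convert -density 150 file.pdf -quality 100 result.png
-density has to come before the input PDF to set the rasterization resolution; a multi-page PDF produces result-0.png, result-1.png, ...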
