@WindfallLabs
Created July 25, 2019 03:13
Reads a page from a PDF by page number and cleans the contents
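import re
import string

import PyPDF2

# Characters kept by the filter: ASCII letters, digits, punctuation, whitespace
PRINTABLE = set(string.printable)
# InDesign filename left in the page text, e.g. "<digits>_<name>_<name>.indd"
TERMINUS = r"\d+_.*_.*\.indd"


def clean_pdf_text(pdfobj, page_num):
    """Return the cleaned text of page `page_num` (zero-based) of `pdfobj`."""
    # Get text from PDF object by page number
    # (PyPDF2 1.x names; PyPDF2 3.x uses pdfobj.pages[page_num].extract_text())
    page_text = pdfobj.getPage(page_num).extractText()
    # Collapse double spaces into single spaces
    page_text = page_text.replace("  ", " ")
    # Remove non-ASCII chars (join keeps the result a str on Python 3)
    page_text = "".join(c for c in page_text if c in PRINTABLE)
    # Remove page-broken hyphens and all newlines
    page_text = page_text.replace("-\n", "").replace("\n", "")
    # Keep only the text before the InDesign filename (*.indd)
    return re.split(TERMINUS, page_text)[0]


# Example
with open("my.pdf", "rb") as f:
    pdf = PyPDF2.PdfFileReader(f)
    print(clean_pdf_text(pdf, 5))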
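
The cleaning steps can also be tried without a PDF file. The following is a minimal sketch, not part of the original gist: `clean_text` is a hypothetical helper that applies the same four steps to a plain string, and the sample text is made up for illustration.

```python
import re
import string

PRINTABLE = set(string.printable)
TERMINUS = re.compile(r"\d+_.*_.*\.indd")


def clean_text(page_text):
    """Apply the gist's cleaning steps to an already-extracted string."""
    # Collapse double spaces into single spaces
    page_text = page_text.replace("  ", " ")
    # Drop every character outside string.printable (i.e. non-ASCII)
    page_text = "".join(c for c in page_text if c in PRINTABLE)
    # Rejoin words hyphenated across a line break, then drop remaining newlines
    page_text = page_text.replace("-\n", "").replace("\n", "")
    # Keep only the text before the InDesign filename
    return TERMINUS.split(page_text)[0]


# Made-up sample: a double space, a broken hyphen, a non-ASCII char, an .indd footer
sample = "A  hyphen-\nated caf\u00e9 line\n12_my_doc.indd 4"
print(clean_text(sample))  # prints "A hyphenated caf line"
```

Note that non-ASCII characters are dropped rather than transliterated ("café" becomes "caf"), and that removing newlines outright fuses the last word of one line with the first word of the next.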