@Proteusiq
Created May 8, 2021 06:35
# Using PyPDF2 and requests (or httpx) to extract text from an online PDF
import io
import requests
import PyPDF2
# my favorite Kierkegaard PDF book
URI = "https://antilogicalism.com/wp-content/uploads/2017/07/thesicknessuntodeath.pdf"
headers = {"user-agent": "Prayson W. Daniel: prayson*at*.com"}
# download the PDF; fail early on HTTP errors instead of parsing an error page
r = requests.get(URI, headers=headers, timeout=30)
r.raise_for_status()
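# extract text page by page from the in-memory PDF
# note: PyPDF2 3.x removed PdfFileReader, numPages, getPage and extractText;
# PdfReader, reader.pages and extract_text() are their replacements
with io.BytesIO(r.content) as f:
    reader = PyPDF2.PdfReader(f)
    data_store = []
    # append each page's text to data_store
    for page in reader.pages:
        data_store.append(page.extract_text())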
# consume data in NLP pipeline
for page in data_store:
    # do awesome things
    print(page)
    print("\n")