@serv-inc
Created December 11, 2018 06:13
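'''Loads the PDF file named in sys.argv[1], extracts its URLs, and tries to open each one.'''
import sys
import urllib.request  # Python 3; the original called Python 2's urllib.urlopen

import PyPDF2  # uses the PyPDF2 1.x names (PdfFileReader, getPage, getObject)


# credits to stackoverflow.com/questions/27744210
def extract_urls(filename):
    '''Yields every link URI found in the page annotations of filename.'''
    key = '/Annots'
    uri = '/URI'
    ank = '/A'
    # Keep the file open while iterating: PyPDF2 reads objects lazily.
    with open(filename, 'rb') as pdf_file:
        pdf = PyPDF2.PdfFileReader(pdf_file)
        for page in range(pdf.getNumPages()):
            page_object = pdf.getPage(page).getObject()
            if key not in page_object:
                continue
            for a in page_object[key]:
                u = a.getObject()
                # Not every annotation carries an action dictionary.
                if ank in u and uri in u[ank]:
                    yield u[ank][uri]


def check_http_url(url):
    '''Opens url; an exception (e.g. urllib.error.URLError) means it could not be loaded.'''
    urllib.request.urlopen(url).close()


if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)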
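
The script above stops at the first URL that fails to load. A minimal sketch of a variant that tries every URL and collects the failures instead is shown here. The function name `check_urls` and its `opener` and `timeout` parameters are assumptions for illustration and are not part of the gist.

```python
import urllib.error
import urllib.request


def check_urls(urls, opener=urllib.request.urlopen, timeout=10):
    """Try every URL and return a list of (url, error message) for failures.

    `opener` is a hypothetical hook (not in the gist) so the function can be
    exercised without network access.
    """
    failures = []
    for url in urls:
        try:
            response = opener(url, timeout=timeout)
            response.close()
        except (urllib.error.URLError, ValueError, OSError) as err:
            # URLError covers HTTP errors and unreachable hosts;
            # ValueError is raised for strings that are not valid URLs.
            failures.append((url, str(err)))
    return failures
```

Combined with the generator above, `check_urls(extract_urls(sys.argv[1]))` would return the broken links of a PDF in one pass rather than raising on the first one.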

jsbien commented Apr 9, 2023

On Debian I have problems installing PyPDF2. I am giving up for now.
