Download all the PDF files linked in a given webpage.
#!/usr/bin/env python
"""
Download all the PDFs linked on a given webpage.

Usage:
    python grab_pdfs.py url <path/to/directory>

url is required.
path is optional and must be absolute; the script saves to the current
directory if no path is given, or if the given path does not exist.

Requires:
    requests >= 1.0.4
    beautifulsoup4 >= 4.0.0

Download and install with:
    pip install requests beautifulsoup4
"""

__author__ = 'elssar <elssar@altrawcode.com>'
__license__ = 'MIT'
__version__ = '1.0.0'

from os import path, getcwd
from sys import argv, exit
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from requests import get


def get_page(base_url):
    # Fetch the page and return its HTML, failing loudly on a bad status.
    req = get(base_url)
    if req.status_code == 200:
        return req.text
    raise Exception('Error {0}'.format(req.status_code))


def get_all_links(html):
    # Collect every anchor tag that actually carries an href attribute.
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find_all('a', href=True)


def get_pdf(base_url, base_dir):
    html = get_page(base_url)
    links = get_all_links(html)
    if len(links) == 0:
        raise Exception('No links found on the webpage')
    n_pdfs = 0
    for link in links:
        if link['href'].endswith('.pdf'):
            # Resolve relative hrefs against the page URL before fetching.
            content = get(urljoin(base_url, link['href']))
            # Accept 'application/pdf' with or without charset parameters,
            # and only count downloads that actually succeeded.
            if content.status_code == 200 and content.headers.get('content-type', '').startswith('application/pdf'):
                n_pdfs += 1
                with open(path.join(base_dir, link.text + '.pdf'), 'wb') as pdf:
                    pdf.write(content.content)
    if n_pdfs == 0:
        raise Exception('No pdfs found on the page')
    print('{0} pdfs downloaded and saved in {1}'.format(n_pdfs, base_dir))


if __name__ == '__main__':
    if len(argv) not in (2, 3):
        print('Error! Invalid arguments')
        print(__doc__)
        exit(-1)
    url = argv[1]
    arg = argv[2] if len(argv) == 3 else ''
    # Fall back to the current directory when no valid path is supplied.
    base_dir = arg if path.isdir(arg) else getcwd()
    try:
        get_pdf(url, base_dir)
    except Exception as e:
        print(e)
        exit(-1)
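
For example, a typical invocation might look like this (the URL and target directory here are placeholders, not endpoints from the gist):

python grab_pdfs.py https://example.com/papers /home/user/pdfs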
@xorn commented Nov 3, 2016

I get pdf_get() requires exactly 2 args

@YerongLi commented Mar 22, 2017

Me too.

@Ompha commented Apr 3, 2017

I tried the following modifications, which solved the "pdf_get() requires exactly 2 args" problem:
Change line 41 to html = get_page(base_url)
Change line 68 to get_pdf(url, base_dir)

However, the script now gives a new error: "An exception has occurred, use %tb to see the full traceback.
SystemExit: -1".
I traced the error back but cannot find a solution to get this working.
Help would be appreciated. Thanks.

@sharmasubash commented Aug 31, 2017

Nice code, worked like a charm! A couple of tweaks and I was able to download all the PDF files.

@aribaris commented Oct 14, 2017

I also get the "pdf_get() requires exactly 2 args" error no matter what I do.

@Felipe-UnB commented Nov 15, 2017

If I got it right, the point of the "2 args" error is that both arguments, base_url and base_dir, must be present in the call to the function? But it is strange: Python would immediately raise an exception if we tried to run this code without providing the arguments. I made some modifications to this code and it is running:

https://gist.github.com/Felipe-UnB/5c45ea5a8a7910b35dc31fbc750dad58

@danny311296 commented Dec 22, 2017

The easiest solution to this is to just use the wget command in the terminal.
For example:
wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
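(Here -r recurses into linked pages, -P sets the download directory, and -A pdf restricts downloads to files ending in .pdf.)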

@ade-adisa commented May 22, 2018

@danny311296 Your code returns an error

>>> wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
  File "<stdin>", line 1
    wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
                ^
SyntaxError: invalid syntax
@danny311296 commented Aug 25, 2018

@ade-adisa It should work on Ubuntu and most Unix systems; note that wget is a shell command, so run it in a terminal, not inside the Python interpreter.

Maybe try
wget -r -P pdfs -A pdf http://kea.kar.nic.in/
instead on other systems.

@jQwotos commented Apr 18, 2019

Thanks @danny311296

@gayathrigummaraju commented May 11, 2019

@danny311296, the command works amazingly for the above website. For the website https://nclt.gov.in, though, it throws a "cannot verify certificate" error, so should I try the Python code instead?
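
(For reference: that error means wget cannot validate the site's TLS certificate. wget accepts a --no-check-certificate flag, and requests accepts verify=False; both skip verification at the cost of security. A minimal sketch in Python, reusing the URL from the comment above and assuming you accept that risk:)

from requests import get

# Disable TLS certificate verification -- insecure, but it unblocks
# sites with invalid or self-signed certificates.
response = get('https://nclt.gov.in', verify=False)
print(response.status_code)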
