@elssar
Created March 14, 2013 11:48
Download all the pdf files linked in a given webpage.
#!/usr/bin/env python
"""
Download all the pdfs linked on a given webpage

Usage -

    python grab_pdfs.py url <path/to/directory>
        url is required
        path is optional. Path needs to be absolute
        will save in the current directory if no path is given
        will save in the current directory if given path does not exist

Requires - requests >= 1.0.4
           beautifulsoup >= 4.0.0

Download and install using

    pip install requests
    pip install beautifulsoup4
"""

__author__ = 'elssar <elssar@altrawcode.com>'
__license__ = 'MIT'
__version__ = '1.0.0'

from os import path, getcwd
from sys import argv, exit

from bs4 import BeautifulSoup as soup
from requests import get

try:
    from urlparse import urljoin  # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3


def get_page(base_url):
    # Fetch the page and return its HTML; any status other than 200 is an error
    req = get(base_url)
    if req.status_code == 200:
        return req.text
    raise Exception('Error {0}'.format(req.status_code))


def get_all_links(html):
    # Name the parser explicitly so bs4 does not have to guess one
    bs = soup(html, 'html.parser')
    links = bs.find_all('a')
    return links


def get_pdf(base_url, base_dir):
    html = get_page(base_url)
    links = get_all_links(html)
    if len(links) == 0:
        raise Exception('No links found on the webpage')
    n_pdfs = 0
    for link in links:
        # <a> tags without an href (e.g. named anchors) are skipped
        href = link.get('href', '')
        if href[-4:] == '.pdf':
            n_pdfs += 1
            # Resolve relative links against the page URL before fetching
            content = get(urljoin(base_url, href))
            # requests exposes the HTTP status as status_code
            if content.status_code == 200 and content.headers.get('content-type', '') == 'application/pdf':
                with open(path.join(base_dir, link.text + '.pdf'), 'wb') as pdf:
                    pdf.write(content.content)
    if n_pdfs == 0:
        raise Exception('No pdfs found on the page')
    print("{0} pdfs downloaded and saved in {1}".format(n_pdfs, base_dir))


if __name__ == '__main__':
    if len(argv) not in (2, 3):
        print('Error! Invalid arguments')
        print(__doc__)
        exit(-1)
    arg = ''
    url = argv[1]
    if len(argv) == 3:
        arg = argv[2]
    # Fall back to the current directory when no valid directory is given
    base_dir = arg if path.isdir(arg) else getcwd()
    try:
        get_pdf(url, base_dir)
    except Exception as e:
        print(e)
        exit(-1)
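
The script selects PDFs by checking whether each href ends in '.pdf' and names the saved file after the link text. A minimal sketch of a stricter variant follows, assuming Python 3 and only the standard library: it matches the extension case-insensitively on the URL path (so query strings do not get in the way) and takes the file name from the URL instead of the link text. `LinkCollector` and `find_pdf_links` are hypothetical names for illustration, not part of the gist.

```python
from html.parser import HTMLParser
from os.path import basename
from urllib.parse import unquote, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that have no href at all
                self.hrefs.append(href)


def find_pdf_links(html, base_url):
    """Return (absolute_url, filename) pairs for links whose path ends in .pdf."""
    collector = LinkCollector()
    collector.feed(html)
    results = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        url_path = urlparse(absolute).path  # path only, without ?query or #fragment
        if url_path.lower().endswith(".pdf"):
            results.append((absolute, unquote(basename(url_path))))
    return results
```

Each returned pair could then be passed to the download step, with the second element used as the file name inside base_dir.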
@xorn

xorn commented Nov 3, 2016

I get pdf_get() requires exactly 2 args

@YerongLi

Me too.

@Ompha

Ompha commented Apr 3, 2017

I tried the following modification, which solved the "pdf_get() requires exactly 2 args" problem:
Change line 41 (the get_page call inside get_pdf) to html= get_page(base_url)
Change line 68 (the get_pdf call in the main block) to get_pdf(url, base_dir)

However, the script now gives a new error: "An exception has occurred, use %tb to see the full traceback.
SystemExit: -1".
I traced the error back but cannot find a way to get this working.
Any help would be appreciated. Thanks.
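
The "SystemExit: -1" message is what IPython shows when a script calls exit(-1). In the gist that call sits in the except branch of the main block, right after the caught exception is printed, so the actual cause is most likely the line printed just before that message. A minimal sketch of reporting the full traceback instead of only the message, assuming Python 3; `run` and `failing_job` are hypothetical names, and `failing_job` merely stands in for get_pdf:

```python
import traceback


def run(job, *args):
    """Call job(*args); on failure print the full traceback and return 1."""
    try:
        job(*args)
    except Exception:
        # print_exc() writes the exception type, message and stack to stderr
        traceback.print_exc()
        return 1
    return 0


def failing_job(url):
    # Stand-in for get_pdf(); it only simulates a failed download
    raise RuntimeError("Error 404 while fetching {0}".format(url))
```

In a script, the return value can be handed to sys.exit, for example sys.exit(run(get_pdf, url, base_dir)), so the exit status still signals failure while the traceback stays visible.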

@sharmasubash

Nice code, worked like a charm! A couple of tweaks and I was able to download all the pdf files.

@aribaris

I also get the "pdf_get() requires exactly 2 args" error whatever I do.

@Felipe-UnB

Felipe-UnB commented Nov 15, 2017

If I got it right, the "2 args" error is meant as a way to check that both arguments, base_url and base_dir, are present in the function call? But that is strange, since Python would immediately raise an exception if we ran this code without providing the arguments. I made some modifications to this code and it is now running.

https://gist.github.com/Felipe-UnB/5c45ea5a8a7910b35dc31fbc750dad58

@dannyi96

The easiest solution is to just use the wget command in the terminal.
For example:
wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/

@ade-adisa

@danny311296 Your code returns an error

>>> wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
  File "<stdin>", line 1
    wget -r -P ./pdfs -A pdf http://kea.kar.nic.in/
                ^
SyntaxError: invalid syntax
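
The `>>>` prompt and the SyntaxError suggest the command was typed into the Python interpreter. wget is a separate command-line program, so it has to be run from a shell. If it must be started from Python, one option is the subprocess module. This is a sketch, assuming Python 3 and that wget is installed and on the PATH; `build_wget_command` and `download_pdfs` are hypothetical helper names.

```python
import shutil
import subprocess


def build_wget_command(url, dest_dir="pdfs"):
    """Build the argument list for the wget call quoted in the thread."""
    # -r: follow links recursively, -P: directory to save into,
    # -A: accept only files whose names match the given suffix list
    return ["wget", "-r", "-P", dest_dir, "-A", "pdf", url]


def download_pdfs(url, dest_dir="pdfs"):
    """Run wget if it is installed; return its exit status, or None if it is missing."""
    if shutil.which("wget") is None:
        return None
    return subprocess.call(build_wget_command(url, dest_dir))
```

Passing the arguments as a list avoids shell quoting issues with the URL.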

@dannyi96

dannyi96 commented Aug 25, 2018

@Adisain It should work on Ubuntu and most Unix systems.

Maybe try
wget -r -P pdfs -A pdf http://kea.kar.nic.in/
instead on other systems

@jQwotos

jQwotos commented Apr 18, 2019

Thanks @danny311296

@gayathrigummaraju

@danny, the command works amazingly for the above website. But for the website https://nclt.gov.in it throws a "cannot verify certificate" error, so should I try Python code instead?
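
A "cannot verify certificate" error means the certificate chain presented by the server could not be validated against the local trust store. Switching to Python does not by itself remove the cause, since requests also verifies certificates by default (its get function takes a `verify` argument for this). The usual options are to supply a CA bundle that contains the missing certificate or, only for a host you already trust, to turn verification off; wget has the --no-check-certificate flag for the latter. A minimal standard-library sketch of both options, assuming Python 3; `make_context` and `fetch` are hypothetical names.

```python
import ssl
import urllib.request


def make_context(ca_file=None, insecure=False):
    """Build an SSL context: default CAs, a custom CA bundle, or no verification."""
    if insecure:
        # Disables certificate checks entirely; only for hosts you already trust
        context = ssl.create_default_context()
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
        return context
    # cafile=None falls back to the system's default trust store
    return ssl.create_default_context(cafile=ca_file)


def fetch(url, context):
    """Download url using the given SSL context and return the raw bytes."""
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()
```

Supplying a CA bundle keeps the connection authenticated, so it is generally the safer of the two choices.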
