@xbns
Last active June 6, 2019 09:49
#parallel #scraping pdfs
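from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL = "https://aws.amazon.com/whitepapers/"

r = requests.get(URL)
r.raise_for_status()  # fail early on an HTTP error instead of parsing an error page
soup = BeautifulSoup(r.text, "lxml")
for link in soup.find_all('a', href=True):
    # skip all links except PDF ones
    if not link['href'].endswith('pdf'):
        continue
    # resolve relative or protocol-relative hrefs so wget receives a full URL
    print(urljoin(URL, link['href']))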
## usage
# $ python download-pdfs.py > aws-whitepapers.txt
# then, to fetch the listed URLs with 20 parallel jobs:
# $ parallel -j 20 --gnu -a aws-whitepapers.txt wget -nc
# -nc, --no-clobber: skip a download if the target file already exists
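
If GNU parallel and wget are not available, the download step above could be approximated in Python itself. The following is only a sketch, not part of the original gist: the names `local_name`, `fetch`, and `download_all` are hypothetical, it uses the standard library's `urllib.request` rather than `requests`, and it assumes each URL ends in a usable file name. A thread pool stands in for `parallel -j 20`, and the existence check roughly mimics `wget -nc`.

```python
import os
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def local_name(url):
    """Return the last path segment of the URL (hypothetical helper)."""
    return os.path.basename(urlsplit(url).path)


def fetch(url, dest_dir="."):
    """Download one URL unless the target file already exists (like wget -nc)."""
    path = os.path.join(dest_dir, local_name(url))
    if os.path.exists(path):
        return path, "skipped"
    with urllib.request.urlopen(url, timeout=60) as resp, open(path, "wb") as fh:
        shutil.copyfileobj(resp, fh)  # stream the body to disk
    return path, "downloaded"


def download_all(urls, dest_dir=".", workers=20):
    """Fetch all URLs with a pool of worker threads (like parallel -j 20)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch(u, dest_dir), urls))


# Example use, assuming aws-whitepapers.txt was produced as in the usage note:
#   with open("aws-whitepapers.txt") as fh:
#       urls = [line.strip() for line in fh if line.strip()]
#   for path, status in download_all(urls):
#       print(status, path)
```

Threads are a reasonable fit here because the work is network-bound. Unlike wget, this sketch does not retry or resume, and a failed transfer can leave a partial file that a later run would skip.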