@Florents-Tselai
Last active August 29, 2015 14:19
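#!/usr/bin/env python
# Python 2 script: print the URL of every PDF linked from a web page, one per line.
import urllib2
import re
import sys
from itertools import imap, ifilter

if __name__ == "__main__":
    in_url = sys.argv[1]
    # Base directory of the page, used to resolve relative links.
    current_dir = "/".join(in_url.split('/')[:-1])
    # Collect the value of every href attribute, quoted or unquoted.
    all_urls = re.findall(r'href=[\'"]?([^\'" >]+)', urllib2.urlopen(in_url).read())
    # Keep links ending in 'pdf'; prefix relative ones with the base directory.
    print "\n".join(imap(lambda x: x if x.startswith('http') else current_dir + "/" + x, ifilter(lambda x: x.endswith('pdf'), all_urls)))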
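The script above targets Python 2 (`urllib2`, the `print` statement, `itertools.imap`). A minimal Python 3 sketch of the same idea follows. The function name `extract_pdf_urls` is hypothetical, and the sketch swaps the manual string joining for `urllib.parse.urljoin`, which also resolves root-relative links such as `/docs/a.pdf`. It assumes the page decodes as UTF-8.

```python
#!/usr/bin/env python3
import re
import sys
from urllib.parse import urljoin
from urllib.request import urlopen

# Same pattern as the original script: capture the value of every href attribute.
HREF_RE = re.compile(r'href=[\'"]?([^\'" >]+)')


def extract_pdf_urls(html, base_url):
    """Return absolute URLs for every link in `html` that ends in 'pdf'.

    Hypothetical helper, not part of the original gist.
    """
    links = HREF_RE.findall(html)
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against base_url.
    return [urljoin(base_url, link) for link in links if link.endswith('pdf')]


if __name__ == "__main__" and len(sys.argv) == 2:
    in_url = sys.argv[1]
    # Assumption: the page is UTF-8; undecodable bytes are replaced, not fatal.
    html = urlopen(in_url).read().decode("utf-8", errors="replace")
    print("\n".join(extract_pdf_urls(html, in_url)))
```

Keeping the extraction in a function that takes the HTML as a string makes it possible to try the regex on a saved page without touching the network.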
@Florents-Tselai
Author

# Download all .pdf files referenced in a web page
python scrap_pdf.py http://www.vldb.org/pvldb/vol8.html | parallel wget {}
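The pipeline above relies on GNU `parallel` and `wget` being installed. Where they are not, the download step could be approximated in Python. The sketch below is a stand-in under stated assumptions, not part of the gist: `filename_from_url` and `download_all` are hypothetical names, the fallback file name is arbitrary, and four worker threads is an arbitrary default.

```python
import os
import posixpath
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit, unquote
from urllib.request import urlretrieve


def filename_from_url(url):
    """Derive a local file name from the last path segment of `url`."""
    name = posixpath.basename(unquote(urlsplit(url).path))
    # Fallback for URLs whose path ends in '/'; the name is an arbitrary choice.
    return name or "download.pdf"


def download_all(urls, dest_dir=".", workers=4):
    """Fetch every URL into dest_dir, a few at a time; return the saved paths."""
    def fetch(url):
        path = os.path.join(dest_dir, filename_from_url(url))
        urlretrieve(url, path)  # blocking download of one file
        return path

    # Threads are sufficient here because the work is network-bound.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

To consume the script's output from a pipe, something like `download_all(line.strip() for line in sys.stdin if line.strip())` would play the role of `parallel wget {}`.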
