Gist hrbrmstr/6650537, created September 21, 2013 13:20
Small Python script that extracts the HREFs from a web page and prints them to stdout. It resolves relative URLs against the page URL, and it accepts as many URLs on the command line as your shell and Python interpreter allow. It also sends a fake browser User-Agent header, which helps with sites that reject the default one.
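#!/usr/bin/python
# Python 2 script: print every link (href) found on each URL given on the command line.
import urllib2
from urlparse import urljoin
import sys
from bs4 import BeautifulSoup

if len(sys.argv) < 2:
    print "Usage:\n hrefs url [url] ..."
    sys.exit(1)

# Pretend to be a browser; some sites reject the default urllib2 user agent.
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)' }

for URL in sys.argv[1:]:
    req = urllib2.Request(URL, None, headers)
    # Name the parser explicitly so bs4 does not pick a different one per machine.
    soup = BeautifulSoup(urllib2.urlopen(req), 'html.parser')
    # href=True skips <a> tags that have no href attribute.
    for link in soup.find_all('a', href=True):
        # Resolve relative hrefs against the page URL.
        print urljoin(URL, link['href'])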
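
The script above targets Python 2 and depends on bs4. As a rough illustration of the same idea on Python 3, here is a minimal sketch of the extract-and-resolve step using only the standard library. It swaps BeautifulSoup for `html.parser`, which is not what the gist uses, and it works on an HTML string so it can be tried without a network connection. The names `HrefCollector` and `extract_hrefs` and the example.com URLs are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "a":
            return
        for name, value in attrs:
            # Skip valueless attributes such as <a href>.
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(html, base_url):
    """Return the links in html as absolute URLs, resolved against base_url."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return [urljoin(base_url, href) for href in collector.hrefs]


if __name__ == "__main__":
    sample = (
        '<a href="/about">About</a>'
        '<a href="notes.html">Notes</a>'
        '<a href="https://example.org/">Elsewhere</a>'
        '<a name="top">No href here</a>'
    )
    for url in extract_hrefs(sample, "https://example.com/docs/index.html"):
        print(url)
```

With the sample above, the root-relative link resolves against the host, the plain relative link resolves against the `/docs/` directory, the absolute link passes through unchanged, and the anchor without an href is skipped.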