@hrbrmstr
Created September 21, 2013 13:20
  • Save hrbrmstr/6650537 to your computer and use it in GitHub Desktop.
Small Python script to extract HREFs from a web page (printed to stdout). It fills in the base URL prefix for relative URLs and accepts as many URLs on the command line as your shell/interpreter allows. Lightweight faking of the user agent for more restrictive sites.
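The base-URL filling described above is just the standard library's urljoin. A quick illustration of how it resolves different kinds of hrefs (in Python 2, as in the script below, it lives at urlparse.urljoin; in Python 3 it moved to urllib.parse.urljoin):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = "http://example.com/posts/index.html"

# Sibling-relative href resolves against the page's directory
print(urljoin(base, "page2.html"))         # http://example.com/posts/page2.html
# Root-relative href resolves against the host
print(urljoin(base, "/about"))             # http://example.com/about
# Absolute hrefs pass through unchanged
print(urljoin(base, "http://other.org/"))  # http://other.org/
```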
#!/usr/bin/python
# Python 2: urllib2 and urlparse were merged into urllib in Python 3
import urllib2
from urlparse import urljoin
import sys
from bs4 import BeautifulSoup

if len(sys.argv) < 2:
    print "Usage:\n  hrefs url [url] ..."
    sys.exit(1)

for URL in sys.argv[1:]:
    # Lightweight user-agent faking for sites that block the default Python UA
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    req = urllib2.Request(URL, None, headers)
    soup = BeautifulSoup(urllib2.urlopen(req))
    # Resolve each href against the page URL so relative links print as absolute
    for link in soup.find_all('a'):
        print urljoin(URL, link.get('href'))
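The script above is Python 2 only (urllib2 and the print statement are gone in Python 3). A rough Python 3 port might look like the sketch below; note that it swaps BeautifulSoup for the standard library's html.parser (a substitution, not what the original uses), so it needs no third-party packages:

```python
#!/usr/bin/env python3
import sys
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class HrefCollector(HTMLParser):
    """Collect href attribute values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(base_url, html):
    """Return all <a href> values in html, resolved against base_url."""
    parser = HrefCollector()
    parser.feed(html)
    return [urljoin(base_url, h) for h in parser.hrefs]


if __name__ == '__main__':
    if len(sys.argv) < 2:
        print("Usage:\n  hrefs url [url] ...")
        sys.exit(1)
    # Same lightweight user-agent faking as the original script
    headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    for url in sys.argv[1:]:
        req = Request(url, headers=headers)
        with urlopen(req) as resp:
            charset = resp.headers.get_content_charset() or 'utf-8'
            html = resp.read().decode(charset, errors='replace')
        for href in extract_hrefs(url, html):
            print(href)
```

html.parser is more forgiving of broken markup than it looks, but BeautifulSoup remains the better choice for messy real-world pages; the stdlib version is just dependency-free.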