
@jbarciauskas
Created February 1, 2011 18:44
Crawl URLs loaded from a file and print the links found on them
#!/usr/bin/python
import re
import sys
import urllib

from BeautifulSoup import BeautifulSoup, SoupStrainer

# Parse only <a> tags whose href matches bob.com/
linksToBob = SoupStrainer('a', href=re.compile('bob.com/'))

filename = sys.argv[1]
with open(filename) as f:
    for line in f:
        url = line.strip()  # drop the trailing newline before fetching
        print "Opening " + url
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html, parseOnlyThis=linksToBob)
        for tag in soup:
            try:
                print tag['href'].encode('latin-1')
            except KeyError:
                # <a> tag without an href attribute
                pass
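The script above depends on BeautifulSoup 3 and Python 2. If neither is available, the same href filtering can be sketched with Python 3's standard-library `html.parser` module; this is a minimal sketch, and the `LinkFilter` class name and `'bob.com/'` pattern are illustrative, not part of the original gist.

```python
import re
from html.parser import HTMLParser


class LinkFilter(HTMLParser):
    """Collect href values of <a> tags whose href matches a pattern.

    A stdlib-only stand-in for SoupStrainer('a', href=re.compile(...)).
    """

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')
        if href and self.pattern.search(href):
            self.matches.append(href)


parser = LinkFilter('bob.com/')
parser.feed('<a href="http://bob.com/x">x</a> <a href="http://other.com/">y</a>')
print(parser.matches)  # → ['http://bob.com/x']
```

Feeding each fetched page's HTML to a `LinkFilter` instance would reproduce the gist's per-page link listing without any third-party dependency.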