@mshmsh5000
Created June 16, 2012 02:01
Sitemap-driven crawler
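import urllib.request
import xml.dom.minidom

def getText(nodelist):
    # Concatenate the data of every text node in the list.
    rc = []
    for node in nodelist:
        if node.nodeType == node.TEXT_NODE:
            rc.append(node.data)
    return ''.join(rc)

# Download the sitemap and parse it into a DOM tree.
dom = xml.dom.minidom.parseString(urllib.request.urlopen('http://www.lesters.com/sitemap/sitemap.xml').read())
locs = dom.getElementsByTagName('loc')
for loc in locs:
    # Each <loc> element holds one page URL; strip stray whitespace around it.
    url = getText(loc.childNodes).strip()
    # Fetch the page (urlretrieve saves it to a temporary file) and report the URL.
    urllib.request.urlretrieve(url)
    print(url)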
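
The URL-extraction step of the script above can also be tried without touching the network. What follows is a minimal sketch in Python 3, not part of the original gist: the function name `extract_locs` and the `SAMPLE` document are hypothetical, introduced only for illustration, and the example.com URLs are placeholders.

```python
import xml.dom.minidom


def extract_locs(sitemap_xml):
    """Return the URLs found in the <loc> elements of a sitemap document."""
    dom = xml.dom.minidom.parseString(sitemap_xml)
    urls = []
    for loc in dom.getElementsByTagName('loc'):
        # Join the text nodes of the element and trim surrounding whitespace.
        text = ''.join(
            node.data for node in loc.childNodes
            if node.nodeType == node.TEXT_NODE
        ).strip()
        if text:
            urls.append(text)
    return urls


# Hypothetical sample sitemap, used only to exercise the function offline.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url><loc>
    http://example.com/about
  </loc></url>
</urlset>"""

print(extract_locs(SAMPLE))
```

Keeping the parsing in its own function means the download loop can be changed (for example, to add delays or error handling) without altering how the sitemap is read.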