@amontalenti
Last active October 6, 2021 15:44
Simple script that uses BeautifulSoup, requests, and urlparse to spider a sitemap.xml file (CNN used as example)
import os
import requests
from BeautifulSoup import BeautifulSoup
from urlparse import urlparse

sitemap_xml = "http://www.cnn.com/sitemaps/sitemap-specials-2013-11.xml"
sitemap_response = requests.get(sitemap_xml)
soup = BeautifulSoup(sitemap_response.content)
# each <url> entry in the sitemap holds its address in a <loc> child
elements = soup.findAll("url")
urls = [elem.find("loc").string for elem in elements]

for url in urls:
    parsed = urlparse(url)
    # group all files from single domain in same folder
    folder = parsed.netloc
    # replace "/" with "__" so that files can work on-disk
    file = parsed.path.replace("/", "__")
    print "Downloading {url} to {folder}/{file}".format(
        url=url, folder=folder, file=file)
    try:
        os.mkdir(folder)
    except OSError:
        # typically raised because the folder already exists
        pass
    resp = requests.get(url)
    with open(folder + "/" + file, "wb") as output:
        output.write(resp.content)
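
The script above targets Python 2 (print statement, the `BeautifulSoup` 3 import, the `urlparse` module). As a rough sketch only, the two core steps (pulling `<loc>` values out of the sitemap and turning a URL into a folder/file pair) might look like this on Python 3. It swaps BeautifulSoup for the standard library's `xml.etree.ElementTree` so it runs without extra packages, and it skips the network by parsing an inline sample. The names `extract_urls`, `local_path`, `SAMPLE`, the `example.com` URLs, and the `__index` fallback for an empty path are all assumptions for illustration, not part of the original gist.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# XML namespace used by documents that follow the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_bytes):
    """Return the text of every <url><loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_bytes)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def local_path(url):
    """Map a URL to (folder, file) with the same scheme as the script above."""
    parsed = urlparse(url)
    # group all files from a single domain in the same folder
    folder = parsed.netloc
    # replace "/" with "__"; "__index" is a hypothetical fallback for an
    # empty path, which the original script does not handle
    file = parsed.path.replace("/", "__") or "__index"
    return folder, file


# Hypothetical sample sitemap standing in for a downloaded response body.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/a/b.html</loc></url>
  <url><loc>http://www.example.com/c/</loc></url>
</urlset>"""

for url in extract_urls(SAMPLE):
    folder, file = local_path(url)
    print("{url} -> {folder}/{file}".format(url=url, folder=folder, file=file))
```

In a full port, `requests.get(sitemap_xml).content` could be passed to `extract_urls`, and `os.makedirs(folder, exist_ok=True)` would replace the `try`/`except` around `os.mkdir`.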