@jasoncartwright
Last active February 28, 2023 09:15
Hit all URLs in a sitemap and report any non-200 responses
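import re

import requests

# Set this to the URL of the sitemap to check
SITEMAP_URL = "https://www.example.com/sitemap.xml"
# Print a progress line after every REPORT_EVERY URLs checked
REPORT_EVERY = 100
# Seconds to wait for each response before giving up (added so a stalled server cannot hang the run)
TIMEOUT = 30

# Send a browser-like User-Agent, since some sites may reject the default one
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:100.0) Gecko/20100101 Firefox/100.0",
}

print("Getting %s" % SITEMAP_URL)
sitemap_text = requests.get(SITEMAP_URL, headers=headers, timeout=TIMEOUT).text
# Capture the contents of every <loc> element, tolerating surrounding whitespace and line breaks
urls = re.findall(r"<loc>\s*(.*?)\s*</loc>", sitemap_text, re.DOTALL)
number_of_urls = len(urls)

print("Checking URLs...")
for url_number, url in enumerate(urls, start=1):
    try:
        response = requests.get(url, headers=headers, timeout=TIMEOUT)
        if response.status_code != 200:
            print("ERROR %s %s" % (response.status_code, url))
    except requests.RequestException as exc:
        # Connection failures and timeouts are reported instead of aborting the run
        print("ERROR %s %s" % (exc, url))
    # Count from 1 so the progress line reflects URLs actually checked
    if url_number % REPORT_EVERY == 0:
        print("Checked %s of %s URLs" % (url_number, number_of_urls))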
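The regular expression above is enough for simple sitemaps. As a possible alternative, the sketch below parses the sitemap as XML with the standard library, which also decodes entities such as `&amp;` and trims whitespace around each URL. The `extract_locs` helper and the `SAMPLE` document are hypothetical names made up for this illustration; they are not part of the original script.

```python
import xml.etree.ElementTree as ET


def extract_locs(sitemap_xml):
    """Return the text of every <loc> element, ignoring XML namespaces.

    Hypothetical helper for illustration; accepts str or bytes.
    """
    root = ET.fromstring(sitemap_xml)
    locs = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "loc" and element.text:
            locs.append(element.text.strip())
    return locs


# Made-up sample sitemap: one URL split across lines, one with an escaped "&".
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>
      https://www.example.com/
    </loc>
  </url>
  <url><loc>https://www.example.com/a?x=1&amp;y=2</loc></url>
</urlset>"""

print(extract_locs(SAMPLE))
# → ['https://www.example.com/', 'https://www.example.com/a?x=1&y=2']
```

To try this in the script, one option is to pass the raw response body, for example `extract_locs(requests.get(SITEMAP_URL, headers=headers).content)`, in place of the `re.findall` call. Note that a sitemap index file also uses `<loc>` for its child sitemaps, so this sketch, like the regular expression, would return those sitemap URLs and not the pages inside them.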