@reubano
Forked from ndunn219/sitemap_checker.py
Last active November 15, 2017 20:37
This code shows how to check a sitemap to make sure there are no links pointing to missing pages and that 301 redirects are working correctly. It is explained at https://www.webucator.com/blog/2016/05/checking-your-sitemap-for-broken-links-with-python/
import requests
from bs4 import BeautifulSoup

# Fetch the sitemap page and collect every anchor tag on it
sitemap = 'http://www.nasa.gov/sitemap/sitemap_nasa.html'
r = requests.get(sitemap)
html = r.content
soup = BeautifulSoup(html, 'html.parser')
links = soup.find_all('a')

# Keep only absolute http(s) URLs, skipping anchors without an href
hrefs = filter(None, (link.get('href') for link in links))
urls = (href for href in hrefs if href.startswith('http'))

# Send a HEAD request for each URL; follow redirects so r.history is populated
reqs = (requests.head(url, allow_redirects=True) for url in urls)
results = sorted(reqs, key=lambda r: (r.status_code, len(r.history)))
print('\n==========\nOK')
for r in results:
    if r.ok and not r.history:
        print(' - '.join([str(r.status_code), r.reason, r.url]))

print('\n==========\nREDIRECTS')
for r in results:
    if r.ok and r.history:
        # r.history holds the intermediate 3xx responses; each one's Location
        # header says where the request was sent next
        print('{} redirected'.format(r.history[0].url))

        for response in r.history:
            print('>> Redirect to {}'.format(response.headers['location']))

print('\n==========\nERRORS')
for r in results:
    if not r.ok:
        print(' - '.join([str(r.status_code), r.reason, r.url]))
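One practical caveat: requests.head has no default timeout, so a single unresponsive host will hang the whole run, and a connection error will raise an exception and stop the script. Below is a minimal hardening sketch, assuming the urls generator defined above; the safe_head helper name and the 10-second timeout are illustrative choices, not part of the original code.

import requests

def safe_head(url, timeout=10):
    """Return the HEAD response for url, or None if the request fails."""
    try:
        return requests.head(url, allow_redirects=True, timeout=timeout)
    except requests.RequestException as exc:
        print(' - '.join(['ERR', str(exc), url]))
        return None

# Drop failed requests so the sorting and reporting above still work unchanged
reqs = (safe_head(url) for url in urls)
results = sorted((r for r in reqs if r is not None),
                 key=lambda r: (r.status_code, len(r.history)))

Requests that fail outright are reported immediately and excluded from results, so the OK, REDIRECTS, and ERRORS sections keep working as before.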