@rickardlindberg
Created January 10, 2015 10:06
This Python script checks a web page for broken links. I wrote it for a workshop about continuous integration with Jenkins.
import re
import sys

import requests

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

TIMEOUT_IN_SECONDS = 10.0


def check(base_url):
    """Fetch base_url and assert that every http(s) link on it returns 200."""
    print("Checking %s" % base_url)
    base_response = requests.get(base_url, timeout=TIMEOUT_IN_SECONDS)
    assert base_response.status_code == 200
    assert base_response.headers["Content-Type"].startswith("text/")
    # Find the href of every <a> tag and resolve it against the base URL.
    for link_match in re.finditer(r"<a.*?href=\"(.*?)\"", base_response.text):
        link_url = urljoin(base_url, link_match.group(1))
        # Skip non-HTTP schemes such as mailto:.
        if link_url.startswith("http"):
            print(" %s" % link_url)
            link_response = requests.get(link_url, timeout=TIMEOUT_IN_SECONDS)
            assert link_response.status_code == 200


if __name__ == "__main__":
    for base_url in sys.argv[1:]:
        check(base_url)

Examples:

python check.py http://gnome.org
python check.py http://google.com http://duckduckgo.com
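
To see which links the script would follow without making any network requests, the extraction step can be tried on its own. The following is a minimal sketch, not part of the original script: the helper name `extract_links` and the sample HTML are hypothetical, and it assumes Python 3 (`urllib.parse`). It uses the same regular expression and the same `startswith("http")` filter as `check`.

```python
import re
from urllib.parse import urljoin

# Same pattern as in check(): capture the href value of each <a> tag.
LINK_PATTERN = re.compile(r"<a.*?href=\"(.*?)\"")


def extract_links(base_url, html):
    """Return the absolute http(s) URLs of all <a href="..."> links in html.

    Hypothetical helper for illustration; it mirrors the loop in check()
    but collects the URLs instead of requesting them.
    """
    links = []
    for match in LINK_PATTERN.finditer(html):
        # Resolve relative links against the page URL.
        url = urljoin(base_url, match.group(1))
        # Keep only http:// and https:// links, as check() does.
        if url.startswith("http"):
            links.append(url)
    return links


if __name__ == "__main__":
    # Made-up sample page with a relative, an absolute, and a mailto link.
    sample = (
        '<a href="/about">About</a> '
        '<a class="x" href="http://example.org/">External</a> '
        '<a href="mailto:me@example.com">Mail</a>'
    )
    for link in extract_links("http://example.com/index.html", sample):
        print(link)
```

The relative link is resolved against the base URL and the `mailto:` link is dropped. Because the pattern only matches double-quoted `href` attributes, links written with single quotes or without quotes are not found; an HTML parser would be more robust, at the cost of a longer script.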
