@Sinkmanu
Created June 23, 2016 12:15
Read and parse the sitemap of a site.
#!/usr/bin/env python3
# Usage: pass the sitemap URL as the first argument; each <loc> entry is printed.
import sys

import requests
from bs4 import BeautifulSoup

url = sys.argv[1]
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0'}
try:
    # Certificate verification is disabled below, so silence the resulting warnings.
    requests.packages.urllib3.disable_warnings()
    r = requests.get(url, verify=False, headers=user_agent)
    # Parse leniently as HTML, then look up the sitemap tags by name.
    soup = BeautifulSoup(r.text, "html5lib")
    for link in soup.find_all('url'):
        loc = link.find('loc')
        # Skip <url> entries that have no <loc> child.
        if loc is not None:
            print(loc.string)
except Exception as e:
    print("[ERR] %s" % e)
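
The parsing step can also be tried offline, without requests or BeautifulSoup. The following is only a sketch using the standard library's xml.etree.ElementTree; the helper name extract_locs and the SAMPLE document are made up for illustration and are not part of the gist. It assumes the input is well-formed XML, which the lenient html5lib parser used in the script does not require.

```python
import xml.etree.ElementTree as ET


def extract_locs(xml_text):
    """Return the text of every <loc> element in a sitemap document.

    Hypothetical helper for illustration; assumes well-formed XML.
    """
    root = ET.fromstring(xml_text)
    locs = []
    for elem in root.iter():
        # Namespaced tags look like "{namespace}loc"; keep only the local name.
        name = elem.tag.rsplit("}", 1)[-1]
        if name == "loc" and elem.text:
            locs.append(elem.text.strip())
    return locs


# Made-up sample data in the sitemaps.org format.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

if __name__ == "__main__":
    for loc in extract_locs(SAMPLE):
        print(loc)
```

Because the helper matches on the tag name alone, it also returns the <loc> entries of a sitemap index file, where they sit inside <sitemap> elements instead of <url> elements.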