@michaelkrieg
Created November 17, 2015 13:02
Parse a remote sitemap.xml and extract all URLs (e.g. for further random access)
#!/usr/bin/env python3
#
# credits go to: https://gist.github.com/chrisguitarguy/1305010
# rewritten for Python3 and bs4 by Michael Krieg <krieg@centrias-colocation.de>
#
from argparse import ArgumentParser
import requests
import bs4 as bs

def parse_sitemap(url):
    """Fetch a remote sitemap.xml and return [loc, priority, changefreq, lastmod] per <url> entry."""
    resp = requests.get(url)
    if resp.status_code != 200:
        return False

    # Parse the XML and collect every <url> element.
    soup = bs.BeautifulSoup(resp.content, 'xml')
    urls = soup.find_all('url')
    if not urls:
        return False

    def text_or_none(node, tag):
        # <priority>, <changefreq> and <lastmod> are optional in the sitemap
        # protocol, so guard against missing child tags.
        child = node.find(tag)
        return child.string if child is not None else None

    out = []
    for u in urls:
        loc = text_or_none(u, 'loc')
        prio = text_or_none(u, 'priority')
        change = text_or_none(u, 'changefreq')
        last = text_or_none(u, 'lastmod')
        out.append([loc, prio, change, last])
    return out


if __name__ == '__main__':
    options = ArgumentParser()
    options.add_argument('-u', '--url', action='store', dest='url',
                         help='Link to remote sitemap.xml', required=True)
    args = options.parse_args()

    urls = parse_sitemap(args.url)
    if not urls:
        print('There was an error!')
    else:
        # Print only the <loc> value of each entry.
        for u in urls:
            print(u[0])
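
The description mentions using the extracted URLs for further random access. A minimal sketch of that, assuming the script above is saved as parse_sitemap.py next to this snippet and using https://example.com/sitemap.xml as a placeholder sitemap URL:

import random
import requests
from parse_sitemap import parse_sitemap  # assumes the gist above is saved as parse_sitemap.py

entries = parse_sitemap('https://example.com/sitemap.xml')  # placeholder URL
if entries:
    # Each entry is [loc, priority, changefreq, lastmod]; pick one loc at random and fetch it.
    loc = random.choice(entries)[0]
    resp = requests.get(loc)
    print(loc, resp.status_code)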
@michaelkrieg (Author)

And you need these packages installed (lxml backs the 'xml' parser that BeautifulSoup uses here):

$ pip freeze --local
beautifulsoup4==4.4.1
lxml==3.5.0
requests==2.8.1
wheel==0.24.0
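
With those installed, a run could look like this (the filename parse_sitemap.py and the sitemap URL are placeholders):

$ pip install beautifulsoup4==4.4.1 lxml==3.5.0 requests==2.8.1
$ python3 parse_sitemap.py -u https://example.com/sitemap.xml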
