
@vpetersson
Last active October 8, 2019 13:54
Parse and dump a sitemap (using Python)
#! /usr/bin/env python
# -*- coding: utf-8 -*-
"""
Inspired by Craig Addyman (http://www.craigaddyman.com/parse-an-xml-sitemap-with-python/)
Enhanced by Viktor Petersson (http://viktorpetersson.com) / @vpetersson
"""
from bs4 import BeautifulSoup
import requests


def get_sitemap(url):
    # Fetch the sitemap body; prints a message and returns None on a non-200 response.
    get_url = requests.get(url)
    if get_url.status_code == 200:
        return get_url.text
    else:
        print('Unable to fetch sitemap: %s.' % url)


def process_sitemap(s):
    # Return the text of every <loc> element; empty list if the fetch failed.
    if not s:
        return []
    # Name the parser explicitly so the result does not depend on what is installed.
    soup = BeautifulSoup(s, 'html.parser')
    result = []
    for loc in soup.find_all('loc'):
        result.append(loc.text)
    return result


def is_sub_sitemap(s):
    # Treat any URL ending in .xml that mentions "sitemap" as a nested sitemap.
    return s.endswith('.xml') and 'sitemap' in s


def parse_sitemap(s):
    # Walk the sitemap, expanding nested sitemaps until only page URLs remain.
    sitemap = process_sitemap(s)
    result = []
    while sitemap:
        candidate = sitemap.pop()
        if is_sub_sitemap(candidate):
            sub_sitemap = get_sitemap(candidate)
            for i in process_sitemap(sub_sitemap):
                sitemap.append(i)
        else:
            result.append(candidate)
    return result


def main():
    sitemap = get_sitemap('https://www.cloudsigma.com/sitemap.xml')
    print('\n'.join(parse_sitemap(sitemap)))


if __name__ == '__main__':
    main()
@HQJaTu

HQJaTu commented Jul 11, 2018

The above code doesn't account for a <sitemap><loc> URL that has query arguments in it.

I created an improved version at https://gist.github.com/HQJaTu/cd66cf659b8ee633685b43c5e7e92f05 to address that issue. The obvious solution is to parse the URL first and check only its path part.
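
For illustration, the path-only check described above could look roughly like the sketch below. This is not the code from the linked gist; it assumes Python 3 (for `urllib.parse`) and reuses the `is_sub_sitemap` name from the original script. Unlike the original, it looks for "sitemap" only in the path, so a hostname containing that word no longer counts.

```python
from urllib.parse import urlparse


def is_sub_sitemap(url):
    """Return True if the URL's path looks like a sitemap file.

    The query string and fragment are ignored, so a nested sitemap such as
    /sitemap.xml?page=2 is still recognised.
    """
    path = urlparse(url).path
    return path.endswith('.xml') and 'sitemap' in path


if __name__ == '__main__':
    # Example URLs are made up for the demonstration.
    for example in ('https://example.com/sitemap.xml?page=2',
                    'https://example.com/about'):
        print(example, is_sub_sitemap(example))
```

Dropping this function in place of the original `is_sub_sitemap` should leave the rest of the script unchanged, since `parse_sitemap` only relies on the boolean result.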

@dvir-cdsoft

Hi,
sorry for the question, I'm new to Python.
Where does the code dump the sitemap?
Do I need to add code to write it to a file?
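
As written, the script only prints the URLs to standard output, so nothing is saved unless the output is redirected in the shell or the script is changed. A minimal sketch of the second option follows, assuming Python 3; the `write_urls` helper and the `sitemap_urls.txt` filename are hypothetical and not part of the original gist.

```python
def write_urls(urls, path):
    """Write one URL per line to `path` and return how many were written."""
    with open(path, 'w', encoding='utf-8') as handle:
        for url in urls:
            handle.write(url + '\n')
    return len(urls)


if __name__ == '__main__':
    # In the gist's main(), the list would come from parse_sitemap(sitemap);
    # a hard-coded list keeps this sketch self-contained.
    count = write_urls(['https://example.com/a', 'https://example.com/b'],
                       'sitemap_urls.txt')
    print('Wrote %d URLs' % count)
```

In the original script this would mean replacing the `print` call in `main()` with something like `write_urls(parse_sitemap(sitemap), 'sitemap_urls.txt')`.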
