@mikeclarke
Created December 12, 2010 22:21
Script that fuzzy-matches the URLs in an old sitemap.xml against those in a new sitemap.xml and prints 301 RewriteRule redirects
import re
from xml.dom import minidom
from difflib import get_close_matches

# The old sitemap.xml file
sourceXML = minidom.parse('sitemap-old.xml')
# sitemap.xml from the new site
targetXML = minidom.parse('sitemap-new.xml')

sourceNodes = sourceXML.getElementsByTagName('loc')
targetNodes = targetXML.getElementsByTagName('loc')

# Collect every URL listed in the new sitemap
targetURLs = []
for u in targetNodes:
    targetURLs.append(u.firstChild.data)

# Map each old URL to its closest match among the new URLs
matches = {}
for u in sourceNodes:
    m = u.firstChild.data
    match = get_close_matches(m, targetURLs)
    if match:
        matches[m] = match[0]

for key, value in matches.items():
    # Parse out the request, removing the 'http://domain.com' portion of the URL
    parsed = re.search(r'^https?://([A-Za-z0-9_.-]*)/(.*)', key)
    if parsed is None:
        # URL has no path after the host (e.g. 'http://domain.com'), so skip it
        continue
    source_url = parsed.group(2)
    destination_url = value
    print("RewriteRule ^/%s$ %s [R=301]\n" % (source_url, destination_url))
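
The matching step can also be tried without any sitemap files. The block below is a minimal sketch rather than part of the original script: `build_rewrite_rules`, its `cutoff` parameter and the `example.com` URLs are hypothetical names chosen for illustration. It compares URL paths instead of full URLs, on the assumption that the shared scheme and host would otherwise raise every similarity score, and it escapes the old path with `re.escape` so that characters such as `.` are matched literally in the rule.

```python
import re
from difflib import get_close_matches
from urllib.parse import urlsplit


def build_rewrite_rules(old_urls, new_urls, cutoff=0.6):
    """Return one RewriteRule line per old URL whose path has a close match.

    Hypothetical helper: the name and the cutoff default are illustrative.
    """
    # Index the new URLs by path so that matching ignores scheme and host
    new_by_path = {urlsplit(u).path: u for u in new_urls}
    rules = []
    for old in old_urls:
        old_path = urlsplit(old).path
        # n=1 keeps only the best candidate; cutoff is difflib's similarity threshold
        best = get_close_matches(old_path, list(new_by_path), n=1, cutoff=cutoff)
        if not best:
            continue
        # Escape regex metacharacters so the old path is matched literally
        pattern = re.escape(old_path.lstrip('/'))
        rules.append("RewriteRule ^/%s$ %s [R=301]" % (pattern, new_by_path[best[0]]))
    return rules


# Made-up URLs for illustration only
old = ["http://example.com/aboutus/", "http://example.com/zzz"]
new = ["http://example.com/about/", "http://example.com/contact/"]
rules = build_rewrite_rules(old, new)
for rule in rules:
    print(rule)
```

Raising `cutoff` toward 1.0 should leave fewer but safer redirects, and old URLs that produce no rule are the ones to review by hand.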