@maesa
Last active April 14, 2017 02:52
Create sitemap (ignoring NOINDEX,NOFOLLOW) using wget + bash
#!/usr/bin/env bash
# Require the site URL as the first argument; print usage and fail otherwise.
if [ -z "$1" ]; then
    echo "Created by Barkeep (http://www.lostsaloon.com/technology/how-to-create-an-xml-sitemap-using-wget-and-shell-script/)"
    echo "Usage: $0 http://webtobecrawled.com"
    exit 1
fi
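sitedomain="$1"
# Crawl the whole site in spider mode (pages are not kept) and log every visited URL to linklist.txt.
wget --spider --recursive --level=inf --no-verbose --output-file=linklist.txt "$sitedomain"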
# Take the first whitespace-delimited token after "URL:" on each log line, then de-duplicate.
awk -F 'URL:' 'NF > 1 { split($2, f, " "); if (f[1] != "") print f[1] }' linklist.txt | sort -u > sortedurls.txt
header='<?xml version="1.0" encoding="UTF-8"?><urlset
xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
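# Quote the variable so the header is written exactly as defined, without word splitting or globbing.
echo "$header" > sitemap.xml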
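# Keep only directory-style and HTML URLs; everything else (images, CSS, scripts) is skipped.
while IFS= read -r p; do
    case "$p" in
        */ | *.html | *.htm)
            echo "<url><loc>$p</loc></url>" >> sitemap.xml
            ;;
        *)
            ;;
    esac
done < sortedurls.txt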
echo "</urlset>" >> sitemap.xml
# Clean up the intermediate files.
rm -f linklist.txt sortedurls.txt