@pix0r
Last active December 24, 2022 00:21
Use wget to scrape all URLs from a sitemap.xml.
Usage: scrape-sitemap.sh http://domain.com/sitemap.xml
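#!/bin/sh
# Request every URL listed in a sitemap.xml and discard the responses.
SITEMAP=$1
if [ -z "$SITEMAP" ]; then
    echo "Usage: $0 http://domain.com/sitemap.xml" >&2
    exit 1
fi
# Download the sitemap to stdout; quoting keeps URLs with special characters intact.
XML=$(wget -O - --quiet "$SITEMAP")
# Flatten line breaks, pull out each <loc> element, strip the tags and padding: one URL per line.
URLS=$(printf '%s\n' "$XML" | tr '\r\n\t' '   ' | grep -Eo '<loc>[^<>]*</loc>' | sed -e 's:</*loc>::g' -e 's/^ *//' -e 's/ *$//')
# Fetch each URL, waiting a randomized interval of about one second between requests.
printf '%s\n' "$URLS" | wget -O /dev/null -i - --wait=1 --random-wait -nv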
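The `<loc>` extraction step can be tried offline, without hitting a real site. The following is a minimal sketch, assuming a sitemap with plain `<loc>` elements; the helper name `extract_locs` and the example.com sample data are made up for illustration and are not part of the original script. It uses a variant of the grep/sed pipeline that flattens line breaks first, so a `<loc>` element spread over several lines is still matched.

```shell
#!/bin/sh
# Hypothetical helper (not in the original gist): reads sitemap XML on stdin
# and prints the contents of every <loc> element, one URL per line.
extract_locs() {
    # Turn CR, LF and TAB into spaces so multi-line <loc> elements match.
    tr '\r\n\t' '   ' |
        grep -Eo '<loc>[^<>]*</loc>' |
        # Strip the tags, then any leading or trailing padding.
        sed -e 's:</*loc>::g' -e 's/^ *//' -e 's/ *$//'
}

# Made-up sample sitemap, for illustration only.
SAMPLE='<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
  <url>
    <loc>
      http://example.com/about
    </loc>
  </url>
</urlset>'

printf '%s\n' "$SAMPLE" | extract_locs
```

Piping the function's output into `wget -O /dev/null -i -` reproduces the final step of the script. Note that this sketch does not decode XML entities such as `&amp;` inside URLs.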
@hanchiang

Thanks a lot for this bash script! Saved me a lot of head banging 💯

@FlusherDock1

Thanks for the script!

@plittlefield

Wow, just what I was looking for, thanks!

@rabihmb

rabihmb commented Sep 15, 2020

Is there a way to do the reverse? Can I build a sitemap for a website?
