@azhawkes created January 13, 2014 18:00
Really simple wget spider to obtain a list of URLs on a website by crawling n levels deep from a starting page.
#!/bin/bash
# Page to start crawling from
HOME="http://www.yourdomain.com/some/page"
# Comma-separated list of domains wget is allowed to follow
DOMAINS="yourdomain.com"
# How many levels deep to crawl
DEPTH=2
# File to write the resulting URL list to
OUTPUT="./urls.csv"
# Spider the site without keeping any files, then extract each requested URL
# from wget's log, drop common asset types, and de-duplicate the list.
wget -r --spider --delete-after --force-html -D "$DOMAINS" -l $DEPTH "$HOME" 2>&1 \
| grep '^--' | awk '{ print $3 }' | grep -v '\. \(css\|js\|png\|gif\|jpg\)$' | sort | uniq > "$OUTPUT"
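
The extraction step relies on wget's log format: each request is logged on a line that begins with a timestamp wrapped in dashes, with the URL as the third whitespace-separated field. An illustrative log line:

--2014-01-13 18:00:00--  http://www.yourdomain.com/some/page

grep '^--' keeps those request lines and awk '{ print $3 }' pulls out the URL from each one.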
@obriat commented Oct 10, 2017

There is an unwanted space in the grep -v pattern between the period and the extensions.
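
For reference, removing that space gives the intended filter stage (a minimal sketch of the corrected pipe segment):

grep -v '\.\(css\|js\|png\|gif\|jpg\)$'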

@Ontario7 commented Dec 25, 2017

This grep expression does not match URLs with query parameters, like these:
/includes/js/jquery.form.min.js?ver=3.51.0-2014.06.20
/wpml-cms-nav/res/css/navigation.css?ver=1.4.21
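
One possible adjustment, sketched here and untested, is to switch to an extended regex that also tolerates a trailing query string:

grep -Ev '\.(css|js|png|gif|jpg)(\?.*)?$'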

Alternatively, you can use this wget parameter:
--reject=css,js,jpg,jpeg,png,gif
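
A hedged sketch of the crawl with that --reject flag in place of the grep -v filter, reusing the variables from the script above:

wget -r --spider --delete-after --force-html -D "$DOMAINS" -l $DEPTH --reject=css,js,jpg,jpeg,png,gif "$HOME" 2>&1 \
| grep '^--' | awk '{ print $3 }' | sort | uniq > "$OUTPUT"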
