Created January 13, 2014 18:00
A really simple wget spider to obtain a list of URLs on a website by crawling n levels deep from a starting page.
wget -r --spider --delete-after --force-html -D "$DOMAINS" -l "$DEPTH" "$HOME" 2>&1 \
| grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' | sort | uniq > "$OUTPUT"
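As a quick sanity check, the extension filter (with no stray space after the escaped period) can be exercised on a few sample URLs in isolation; the URLs below are made-up examples, not output from a real crawl:

```shell
# Feed a few hypothetical URLs through the corrected filter stage:
# '\.\(css\|js\|png\|gif\|jpg\)$' — escaped period directly followed by the extension group.
printf '%s\n' \
  'http://example.com/index.html' \
  'http://example.com/style.css' \
  'http://example.com/logo.png' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\)$'
# Only the index.html URL should survive the filter.
```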
obriat commented Oct 10, 2017

There is an unwanted space in the grep -v between the period and the extensions.

Ontario7 commented Dec 25, 2017

this grep expression does not fit URLs with parameters like that:
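The commenter's example URL is missing, but the issue is plausible with the `$`-anchored pattern: an asset URL carrying a query string (say, a hypothetical style.css?v=2) no longer ends at the extension, so it slips past the filter. One sketch of a fix — my suggestion, not necessarily what the commenter had in mind — is to allow either end-of-line or a `?` after the extension, here using grep -E:

```shell
# Filter asset extensions whether or not a query string follows.
# The URLs are made-up examples for illustration.
printf '%s\n' \
  'http://example.com/page.html' \
  'http://example.com/style.css?v=2' \
  'http://example.com/app.js' \
  | grep -vE '\.(css|js|png|gif|jpg)(\?|$)'
# Only the page.html URL should pass through.
```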

Alternatively you can use a parameter for wget:
