Really simple wget spider to obtain a list of URLs on a website, by crawling n levels deep from a starting page.
#!/bin/bash
# Crawl up to $DEPTH levels from $HOME, restricted to $DOMAINS, and write the
# discovered URLs (minus common static assets) to $OUTPUT.
HOME="http://www.yourdomain.com/some/page"
DOMAINS="yourdomain.com"
DEPTH=2
OUTPUT="./urls.csv"
wget -r --spider --delete-after --force-html -D "$DOMAINS" -l "$DEPTH" "$HOME" 2>&1 \
| grep '^--' | awk '{ print $3 }' | grep -v '\.\(css\|js\|png\|gif\|jpg\)$' | sort | uniq > "$OUTPUT"
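
A note on how the pipeline extracts URLs (an assumption about typical wget output, not stated in the gist): in recursive spider mode, wget logs each request on a line that starts with two dashes and a timestamp, so grep '^--' keeps those lines and awk '{ print $3 }' picks out the URL, which is the third whitespace-separated field.

# Typical wget log line (format assumed from common wget versions):
# --2017-10-10 12:00:00--  http://www.yourdomain.com/some/page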
@obriat commented Oct 10, 2017

There is an unwanted space in the grep -v between the period and the extensions.

@Ontario7 commented Dec 25, 2017

This grep expression does not match URLs that carry query parameters, like these:
/includes/js/jquery.form.min.js?ver=3.51.0-2014.06.20
/wpml-cms-nav/res/css/navigation.css?ver=1.4.21
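
One way to handle such URLs (a sketch, not from the original gist) is to allow an optional query string after the extension in the filter:

wget -r --spider --delete-after --force-html -D "$DOMAINS" -l "$DEPTH" "$HOME" 2>&1 \
| grep '^--' | awk '{ print $3 }' | grep -vE '\.(css|js|jpg|jpeg|png|gif)(\?|$)' | sort | uniq > "$OUTPUT"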

Alternatively, you can use wget's --reject parameter:
--reject=css,js,jpg,jpeg,png,gif
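
A sketch of how that suggestion could slot into the original command (variable names taken from the gist; whether rejected asset URLs still appear in the spider log can depend on the wget version):

wget -r --spider --delete-after --force-html -D "$DOMAINS" -l "$DEPTH" \
  --reject=css,js,jpg,jpeg,png,gif "$HOME" 2>&1 \
| grep '^--' | awk '{ print $3 }' | sort | uniq > "$OUTPUT"

Note that --reject matches file-name suffixes, so URLs with query strings may still slip through; newer wget versions also offer --reject-regex for pattern-based filtering.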
