Crawl website, get list of all URLs - one command line
wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | grep '^--' | awk '{print $3}' | sort >urls.txt
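
For reference, the same pipeline broken out stage by stage (the hostname is a placeholder; substitute your own site):

wget --mirror --delete-after --no-directories http://your.website.com 2>&1 |  # wget logs to stderr; merge it into stdout
  grep '^--' |        # keep only the request lines ("--timestamp--  URL")
  awk '{print $3}' |  # the URL is the third whitespace-separated field
  sort > urls.txt     # sorted list of every URL wget fetched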
@softplus softplus commented Apr 17, 2021

A variation that includes the HTTP status codes (from https://twitter.com/Errioxa/status/1381484805766909958):

wget --mirror --delete-after --no-directories http://your.website.com 2>&1 | egrep '(^(--)|\s[0-9]{3,3}\s)' | awk '{print $3"\t"$6}' | egrep '^http|.*[0-9]{3,3}$' | xargs -n2 -d'\n'
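
If you already have urls.txt from the first command, a rough alternative sketch (mine, not part of the gist) is to re-check each URL with curl, assuming curl is available and urls.txt has one URL per line:

while read -r url; do
  # -s silences progress output, -o /dev/null discards the body,
  # -w '%{http_code}' prints just the HTTP status code
  code=$(curl -s -o /dev/null -w '%{http_code}' "$url")
  printf '%s\t%s\n' "$url" "$code"
done < urls.txt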

@softplus softplus commented Apr 17, 2021

Get just the path part of each URL:
cat urls.txt | sed "s/^.*\/\/[^\/]*\//\//"
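
For example, with a made-up URL, the expression strips the scheme and hostname and keeps everything from the first path slash on:

echo "http://your.website.com/blog/post?a=1" | sed "s/^.*\/\/[^\/]*\//\//"
# prints: /blog/post?a=1

Piping the result through sort -u gives a deduplicated list of paths.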
