
@simonw
Created December 9, 2016 06:38
Recursive wget ignoring robots
$ wget -e robots=off -r -np 'http://example.com/folder/'
  • -e robots=off makes wget ignore robots.txt for that domain
  • -r enables recursive retrieval
  • -np (no parent) stops it from following links up into the parent directory
@wodim

wodim commented Feb 17, 2019

Remember that the -r option has a default maximum depth of 5. I think --mirror is, overall, a better choice.
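For reference, here is a sketch of the --mirror variant; per the wget manual, --mirror is currently shorthand for -r -N -l inf --no-remove-listing, i.e. recursion with no depth limit plus timestamping (URL is the example one from above):

```shell
# --mirror expands to: -r -N -l inf --no-remove-listing
# (infinite-depth recursion, timestamp checks, keep FTP listings)
wget -e robots=off --mirror -np 'http://example.com/folder/'
```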


ghost commented May 25, 2019

-r -l 0 removes the maximum depth limit.
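Combined with the original command, that looks like the following; the wget manual notes that -l 0 and -l inf are equivalent, both meaning unlimited depth:

```shell
# -l 0 (equivalently -l inf) lifts the default recursion depth of 5
wget -e robots=off -r -l 0 -np 'http://example.com/folder/'
```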

@taoyichen

wget -e robots=off -r -np --page-requisites --convert-links
For full websites: --page-requisites also downloads the CSS, images, and scripts each page needs, and --convert-links rewrites links so the copy browses correctly offline.

@tumelo-mapheto

Thanks for sharing.

@fsiler

fsiler commented Feb 28, 2021

I'm still getting "no-follow attribute found in $URL. Will not follow any links on this page." after running wget -e robots=off -r -np --page-requisites --convert-links $SITE. Is this a bug?

@NilsIrl

NilsIrl commented Apr 16, 2021

> I'm still getting no-follow attribute found in $URL. Will not follow any links on this page after using wget -e robots=off -r -np --page-requisites --convert-links $SITE. Is this a bug?

Yes, this is a bug, it should be fixed in the next version of wget: https://git.savannah.gnu.org/cgit/wget.git/commit/?id=f1cccd2c454fb416e75a22b358b0a11266642007

See https://www.reddit.com/r/DataHoarder/comments/mprq89/wget_respects_nofollow_attribute_despite_e/guct2s5/ for more details

@thewhitegrizzli

not fixed

@jimsy3

jimsy3 commented Dec 6, 2023

what is the recursive thing?
