
@molotovbliss
Created February 7, 2014 15:53
wget crawler
Launch three recursive crawlers in parallel, 4 levels deep, restricted to the domain thesite.com:
wget -r -l4 --spider -D thesite.com http://www.thesite.com &
wget -r -l4 --spider -D thesite.com http://www.thesite.com &
wget -r -l4 --spider -D thesite.com http://www.thesite.com
Alternative to wget: http://aria2.sourceforge.net/
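If you go the aria2 route, a minimal sketch might look like this (urls.txt is a hypothetical file with one URL per line; aria2 downloads rather than spiders, so there is no direct --spider equivalent):

# 8 downloads at once, up to 4 connections per server
aria2c --input-file=urls.txt --max-concurrent-downloads=8 --max-connection-per-server=4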
@a-r-m-i-n
Doesn't this crawl the same pages three times in a row? Or is wget smart enough to split the work across the processes on its own?

@molotovbliss
Author

Got curious about this... The three wget processes above don't coordinate, so each one really does crawl the full site on its own. echo $URL_LIST | xargs -n 1 -P 8 wget -q seems to be the best way to parallelize without the aria2 dependency...
https://stackoverflow.com/questions/7577615/parallel-wget-in-bash
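A rough sketch of that approach, assuming urls.txt already exists with one URL per line (build it however you like, e.g. from a sitemap or a single --spider pass):

URL_LIST=$(cat urls.txt)
# -n 1: one URL per wget invocation; -P 8: up to 8 wget processes in parallel
echo $URL_LIST | xargs -n 1 -P 8 wget -q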
