@Asparagirl
Last active November 25, 2018 21:24
Grab a website with wpull and PhantomJS

export USER_AGENT="Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27"
export DOMAIN_NAME_TO_SAVE="http://www.example.com/"
export DOMAINS_TO_INCLUDE="example.com,images.example.com,relatedwebsite.com"
# optional: patterns for URLs to skip (regex is allowed); if you leave it out, also remove --reject-regex from the wpull command
export THINGS_TO_IGNORE="ignore-this,other-thing-to-ignore"
export WARC_NAME="Example.com_-_2014-10-15"
# these two are needed in case wpull quits or chokes and we need to restart where we left off
export DB_NAME="example.db"
export LOG_NAME="example.log"

If you're grabbing a forum, this is a good starter list of items to ignore:

export THINGS_TO_IGNORE="/cron\\.php\\?,/external\\.php\\?type=rss,/login\\.php\\?,/newreply\\.php\\?,/private\\.php\\?,/privmsg\\.php\\?,/register\\.php\\?,/sendmessage\\.php\\?,/subscription\\.php\\?,/posting\\.php\\?,/viewtopic\\.php\\?.+&view=(next|previous),/viewtopic\\.php\\?.+&hilit=,/feed\\.php\\?,/index\\.php\\?option=com_mailto,&view=login&return=,&format=opensearch,/misc\\.php\\?do=whoposted,/newthread\\.php\\?,/post_thanks\\.php\\?,/blog_post\\.php\\?do=newblog,/forumdisplay\\.php\\?do=markread,/userpoll/vote\\.php\\?,/showthread\\.php.*[\\?&]goto=(next(old|new)est|newpost),/editpost\\.php\\?,/\\?view=getlastpost$,/index\\.php\\?sharelink=,/ucp\\.php\\?mode=delete_cookies"

(List taken from https://github.com/ArchiveTeam/ArchiveBot/blob/578c8c9e6374705926f9c57d4e107230b01c53e3/db/ignore_patterns/forums.json )
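
wget documents --reject-regex as taking one regular expression, and wpull's option of the same name may behave the same way, in which case a comma-separated list would be read as a single pattern containing literal commas. If the ignore list does not seem to take effect, a minimal sketch like the one below joins the items into one alternation. The function name join_patterns and the variable THINGS_TO_IGNORE_REGEX are made up for this example, and the sketch assumes no individual pattern contains a comma (true for the lists shown here).

```shell
# join_patterns is a hypothetical helper, not part of wpull.
# It turns "a,b,c" into "(a)|(b)|(c)" so the list can be passed as ONE regex.
# Assumption: no individual pattern contains a comma.
join_patterns() {
  printf '%s' "$1" | sed -e 's/,/)|(/g' -e 's/^/(/' -e 's/$/)/'
}

THINGS_TO_IGNORE="ignore-this,other-thing-to-ignore"
THINGS_TO_IGNORE_REGEX="$(join_patterns "$THINGS_TO_IGNORE")"
export THINGS_TO_IGNORE_REGEX

echo "$THINGS_TO_IGNORE_REGEX"
# prints: (ignore-this)|(other-thing-to-ignore)
```

To try it, pass "$THINGS_TO_IGNORE_REGEX" to --reject-regex instead of "$THINGS_TO_IGNORE" and check the log to confirm that the unwanted URLs are being skipped.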

*** IMPORTANT: If you continue a grab that previously quit or stalled, change the name of the WARC, or it will be overwritten!
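
One way to avoid reusing a WARC name by accident is to put a timestamp in it, so every restart gets a fresh name. This is only a sketch: SITE_LABEL is a made-up variable for this example, and the date format is a matter of taste. Keep DB_NAME and LOG_NAME unchanged between runs, since those are what let the grab pick up where it left off.

```shell
# SITE_LABEL is a hypothetical variable used only in this sketch.
SITE_LABEL="Example.com"

# Build a name like Example.com_-_2014-10-15_134502 from the current time,
# so a restarted grab never writes over the WARC from the previous run.
WARC_NAME="${SITE_LABEL}_-_$(date +%Y-%m-%d_%H%M%S)"
export WARC_NAME

echo "$WARC_NAME"
```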

wpull "$DOMAIN_NAME_TO_SAVE" \
  --warc-file "$WARC_NAME" \
  --no-check-certificate --no-robots \
  --save-headers --save-cookies "cookies.txt" --keep-session-cookies \
  --user-agent "$USER_AGENT" \
  --wait 1 --random-wait --waitretry 600 \
  --page-requisites --recursive --level 20 --sitemaps \
  --span-hosts --domains "$DOMAINS_TO_INCLUDE" \
  --reject-regex "$THINGS_TO_IGNORE" \
  --retry-connrefused --retry-dns-error \
  --delete-after \
  --database "$DB_NAME" \
  --verbose --output-file "$LOG_NAME" \
  --warc-max-size "1000000000" \
  --phantomjs --phantomjs-scroll "25" --phantomjs-wait "3"

If you don't want to run with PhantomJS, which can be buggy and slows things down, remove these options from the command:

--phantomjs --phantomjs-scroll "25" --phantomjs-wait "3"
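
If you switch PhantomJS on and off often, a small toggle saves editing the command each time. The sketch below is one possible approach, not something wpull provides: USE_PHANTOMJS and phantomjs_args are names invented for this example, and the option values are the same ones used above.

```shell
# USE_PHANTOMJS and phantomjs_args are hypothetical names for this sketch.
# Prints the PhantomJS options when USE_PHANTOMJS is "yes" (the default),
# and prints nothing otherwise.
phantomjs_args() {
  if [ "${USE_PHANTOMJS:-yes}" = "yes" ]; then
    printf '%s' '--phantomjs --phantomjs-scroll 25 --phantomjs-wait 3'
  fi
}

USE_PHANTOMJS="no"
echo "PhantomJS options: [$(phantomjs_args)]"
# prints: PhantomJS options: []
```

To use it, replace the three PhantomJS options at the end of the wpull command with an unquoted $(phantomjs_args), so the shell splits the output back into separate arguments.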