wget a single Omeka Classic exhibit not hosted on Omeka.net
# Caveats
# I'm no shell expert
# This is in Bash on a Mac running Mojave
# You may need to read the whole thing before using
#
# Explanation
# (hopefully intelligible to the advanced beginner or low intermediate Bash user)
#
# * Backslashes at the ends of lines tell the shell to treat the next line as part of the same command
#
# * --include-directories=document,application,exhibits,plugins,themes,items,files,file
# Since we only want a subdirectory of the site, we have to tell wget to also fetch the directories that aren't strictly page dependencies but are still required to re-display the exhibit. The ones listed here are the ones I found I needed.
#
# * --reject-regex
# I don't want all the XML, RSS, and JSON versions of each page, so this regex tells wget to ignore links with 'output=' in them.
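# For example, Omeka Classic typically links pages to alternate output formats with URLs like
# [target URI]?output=omeka-xml or ?output=atom; this pattern skips all of those.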
#
# * --page-requisites
# Get the things the page needs to display, e.g. CSS, JS. This param is partly why you include specific directories.
#
# * --convert-links
# Turn links in page markup into relative links so you don't need to think about whether you are re-displaying this exhibit as a root directory or in a subdirectory.
#
# * --span-hosts
# Some required files live outside your domain, like Google Fonts or mapping libraries.
#
# * --domains=[comma-separated list]
# However, you don't want to scrape the internet, so tell wget which domains you want it to go to. I put in the FQDN of the exhibit I'm scraping just to be sure.
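# If you're not sure which external hosts an exhibit pulls from, one rough way to check
# (this assumes curl, which ships with Mojave) is to list the absolute URLs on the exhibit's
# front page and add any hosts you recognize to this list:
#   curl -s [target URI] | grep -oE 'https?://[^/"]+' | sort -u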
#
# * --adjust-extension
# Add the appropriate extension (e.g. .html) to files that are served without one.
#
# * --recursive
# Follow all the links in the exhibit.
#
# * --timestamping
# Only download files that are newer than the copies you already have locally, replacing the old ones. Most useful if you are executing this a second time, but include it the first time for completeness. Also useful if you have to stop the download and resume it because it's taking too long.
#
# * --wait 1
# Pause 1 second between each request. You can adjust the number if the server you're talking to gets twitchy about how frequently you are requesting files.
#
# * --directory-prefix
# Though the word used is 'prefix', this is a way of telling wget what directory to download into.
#
# * --execute robots=off
# Ignore robots.txt files. I have this as baseline just so I don't need to care whether the exhibit has a blocking robots file.
#
# * --output-file=[logfile name].log
# Keep a log of the scrape so you can look at it during (tail -f logfile-name.log) or after.
#
# * [target URI]
# The URI of the exhibit you want. Usually something like example.com/exhibits/show/exhibit-slug
#
# * 2>/dev/null
# Send error reports to a bottomless pit. You may prefer to see the errors as they occur. I find that figuring them out by reading the logfile or running a linkchecker against the wgetted sites works better for me.
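# If you do discard the errors, one way to check for failures afterwards (a rough sketch, not
# the only way) is to pull the error lines out of the log with a little context:
#   grep -B4 'ERROR' [logfile name].log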
#
#
wget \
--include-directories=document,application,exhibits,plugins,themes,items,files,file \
--reject-regex='output=' \
--page-requisites \
--convert-links \
--span-hosts \
--domains=[target FQDN],googleapis.com,maps.google.com,maxcdn.bootstrapcdn.com \
--adjust-extension \
--recursive \
--timestamping \
--wait 1 \
--directory-prefix=[directory name] \
--execute robots=off \
--output-file=[logfile name].log \
[target URI] \
2>/dev/null
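# To confirm the exhibit re-displays, you can serve the downloaded copy locally once the
# scrape finishes. A minimal sketch, assuming Python 3 is installed and [directory name]
# matches --directory-prefix above:
#   cd [directory name] && python3 -m http.server 8000
# Then browse to http://localhost:8000/[target FQDN]/exhibits/show/[exhibit slug]
# (wget nests everything under a directory named after the host unless you also pass
# --no-host-directories).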