Archiving Websites

Every so often, you may find yourself needing to preserve a website in its current state. Whether you are preparing for a significant website change, keeping general documentation, or facing the possibility that the site will be taken offline, it is a good idea to archive the site so that it can be browsed locally without the need for a server.

Introducing HTTrack: HTTrack Website Copier does just that. It downloads a website to a local directory, builds all the directories, and copies the HTML, images, and other files from the server to your computer.

HTTrack has a GUI for Windows that works really well: http://www.httrack.com. You can also use this tool from the command line. See the following steps to archive websites using the HTTrack command-line tools.

macOS Installation: First, you will need to install HTTrack locally. On macOS, Homebrew (https://brew.sh/) is the package manager of choice (as opposed to MacPorts). Homebrew is simpler to set up but does require Xcode (or at least the Xcode Command Line Tools) to be installed. Be sure both Xcode and Homebrew are up to date before attempting to install the HTTrack package. Once Xcode and Homebrew have been updated, install the HTTrack package with Homebrew:

$ brew install httrack
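
Once the install finishes, a quick sanity check confirms the binary is on your PATH (if --version isn’t recognized by your build, httrack --help works just as well):

$ which httrack
$ httrack --version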

There may be some trial and error to ensure all images, files, and dependencies are included in the archive. All in all, it is a pretty slick tool. One thing I’ve noticed is that all image references in CSS files needed to be quoted in order for HTTrack to properly archive those files; other than that, everything about HTTrack just worked. So be sure all image paths in CSS files are quoted before doing your first scrape and you should be good to go (a quick way to check for unquoted paths is sketched below).
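
As a rough check, the following grep lists CSS lines where url( is not immediately followed by a quote (the assets/css/ path is just an example; point it at wherever your stylesheets live):

$ grep -rnE "url\([^\"']" assets/css/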

Archiving:

  1. Navigate to the folder where you want to build your archive.
  2. Use the httrack command, the domain name, and any parameters you need.
     a. This command works well for most situations (a fuller example with common options is sketched after this list):
$ httrack domain-name.tld -n
  3. This should scrape everything into that directory and build a “flat” HTML version of the website.
  4. Make adjustments to the index.html files (see the readme file in any one of the previous archives on the Staging server).
  5. Create a new readme file with anything worth noting.
  6. If this is a useful archive for the Museum, consider moving it to the location where we are archiving websites. I store mine on the Share > Web and Digital.
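
For reference, here is a sketch of a more explicit invocation; the output directory and filter below are placeholders, so adjust them for your site and consult httrack --help or the user guide for the full option list:

$ httrack "https://domain-name.tld/" -O "./domain-name-archive" -n "+*.domain-name.tld/*" -%v

Here -O sets where the mirror is written, -n also grabs non-HTML files referenced near the pages, the + filter keeps the crawl on the site’s own subdomains, and -%v prints filenames as they download.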

Note that line #2 in the hts-log.txt file explains what parameters were used in the scrape. This information, as well as the other cache files, is useful in recording what has been done.
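
To pull that line out quickly from the archive directory (assuming the default hts-log.txt name), something like this works:

$ sed -n '2p' hts-log.txt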

Also, use the HTTrack forum and the user guide to tweak the scrape as needed.

Notes: Externally hosted videos cannot be downloaded. HTTrack grabs the SWF embed but not the video itself; the archive will still point to the video on whatever service is hosting it. See notes on the HTTrack forum: https://forum.httrack.com/readmsg/30291/30290/index.html

Tools:
