Notes/instructions for how I've been mirroring sites to
Mirroring sites to
1) Use `wget` to pull down a copy of the site. If pulling down a single file & all its prerequisites, use the following:
/usr/local/bin/wget -p --mirror -k -t 30 -w 5 -e robots=off -o$(date +%Y-%m-%d-%H%M).log "" &
Alternatively, if pulling down an entire directory, use the following (Note: _make sure_ you include the trailing slash on the directory name!):
/usr/local/bin/wget --mirror -k -t 30 -w 5 -e robots=off -o$(date +%Y-%m-%d-%H%M).log "" &
Then tail the log to watch it go:
tail -F
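Because the log name is generated from the clock at launch, it can help to capture it in a variable so that `wget` and `tail -F` are guaranteed to point at the same file. The following is only a sketch: `mirror_and_watch` and its `site_url` argument are hypothetical names introduced here for illustration, and it assumes `wget` is on the `PATH` rather than at `/usr/local/bin/wget`.

```shell
#!/bin/sh
# Build the timestamped log name once so wget and tail agree on it.
log="$(date +%Y-%m-%d-%H%M).log"

# Hypothetical wrapper (not part of the original notes): mirror one URL in the
# background with the same options as above, then follow the log.
mirror_and_watch() {
    site_url="$1"    # placeholder argument; pass the real URL here
    wget --mirror -k -t 30 -w 5 -e robots=off -o "$log" "$site_url" &
    tail -F "$log"
}

echo "log file will be: $log"
```

Calling `mirror_and_watch` with the URL to mirror starts the download and follows the log until you press Ctrl-C; the `wget` job keeps running in the background.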
2) Fix `.nav.homepage.js`: it will have absolute paths that need to be converted to relative ones. In `vim`, you can use the following to do so (in this example for
3) Browse the site for any broken links and fetch those pages again using `wget` (remember to add the `-p` page-requisites option to those in wget_instructions.txt)
4) Any albums should have inline JavaScript lines that contain `new Slide('`. You will need to fetch the files they reference manually (`curl -O` should be adequate for this) and then replace the absolute paths in the `new Slide()` calls with correct paths. (`wget` doesn't parse JavaScript, which is why this step is manual and required.) Unfortunately, the way the slideshow code works, they still have to be absolute paths; fortunately, you can just strip off the 'http:/' so they become `new Slide('/` and that'll work.
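The 'http:/' stripping can also be done with `sed` rather than by hand. This is a minimal sketch run against a throwaway file: the file name `album.html` and the `example.invalid` URL are made up for illustration, and it writes through a temporary file instead of `sed -i` so that it should behave the same with BSD and GNU `sed`.

```shell
#!/bin/sh
# Demo on a throwaway file; the page content below is invented for illustration.
workdir="$(mktemp -d)"
page="$workdir/album.html"
printf '%s\n' "new Slide('http://example.invalid/photos/1.jpg');" > "$page"

# Drop the leading "http:/" inside new Slide(' calls only, leaving other links alone.
# A temp file is used rather than sed -i, whose syntax differs between BSD and GNU sed.
sed "s|new Slide('http:/|new Slide('|g" "$page" > "$page.tmp" && mv "$page.tmp" "$page"

cat "$page"
```

Anchoring the pattern on `new Slide('` keeps the substitution from touching ordinary `href` or `src` attributes that happen to be in the same file.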
5) Search all pages for any absolute links back to and try to fix them:
grep -i -R '"http://homepage\.mac\.com/' * | more
Sometimes they'll even get duplicated so you end up with something like "g3head.1". If you've found such a case and you _absolutely know_ that all occurrences in all files can be replaced, you can do something like the following (after deleting the "g3head" folder, in this example):
find . -type f -print0 | xargs -0 sed -i '' 's/\/g3head\.1/\/index.html/g'
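Since that `sed` rewrites every file in place, it may be worth previewing which files would change first. The sketch below does this on a throwaway tree; the file names and HTML snippets are invented for illustration, and only the `/g3head.1` string comes from the example above.

```shell
#!/bin/sh
# Throwaway tree with made-up content, so nothing real is touched.
workdir="$(mktemp -d)"
printf '%s\n' '<a href="/g3head.1">G3</a>' > "$workdir/a.html"
printf '%s\n' '<p>no links here</p>' > "$workdir/b.html"

# Preview: which files mention the duplicated name? -F treats the pattern literally.
matches="$(grep -R -l -F '/g3head.1' "$workdir")"
echo "$matches"

# Count matching lines per file before committing to the in-place sed.
grep -R -c -F '/g3head.1' "$workdir"
```

If the list contains a file you did not expect, that is a sign the replacement is not safe to apply everywhere.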
6) Watch out for directories that have been linked to without the trailing slash: `wget` will create a file instead of a directory with an index file. You'll have to fix those! The following `find` command lists regular files that don't have common extensions and should help with the hunt:
find . -type f ! -iname "*.txt" ! -iname "*.html" ! -iname "*.htm" ! -iname "*.css" ! -iname "*.js" ! -iname "*.gif" ! -iname "*.png" ! -iname "*.jpeg" ! -iname "*.jpg"
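Once such a file is found, one way to fix it is to move it aside, create the directory, and put the file back as the index. This is a sketch only: `file_to_dir` is a hypothetical helper, `photos` is a made-up name, and it assumes the server's index file is called `index.html`.

```shell
#!/bin/sh
# Hypothetical helper: turn a file that should have been a directory into
# <name>/index.html, keeping its content. Assumes index.html is the index name.
file_to_dir() {
    target="$1"
    [ -f "$target" ] || return 1          # only act on regular files
    mv "$target" "$target.tmp-index" &&
    mkdir "$target" &&
    mv "$target.tmp-index" "$target/index.html"
}

# Demo on a throwaway tree; "photos" is an invented example name.
workdir="$(mktemp -d)"
printf '%s\n' '<html>album</html>' > "$workdir/photos"
file_to_dir "$workdir/photos"
ls "$workdir/photos"
```

After the conversion, you will probably still want to fetch the directory's real contents again with `wget`, this time with the trailing slash.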
7) FileSharing pages tend to redirect to a file in /WebObjects/woa/, which `wget` will have downloaded but which can't be loaded by Apache. In these cases, move that file to replace the FileSharing page (since it's just a redirect anyway). Once moved, fix the absolute URLs, especially the ones pointing to "/i/hpti/". For those, you can use the following commands in `vim` to replace the usual occurrences:
8) Browse the site, specifically watching for assets that might still be loading from or linking to it.