@n0samu
Last active July 6, 2024 11:21
Essential tips for web archiving.

In this brief guide, I will share what I've learned about archiving live webpages and recovering deleted webpages using various archive services.

Before you start

I highly recommend installing the Web Archives extension in your web browser (Chrome, Firefox). It provides quick access to archive services and search engine caches.

Overview of archive services & caches

  • Wayback Machine: Best and largest webpage archive. You probably already use it. Wayback Machine is the only archival service that also performs automated crawling, so its coverage is much better than other archives. See usage tips on Wikipedia.
  • Archive.today (also known as Archive.is): A mid-sized service that takes a different approach from Wayback Machine; it executes all page JavaScript, then saves the resulting HTML content as a static page. This approach works better for some JavaScript-heavy pages. See information and usage tips on Wikipedia.
  • Ghostarchive: A smaller archive service that uses the Webrecorder suite to store and replay webpage captures. This works particularly well for complex JavaScript-heavy sites. Ghostarchive also has custom setups for capturing social media posts on X/Twitter, Instagram, and possibly other services. See usage tips on Wikipedia.
  • Megalodon: An even smaller web archiving service based in Japan. Refuses to archive pages on sites that use robots.txt. Not particularly good for archiving complex pages, but useful for archiving pages from some Japanese sites. See usage tips on Wikipedia.
  • Google cache: Accessible using the cache: operator in Google search. For example, search cache:https://example.com to view Google's cache of https://example.com.
  • Bing cache: Accessible from a dropdown menu in Bing search results. Use the url: operator (example: url:https://example.com) to show results only for the exact URL you're looking for. Not every result will have a cache available. Bing cache is also accessible from Yahoo search results, but if a cache is not available from Bing it will not be available from Yahoo either.
  • Yandex cache: Accessible from a dropdown menu in Yandex search results. Unfortunately, searching for URLs on Yandex usually triggers a very time-consuming CAPTCHA, so I rarely bother. Yandex's cache is fairly comprehensive though.
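The two search operators above can also be turned into ready-made search links. The following is a minimal Python sketch; the function names are made up for illustration, and it assumes the `cache:` and `url:` operators are still honored by Google and Bing when you run it.

```python
from urllib.parse import urlencode


def google_cache_search(url: str) -> str:
    """Build a Google search link that uses the cache: operator."""
    return "https://www.google.com/search?" + urlencode({"q": f"cache:{url}"})


def bing_exact_url_search(url: str) -> str:
    """Build a Bing search link that uses the url: operator."""
    return "https://www.bing.com/search?" + urlencode({"q": f"url:{url}"})


if __name__ == "__main__":
    page = "https://example.com"
    print(google_cache_search(page))
    print(bing_exact_url_search(page))
```

Open the printed links in a browser. For Bing, the cached copy (if any) is still reached through the dropdown menu on the search result.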

Recovering a deleted webpage

If you come across a webpage that has been deleted, you can use the Web Archives extension to try to find an existing archived or cached version of it. Just click the extension icon, then click the service that you want to check. Or click "All Search Engines" to check all of them at once! If the page you're looking for immediately redirects or is completely inaccessible, you can still use the extension. Just click the "Tab" dropdown in the top-left corner of the extension popup, then switch to "URL" and paste in the URL you want to look for.

[Screenshot: the Web Archives extension popup]

Generally I check these services in order:

  1. Wayback Machine
  2. Archive.today
  3. Google cache
  4. Bing cache
  5. Ghostarchive (if I'm desperate; unlikely to have any given page due to the small size & obscurity of the service)
  6. Yandex cache (if I'm desperate; access is inconvenient due to CAPTCHA)
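If you need to check many URLs, the first step of this order can be scripted. Here is a rough sketch that uses the Wayback Machine's availability endpoint; the helper names are my own, and the response fields shown (`archived_snapshots`, `closest`, `available`, `url`) reflect the JSON that endpoint returns as far as I know. The other services still need to be checked by hand.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the availability lookup URL for a page."""
    return f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': url})}"


def closest_snapshot(response: dict) -> Optional[str]:
    """Return the snapshot URL from a parsed response, or None if there is none."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_in_wayback(url: str, timeout: float = 30.0) -> Optional[str]:
    """Ask the Wayback Machine whether it holds a capture of the page (needs network access)."""
    with urlopen(availability_query(url), timeout=timeout) as reply:
        return closest_snapshot(json.load(reply))
```

A `None` result only means the endpoint reported no capture; it is still worth checking the remaining services in the list.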

Note that search engine caches are not permanent! If you find a deleted webpage in a search engine cache (Google, Bing, Yandex), you will need to save it to a permanent archive. I recommend Archive.today for this because it automatically detects the original URL of a cached page and indexes the capture under that URL. This allows other users to find the archived page by searching for its original URL.

Here are some extra tips for specific sites:

  • YouTube videos: Use the YouTube Video Finder service. Bing cache also has excellent coverage of YouTube if you just need metadata or proof that a video existed.
  • X/Twitter:
    • Tweets are still indexed under the twitter.com domain by search engine caches. Be sure to change the URL from x.com to twitter.com when searching. Google cache has very comprehensive coverage of Twitter.
    • Wayback Machine is no longer able to archive tweets in replayable form, but some tweets are still saved in raw JSON form. To access JSON captures, change the URL from x.com to twitter.com and search for the first capture. Here is an example.
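
The x.com to twitter.com swap is easy to get wrong by hand when a link carries tracking parameters. Below is a small Python sketch of the rewrite; the function name is hypothetical, and it assumes that subdomains such as www.x.com should be mapped the same way.

```python
from urllib.parse import urlsplit, urlunsplit


def to_twitter_domain(url: str) -> str:
    """Rewrite an x.com link to the twitter.com domain, leaving other links alone."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "x.com" or host.endswith(".x.com"):
        # Keep any subdomain prefix (e.g. "www.") and swap only the base domain.
        # Note: an explicit port, if present, is dropped.
        host = host[: -len("x.com")] + "twitter.com"
        return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return url
```

The rewritten URL is the one to paste into a search engine cache or the Wayback Machine.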

Archiving a live webpage

It's important to proactively save webpages you visit to ensure you can return to them later. Wayback Machine's Save Page Now service usually works well for this, but some sites are trickier; here I will provide tips for dealing with them. Audio and video content often must be downloaded manually. The tool I recommend for this is yt-dlp, but if you're not comfortable using the command line, cobalt is a great alternative.
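
For repeated downloads, yt-dlp can be driven from a short script. This is only a sketch: it assumes yt-dlp is installed and on your PATH, the function names and the "downloads" folder are placeholders, and the choice to also write a metadata file with `--write-info-json` is mine.

```python
import subprocess
from typing import List


def yt_dlp_command(url: str, output_dir: str = "downloads") -> List[str]:
    """Build a yt-dlp command line that saves the media plus a metadata JSON file."""
    return [
        "yt-dlp",
        "--write-info-json",  # keep title, uploader, etc. alongside the media file
        "-o", f"{output_dir}/%(title)s [%(id)s].%(ext)s",  # yt-dlp output template
        url,
    ]


def download(url: str, output_dir: str = "downloads") -> None:
    """Run yt-dlp; raises CalledProcessError if the download fails."""
    subprocess.run(yt_dlp_command(url, output_dir), check=True)
```

The downloaded files can then be uploaded to the Internet Archive as described in the tips below.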

  • Airtable: Use Archive.today to save the page, then download each table as a CSV file using the three-dots or dropdown menu and upload the files to the Internet Archive.
  • Bluesky posts: Use Ghostarchive or Archive.today.
  • eBay auction pages: Use Archive.today. (Other services do not save full-size images.)
  • Facebook posts: Use Archive.today. To save videos, download them with yt-dlp and upload them to the Internet Archive.
  • Google Docs: Change the end of the URL from /edit to /mobilebasic to load a plain HTML version of the document. Then save to Wayback or any other archive service.
  • Google Sheets: Change the end of the URL from /edit to /htmlview to load a plain HTML version of the spreadsheet. Then save to Wayback or any other archive service.
  • Imgur images: No archive service is able to save full-size images or large albums. Use Archive.today or Megalodon to save the page, then download the image or album and upload it to the Internet Archive.
  • Instagram posts: Use Ghostarchive. (Archive.today also works, but only for posts with one or two images.)
  • LinkedIn profiles: Use Archive.today.
  • Mastodon posts: Wayback may be unable to save posts from some instances; use Archive.today instead for those.
  • Microsoft Sway presentations: Use Ghostarchive.
  • News articles (in general): Use Archive.today. Often bypasses paywalls.
  • Peatix event pages: Use Megalodon.
  • Reddit threads:
    • For text threads: Change the URL from www.reddit.com to old.reddit.com, then save to Wayback Machine.
    • For threads with images: Save to Archive.today. Unfortunately, image replies will not be saved unless you also save each <image> link manually. Reddit blocks Ghostarchive so using that is not an option.
  • Threads posts: Use Archive.today or Ghostarchive.
  • Soundcloud tracks: Use Archive.today or Ghostarchive to save the page, then download the track and upload it to the Internet Archive.
    • Some tracks are downloadable at their original quality from the three-dots "More" menu. You'll need a Soundcloud account, but the account does not need a verified email address.
    • If a track is not downloadable, you can still download it at streaming quality with yt-dlp.
  • TikTok videos: Use Ghostarchive to save the page, then download the video with yt-dlp and upload it to the Internet Archive.
  • Tumblr posts: Use Ghostarchive. (Wayback Machine may work for text posts, but it fails to save media.)
  • Vimeo videos: Use Archive.today to save the page, then download the video with yt-dlp and upload it to the Internet Archive.
  • X/Twitter posts:
    • Use Ghostarchive. It is the only service that shows replies and threads because it uses real Twitter accounts.
    • Ghostarchive will sometimes fail. If this happens, you will need to save each individual tweet to Archive.today (or try Ghostarchive again later).
    • If Archive.today also fails, you can try searching for each tweet in Google's cache, then saving the cached tweets to Archive.today. Assuming the tweets are cached by Google, this should work reliably. But media (images, videos) will not be saved.
  • Yahoo Japan Auctions pages: The site blocks Wayback Machine, but all other archive services should work.
  • YouTube videos: Save the YouTube page to Wayback Machine using Save Page Now, then check back in a few days to make sure the video was saved. Ghostarchive can be used instead for shorter videos. Note that no archive service properly saves comments.
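
Several of the tips above boil down to rewriting a URL before saving it (Google Docs, Google Sheets, Reddit). The following Python sketch collects those rewrites in one place. The function names are hypothetical, dropping the query string and fragment from Google links is my own assumption, and the `https://web.archive.org/save/` prefix is the commonly used Save Page Now link form rather than something this guide specifies.

```python
from urllib.parse import urlsplit, urlunsplit


def archive_friendly_url(url: str) -> str:
    """Rewrite a URL into the plain-HTML variant described in the tips above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path

    if host == "docs.google.com" and path.endswith("/edit"):
        base = path[: -len("/edit")]
        if path.startswith("/document/"):
            path = base + "/mobilebasic"  # Google Docs
        elif path.startswith("/spreadsheets/"):
            path = base + "/htmlview"  # Google Sheets
        else:
            return url
        # Assumption: sharing parameters and fragments are not needed in the archive.
        return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

    if host == "www.reddit.com":
        # Text threads: switch to the old.reddit.com layout.
        return urlunsplit((parts.scheme, "old.reddit.com", path, parts.query, parts.fragment))

    return url


def save_page_now_url(url: str) -> str:
    """Build a Save Page Now link for the (possibly rewritten) URL."""
    return "https://web.archive.org/save/" + archive_friendly_url(url)
```

Opening the link returned by `save_page_now_url` in a browser starts a capture; check the result afterwards, since some pages still fail to save.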