n0samu/web-archival-guide.md

## web-archival-guide.md

      
In this brief guide, I will share what I've learned about archiving live webpages and recovering deleted webpages using various archive services.

Before you start

I highly recommend installing the Web Archives extension in your web browser (Chrome, Firefox).
It provides quick access to archive services and search engine caches.

Overview of archive services & caches


Wayback Machine: Best and largest webpage archive. You probably already use it. Wayback Machine is the only archival service that also performs automated crawling, so its coverage is much better than other archives. See usage tips on Wikipedia.
Archive.today (also known as Archive.is): A mid-sized service that takes a different approach from Wayback Machine; it executes all page JavaScript, then saves the resulting HTML content as a static page. This approach works well for some JavaScript-heavy pages. See information and usage tips on Wikipedia. Notably, Archive.today is able to back up captures from other services, automatically detecting a capture's original URL and indexing it accordingly. This allows other users to find the archived page by searching for its original URL.
Ghostarchive: A smaller archive service that uses the Webrecorder suite to store and replay webpage captures. This works particularly well for interactive JavaScript-heavy pages. Ghostarchive also has custom setups for capturing posts on X/Twitter, YouTube, and possibly other services. See usage tips on Wikipedia.
Conifer: A very small and specialized web archiving service that allows registered users to create collections of webpage captures using Webrecorder. Collections can be downloaded as WARC files or shared publicly. To learn how to use Conifer, read the User Guide.
Megalodon: A small web archiving service based in Japan. Refuses to archive pages on sites that use robots.txt. Not particularly good for archiving complex pages, but useful for archiving pages from some Japanese sites. See usage tips on Wikipedia.
FreezePage: A very old webpage capture service that does not execute JavaScript. Captures are deleted after 30 days of account inactivity, or after 3 days for unregistered users! Do not use FreezePage for permanent archival! You should immediately save any FreezePage captures to Archive.today.
Yandex cache: Accessible from a dropdown menu in Yandex search results. Yandex's cache is quite comprehensive, but not every result will have a cache available. In my experience, though, searching for URLs on Yandex often triggers a CAPTCHA.
Google cache: Formerly accessible using the cache: operator in Google search. Unfortunately, Google removed its cache feature entirely in September 2024, so this no longer works.
Bing cache: Formerly accessible from a dropdown menu in Bing search results. Unfortunately, Bing removed its cache feature in December 2024.

Recovering a deleted webpage

If you come across a webpage that has been deleted, you can use the Web Archives extension to try to find an existing archived or cached version of it.
Just click the extension icon, then click the service that you want to check. Or click "All Search Engines" to check all of them at once!
If the page you're looking for immediately redirects or is completely inaccessible, you can still use the extension. Just click the "Tab" dropdown in the top-left corner of the extension popup, then switch to "URL" and paste in the URL you want to look for.

Generally I check these services in order:

Wayback Machine
Archive.today
Ghostarchive
Yandex cache
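
If you would rather check these services by hand than through the extension, the lookup order above can be sketched as a small helper that builds one lookup URL per service. The URL patterns below are my assumptions about how each site's lookup page is usually linked; they are not taken from this guide, so verify them in a browser before relying on them. Yandex cache is left out because it has to be reached through search results. The function name `lookup_urls` is hypothetical.

```python
from urllib.parse import quote


def lookup_urls(page_url: str) -> list[tuple[str, str]]:
    """Return (service, lookup URL) pairs in the order the guide checks them.

    All three URL patterns are assumptions, not documented guarantees.
    """
    return [
        # Wayback Machine calendar view listing every capture of the URL.
        ("Wayback Machine", "https://web.archive.org/web/*/" + page_url),
        # Archive.today lists existing snapshots when the URL is appended.
        ("Archive.today", "https://archive.today/" + page_url),
        # Ghostarchive search; the URL is percent-encoded as a query value.
        ("Ghostarchive",
         "https://ghostarchive.org/search?term=" + quote(page_url, safe="")),
    ]


if __name__ == "__main__":
    for service, url in lookup_urls("https://example.com/deleted-page"):
        print(f"{service}: {url}")
```

Open each printed link in turn and stop at the first service that has a usable capture.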

Note that search engine caches are not permanent! If you find a deleted webpage in Yandex's cache, you will need to save it to a permanent archive. Unfortunately, Yandex seems to block the Wayback Machine. Currently I recommend saving pages from Yandex to Archive.today. To do this, you will need to manually navigate to Archive.today and paste the Yandex URL into the "My url is alive and I want to archive its content" box. Other methods tend to mangle the URL and cause the capture to break.
Here are some extra tips for specific sites:

YouTube Videos: Use the YouTube Video Finder service.
X/Twitter: Wayback Machine is no longer able to archive tweets in replayable form, but some tweets are still saved in JSON form. To access JSON captures, change the URL from x.com to twitter.com and search for the first capture. As of May 2025, Wayback Machine displays HTML previews of these captures, so viewing them is much more convenient than before. Here is an example.
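
The x.com to twitter.com rewrite described above can be sketched as follows. This is a minimal illustration, not an official tool: the function name `wayback_tweet_lookup` is hypothetical, the list of x.com host variants is my assumption, and dropping the query string (tracking parameters such as `?s=20`) is my own choice. The result points at the Wayback Machine calendar view, where you can pick the first capture.

```python
from urllib.parse import urlsplit, urlunsplit

# Host names treated as X/Twitter; this list is an assumption.
X_HOSTS = {"x.com", "www.x.com", "mobile.x.com"}


def wayback_tweet_lookup(tweet_url: str) -> str:
    """Rewrite an x.com tweet URL to twitter.com and build a Wayback lookup URL."""
    parts = urlsplit(tweet_url)
    host = parts.netloc.lower()
    if host in X_HOSTS:
        host = "twitter.com"
    # Drop the query string and fragment; they are not part of the tweet's address.
    rewritten = urlunsplit((parts.scheme, host, parts.path, "", ""))
    return "https://web.archive.org/web/*/" + rewritten


if __name__ == "__main__":
    print(wayback_tweet_lookup("https://x.com/example/status/123?s=20"))
```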

Unfortunately, the chances of recovering a deleted webpage are much lower than when I first wrote this guide. In 2024, both Google and Bing shut down public access to their caches; the only significant automated crawling services that still provide public access are Wayback Machine and Yandex. Neither is likely to capture ephemeral pages like social media posts. Additionally, more and more websites are blocking automated crawling as they recognize the value of their content and attempt to tighten control over how it is used. Now more than ever before, if you hope to ever revisit a piece of content on the internet, you will need to save it yourself!

Archiving a live webpage

It's important to proactively save webpages you visit to ensure you can return to them later. Wayback Machine's Save Page Now service usually works well for this, but some sites are trickier; here I will provide tips for dealing with those sites. Audio and video content often must be downloaded manually; the tool I recommend for this is yt-dlp, but if you're not comfortable using the command line, cobalt is a great alternative.
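
As a rough sketch of the two basic steps, the helpers below build a Save Page Now link and a yt-dlp command line. The `https://web.archive.org/save/` prefix is the commonly used Save Page Now link form rather than something stated in this guide. The yt-dlp options used (`--write-info-json`, `-P`, `-o`) are real options, but choosing them, the output template, and the `downloads` directory are my own assumptions. The function names are hypothetical.

```python
import subprocess


def save_page_now_url(page_url: str) -> str:
    """Build a link that asks Wayback Machine to capture the page when opened."""
    return "https://web.archive.org/save/" + page_url


def yt_dlp_command(media_url: str, out_dir: str = "downloads") -> list[str]:
    """Build a yt-dlp command that saves the media plus a metadata JSON file."""
    return [
        "yt-dlp",
        "--write-info-json",                 # keep title, uploader, date, etc.
        "-P", out_dir,                       # directory to download into
        "-o", "%(title)s [%(id)s].%(ext)s",  # file name template
        media_url,
    ]


def download(media_url: str) -> None:
    """Run yt-dlp; requires yt-dlp to be installed and on PATH."""
    subprocess.run(yt_dlp_command(media_url), check=True)


if __name__ == "__main__":
    print(save_page_now_url("https://example.com/article"))
    print(" ".join(yt_dlp_command("https://example.com/video")))
```

Keeping the metadata file next to the media makes it easier to fill in the item description when you upload it to the Internet Archive.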

Airtable: Use Archive.today to save the page, then download each table as a CSV file using the three-dots or dropdown menu and upload the files to the Internet Archive.
Binary/raw files: Wayback Machine and Megalodon support archiving raw files, but Archive.today and Ghostarchive do not; they only support archiving webpages.
Bluesky posts: Use Wayback or Archive.today. (Wayback Machine did not work in the past, but as of June 2025 it does.)
eBay auction pages: Use Archive.today. (Other services do not save full-size images.)
Facebook posts: Use Archive.today. To save videos, download them with yt-dlp and upload them to the Internet Archive.
Google Docs: Change the end of the URL from /edit to /mobilebasic to load a plain HTML version of the document. Then save to Wayback or any other archive service.
Google Sheets: Change the end of the URL from /edit to /htmlview to load a plain HTML version of the spreadsheet. Then save to Wayback or any other archive service.
Imgur images: No archive service is able to save full-size images or large albums. Use Archive.today or Megalodon to save the page, then download the image or album and upload it to the Internet Archive.
Instagram posts: Use Archive.today. Unfortunately, on posts with more than two images, the latter images may fail to save.
Linktree pages: Use Ghostarchive or Megalodon.
Mastodon posts: Wayback may be unable to save posts from some instances; use Archive.today instead for those.
Microsoft Sway presentations: Use Ghostarchive.
News articles (in general): Use Archive.today. It often bypasses paywalls.
Peatix event pages: Use Megalodon.
Reddit threads:

For text threads: Change the URL from www.reddit.com to old.reddit.com, then save to Wayback Machine.
For threads with images: Save to Archive.today. Unfortunately, image replies will not be saved unless you also save each <image> link manually. Reddit blocks Ghostarchive, so that is not an option.


Threads posts: Use Archive.today or Ghostarchive.
Soundcloud tracks: Use Archive.today or Ghostarchive to save the page, then download the track and upload it to the Internet Archive.

Some tracks are downloadable at their original quality from the three-dots "More" menu. You'll need a Soundcloud account, but the account does not need a verified email address.
If a track is not downloadable, you can still download it at streaming quality with yt-dlp.


TikTok videos: Use Conifer, the only archive service that supports playback of TikTok videos. Alternatively, use Ghostarchive or Wayback to save the page, then save the video and upload it to the Internet Archive. Many TikTok videos can be saved by right-clicking them and selecting "Download video"; if downloads are disabled, yt-dlp can still save the video.
Tumblr posts:

Archive.today works very reliably, but it sometimes fails to save posts from the Tumblr dashboard (URLs of the form www.tumblr.com/exampleblog/ID). It also cannot save videos.
If Archive.today fails to save a post, try capturing it with FreezePage and immediately saving the FreezePage capture to Archive.today.
Megalodon often works for blogs with custom themes (URLs of the form exampleblog.tumblr.com/post/ID).
Wayback Machine and Ghostarchive are able to save text posts, but fail to save images. Strangely, Wayback Machine is sometimes able to save videos; here is an example.


Vimeo videos: Use Archive.today to save the page, then download the video with yt-dlp and upload it to the Internet Archive.
X/Twitter posts:

For individual tweets with images: Use Archive.today or Megalodon.
For threads: Use Ghostarchive. It is the only service that shows replies and threads because it uses real Twitter accounts. However, as of February 2025 it sometimes fails to save media.
For videos: Download the video with yt-dlp and upload it to the Internet Archive.


Yahoo Japan Auctions pages: The site blocks Wayback Machine, but all other archive services should work.
YouTube videos: Save the YouTube page to Wayback Machine using Save Page Now, then check back in a few days to make sure the video was saved. Ghostarchive can be used instead for shorter videos. Megalodon may be used to save some of the top comments, excluding any replies. Megalodon is also able to save YouTube Community posts.
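
Several of the tips above come down to rewriting a URL before saving it: /edit to /mobilebasic for Google Docs, /edit to /htmlview for Google Sheets, and www.reddit.com to old.reddit.com for Reddit text threads. A minimal sketch of those three rewrites follows. The function name `archive_friendly_url` is hypothetical, and the assumptions that Google Docs and Sheets links live under `docs.google.com/document/` and `docs.google.com/spreadsheets/`, and that sharing parameters such as `?usp=sharing` or `#gid=0` can be dropped, are mine rather than the guide's.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def archive_friendly_url(url: str) -> str:
    """Rewrite a URL into the plain HTML form suggested in the tips above."""
    parts = urlsplit(url)
    host, path = parts.netloc.lower(), parts.path

    if host == "docs.google.com" and path.startswith("/document/"):
        # Google Docs: /edit -> /mobilebasic
        path = re.sub(r"/edit$", "/mobilebasic", path)
        return urlunsplit((parts.scheme, host, path, "", ""))

    if host == "docs.google.com" and path.startswith("/spreadsheets/"):
        # Google Sheets: /edit -> /htmlview
        path = re.sub(r"/edit$", "/htmlview", path)
        return urlunsplit((parts.scheme, host, path, "", ""))

    if host == "www.reddit.com":
        # Reddit text threads: use the old.reddit.com layout
        return urlunsplit((parts.scheme, "old.reddit.com", path, parts.query, ""))

    # Anything else is returned unchanged.
    return url


if __name__ == "__main__":
    print(archive_friendly_url("https://docs.google.com/document/d/abc123/edit?usp=sharing"))
    print(archive_friendly_url("https://www.reddit.com/r/test/comments/xyz/title/"))
```

Load the rewritten URL in a browser first to confirm that the page renders, then submit it to Wayback Machine or another archive service.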