@garyrh
Last active March 26, 2022 02:41
Shell function that returns how many captures the Wayback Machine lists for a page/domain
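# Requires curl plus GNU grep (for -P) and GNU head (for the KB size suffix).
function ia-check() {
  # Request the Wayback Machine capture listing for the given URL and print the
  # first recognised status phrase found in the first 10 kB of the response.
  # Spaces in the argument become %20; a trailing "/" becomes "/*" (prefix query).
  echo $(curl -s -m60 "https://web.archive.org/web/*/$(echo "$*" | sed 's# #%20#g;s#/$#/\*#')" |
    head -c10KB |
    grep -m1 -Poi "(Saved <strong>\d+ time(s)?)|((\d+,)*\d+ URLs have been captured for this domain)|(Page cannot be crawled or displayed due to robots\.txt)|(This URL has been excluded from the Wayback Machine)|(Wayback Machine doesn&apos;t have that page archived)|(504 Gateway Time-out)" |
    sed "s#&apos;#'#g;s#<strong>##")
}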
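The URL rewriting inside ia-check can be tried on its own, without touching the network. The sketch below pulls that step into a separate function; the name `ia_query_url` is hypothetical and not part of the original gist, and the sed expression is copied from the function above.

```shell
# Hypothetical helper: prints the URL that ia-check would request.
ia_query_url() {
  # Same rewriting as ia-check: spaces -> %20, trailing "/" -> "/*".
  printf 'https://web.archive.org/web/*/%s\n' \
    "$(printf '%s\n' "$*" | sed 's# #%20#g;s#/$#/\*#')"
}

ia_query_url http://www.archiveteam.org    # single-page query, URL left as-is
ia_query_url http://www.archiveteam.org/   # prefix query, URL ends in /*
```

This makes it easy to see why a trailing slash changes the result: only the second form asks the Wayback Machine for everything under that prefix.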
# Examples:
ia-check http://www.archiveteam.org # Save count for the individual page
ia-check http://www.archiveteam.org/ # Number of saved pages from the www subdomain
ia-check www.archiveteam.org/ # Same as above
ia-check en.wikipedia.org/blabbyblahblah/ # Number of saved pages under the path blabbyblahblah
# Note that for domains with lots of captures, ia-check may time out
# or the numbers printed may be completely inaccurate.
# Also, if a website only partially blocks the Wayback Machine, these numbers could be
# very inaccurate. There are probably other corner cases I haven't thought of.
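Because ia-check prints either a count phrase or a status message (including the time-out case noted above), a script that consumes its output may want to tell the two apart. The following is a minimal sketch, assuming the output has the shapes matched by the grep pattern in ia-check; `ia_count` is a hypothetical helper and not part of the original gist.

```shell
# Hypothetical helper: prints the bare number from an ia-check result,
# or nothing if the result is a status message rather than a count.
ia_count() {
  # Accept only the two count-bearing phrases. Status messages such as
  # "504 Gateway Time-out" also contain digits and must not pass as counts.
  printf '%s\n' "$1" |
    grep -Eio '^(Saved [0-9]+ times?|([0-9]+,)*[0-9]+ URLs have been captured)' |
    grep -Eo '[0-9][0-9,]*' |
    head -n1 |
    tr -d ','
}

ia_count "12,345 URLs have been captured for this domain"
ia_count "504 Gateway Time-out"
```

In a script this could be used as `count=$(ia_count "$(ia-check example.org/)")`, treating an empty `count` as "no usable number". The caveats about inaccurate numbers still apply to whatever value comes back.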