Shell function that returns how many captures the Wayback Machine lists for a page/domain
# Fetch the Wayback Machine listing for the given page or domain (60 s timeout),
# read only the first 10 kB, and print the first status line that matches.
# A trailing slash on the argument is turned into "/*" so the whole prefix is listed.
function ia-check() {
  echo $(curl -s -m60 "https://web.archive.org/web/*/$(echo "$*" | sed 's# #%20#g;s#/$#/\*#')" |
    head -c10KB |
    grep -m1 -Poi "(Saved <strong>\d+ time(s)?)|((\d+,)*\d+ URLs have been captured for this domain)|(Page cannot be crawled or displayed due to robots\.txt)|(This URL has been excluded from the Wayback Machine)|(Wayback Machine doesn't have that page archived)|(504 Gateway Time-out)" |
    sed "s#'#'#g;s#<strong>##")
}
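
The URL rewriting inside ia-check can be tried on its own, without contacting the Wayback Machine. The sketch below assumes a helper named `ia_lookup_url`, which is hypothetical and not part of the original function. It applies the same two sed substitutions and prints the resulting lookup URL.

```shell
# Hypothetical helper (not in the original gist): prints the URL that
# ia-check would request for the given arguments.
ia_lookup_url() {
    # Encode spaces as %20 and turn a trailing slash into "/*",
    # mirroring the sed expression used inside ia-check.
    printf 'https://web.archive.org/web/*/%s\n' \
        "$(printf '%s' "$*" | sed 's# #%20#g;s#/$#/*#')"
}

ia_lookup_url www.archiveteam.org/
```

A URL without a trailing slash passes through unchanged, which is why ia-check treats it as a single page and not as a prefix.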
# Examples:
ia-check http://www.archiveteam.org # Save count for the individual page
ia-check http://www.archiveteam.org/ # Number of saved pages from the www sub-domain
ia-check www.archiveteam.org/ # Same as above
ia-check en.wikipedia.org/blabbyblahblah/ # Number of saved pages under the path blabbyblahblah
# Note that for domains with lots of captures, ia-check may time out
# or the numbers printed may be completely inaccurate.
# Also, if a website only partially blocks the Wayback Machine, these numbers could be
# very inaccurate. There are probably other corner cases I haven't thought of.
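
ia-check prints either a count message or an error message, so a script that wants the number has to tell the two apart. The sketch below assumes a helper named `ia_count`, which is hypothetical and not part of the original gist. It assumes the message formats are exactly those in the grep pattern, and it returns a non-zero status for anything that is not a count, so that "504 Gateway Time-out" is not mistaken for 504 captures.

```shell
# Hypothetical helper (not in the original gist): extracts the bare number
# from an ia-check result line, or fails if the line is not a count.
ia_count() {
    case "$1" in
        Saved\ *|*"URLs have been captured"*)
            # Take the first run of digits and commas, then drop the commas.
            printf '%s\n' "$1" | grep -Eo '[0-9][0-9,]*' | head -n1 | tr -d ','
            ;;
        *)
            # Error messages (robots.txt, excluded, not archived, 504) land here.
            return 1
            ;;
    esac
}

ia_count '1,234,567 URLs have been captured for this domain'
```

With the function loaded, something like `ia_count "$(ia-check www.archiveteam.org/)"` would combine the two. The caveats above still apply to whatever number comes back.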