Script to retrieve content from google cache
import urllib
import urllib2
import re
import socket
import os
import time
import random

#adjust the site here
search_site = "example.com"
search_term = "site:" + search_site

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4'}
    url = "http://www.google.com/search?q=" + urllib.quote(search_term)
    regex_cache = re.compile(r'<a href="([^"]*)"[^>]*>Cached</a>')
    regex_next = re.compile(r'<a href="([^"]*)"[^>]*><span[^>]*>[^<]*</span><span[^>]*>Next</span></a>')

    #this is the directory we will save files to
    if not os.path.exists(search_site):
        os.mkdir(search_site)

    counter = 0
    pagenum = 0
    more = True
    while more:
        pagenum += 1
        print "PAGE "+str(pagenum)+": "+url
        req = urllib2.Request(url, None, headers)
        page = urllib2.urlopen(req).read()
        matches = regex_cache.findall(page)
        print matches
        for match in matches:
            counter += 1
            if not match.startswith("http"):
                match = "http:" + match
            tmp_req = urllib2.Request(match.replace('&amp;','&'), None, headers)
            tmp_page = urllib2.urlopen(tmp_req).read()
            print counter,": "+match
            f = open(search_site + "/" + str(counter)+'.html','w')
            f.write(tmp_page)
            f.close()
            #comment out the code below if you expect to crawl less than 50 pages
            random_interval = random.randrange(15, 20, 1)
            print "sleeping for: " + str(random_interval) + " seconds"
            time.sleep(random_interval)
        #now check if there are more pages
        match = regex_next.search(page)
        if match == None:
            more = False
        else:
            url = "http://www.google.com" + match.group(1).replace('&amp;','&')

if __name__ == "__main__":
    main()

# vim: ai ts=4 sts=4 et sw=4
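The two regular expressions do all the scraping work. A small self-contained check (Python 3; the HTML snippet is synthetic, written to mimic the old Google result markup, which has long since changed) shows what they capture:

```python
import re

# Same patterns as in the script above, Python 3 syntax.
regex_cache = re.compile(r'<a href="([^"]*)"[^>]*>Cached</a>')
regex_next = re.compile(r'<a href="([^"]*)"[^>]*><span[^>]*>[^<]*</span><span[^>]*>Next</span></a>')

# Synthetic snippet imitating the old result page markup (assumption: the
# real markup varied and no longer looks like this).
page = ('<a href="//webcache.googleusercontent.com/search?q=cache:abc" class="x">Cached</a>'
        '<a href="/search?q=site:example.com&amp;start=10" id="pnnext">'
        '<span style="s">&gt;</span><span style="t">Next</span></a>')

cached_links = regex_cache.findall(page)   # URLs of the cached copies
next_match = regex_next.search(page)       # relative URL of the next results page

print(cached_links)
print(next_match.group(1))
```

Regex scraping like this breaks whenever Google changes its markup, which is why the script eventually stops working as-is.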

egeozcan commented Jan 14, 2012

Don't know how much I can thank you and the original author. You saved my life (the part of it which was cached, at least) =))


minhajuddin commented Jan 16, 2012

Glad that it helped you :)


DomeDan commented Mar 30, 2012

Yeah, thank you, and the original author of course!
Though I should let you and anyone else trying this know that I was only able to download 63-67 pages from the cache before I got:
urllib2.HTTPError: HTTP Error 503: Service Unavailable
When checking the cache manually I see:
"Our systems have detected unusual traffic from your computer network. Please try your request again later."
I used time.sleep(random_interval) and tested this on two different IP addresses.
I tried changing to random_interval=random.randrange(10,20,1) and tried a third IP, but that didn't help.
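That 503 is Google's rate limiter kicking in, so tweaking the sleep interval alone rarely helps. A common workaround is exponential backoff with jitter between retries; below is a minimal Python 3 sketch (the `fetch_with_backoff` helper and its delay schedule are illustrative assumptions, not part of the original script):

```python
import random
import time

def backoff_delays(attempts, base=60, cap=3600):
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped at `cap`."""
    return [min(base * (2 ** i), cap) for i in range(attempts)]

def fetch_with_backoff(fetch, url, attempts=5):
    """Call fetch(url); on failure, sleep progressively longer and retry.

    `fetch` is any callable that raises on an error response
    (e.g. a urllib opener that raises HTTPError on a 503).
    """
    for delay in backoff_delays(attempts):
        try:
            return fetch(url)
        except Exception:
            # Jitter so repeated runs don't retry in lockstep.
            time.sleep(delay + random.uniform(0, 5))
    raise RuntimeError("giving up after %d attempts" % attempts)
```

Even with backoff, aggressive crawling of the cache will eventually trip the same detection; spreading requests over hours is the safer pattern.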


eqhes commented Aug 18, 2012

I've added a "-u" param to pass the website URL from the command line:
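eqhes's actual patch isn't shown here, but a hypothetical sketch of such a flag using argparse (Python 3; all names are assumptions) might look like:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch: one way a -u flag could feed search_site
    # into the script; not eqhes's actual code.
    parser = argparse.ArgumentParser(
        description="Retrieve pages from Google cache")
    parser.add_argument("-u", "--url", required=True, dest="search_site",
                        help="site whose cached pages should be downloaded")
    return parser.parse_args(argv)

args = parse_args(["-u", "example.com"])
search_term = "site:" + args.search_site
print(search_term)  # → site:example.com
```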


qtproduction commented Sep 26, 2012

This is my version, with a fix for the 503 error.
