
Created December 21, 2011 03:29
Script to retrieve content from google cache
import urllib
import urllib2
import re
import socket
import os
import time
import random

socket.setdefaulttimeout(30)

# adjust the site here ("example.com" is a placeholder; the original value was elided)
search_site = "example.com"
search_term = "site:" + search_site

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv: Gecko/20070515 Firefox/'}
    url = "http://www.google.com/search?q=" + urllib.quote(search_term)
    regex_cache = re.compile(r'<a href="([^"]*)"[^>]*>Cached</a>')
    regex_next = re.compile(r'<a href="([^"]*)"[^>]*><span[^>]*>[^<]*</span><span[^>]*>Next</span></a>')

    # this is the directory we will save files to
    if not os.path.exists(search_site):
        os.mkdir(search_site)

    counter = 0
    pagenum = 0
    more = True
    while more:
        pagenum += 1
        print "PAGE " + str(pagenum) + ": " + url
        req = urllib2.Request(url, None, headers)
        page = urllib2.urlopen(req).read()
        matches = regex_cache.findall(page)
        print matches
        for match in matches:
            counter += 1
            if not match.startswith("http"):
                match = "http:" + match
            tmp_req = urllib2.Request(match.replace('&amp;', '&'), None, headers)
            tmp_page = urllib2.urlopen(tmp_req).read()
            print counter, ": " + match
            f = open(search_site + "/" + str(counter) + '.html', 'w')
            f.write(tmp_page)
            f.close()
            # comment out the code below if you expect to crawl fewer than 50 pages
            random_interval = random.randrange(15, 20, 1)
            print "sleeping for: " + str(random_interval) + " seconds"
            time.sleep(random_interval)
        # now check if there are more pages
        match = regex_next.search(page)
        if match is None:
            more = False
        else:
            url = "http://www.google.com" + match.group(1).replace('&amp;', '&')

if __name__ == "__main__":
    main()

# vim: ai ts=4 sts=4 et sw=4
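The script above targets Python 2 (urllib2 and print statements). For reference, the same cached-page request under Python 3 uses urllib.request instead, since urllib2 was folded into it — a minimal sketch, where the URL and User-Agent values are placeholders, not taken from the gist:

```python
import urllib.request

# Python 3 sketch: urllib2's Request/urlopen now live in urllib.request.
# The URL and User-Agent below are placeholders, not values from the gist.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0"}
url = "http://www.google.com/search?q=site:example.com"
req = urllib.request.Request(url, headers=headers)
# page = urllib.request.urlopen(req).read()  # performs the actual network call
```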

Don't know how much I can thank you and the original author. You saved my life (the part of it which was cached, at least) =))


Glad that it helped you :)


DomeDan commented Mar 30, 2012

Yeah, thank you, and the original author of course!
Though I should let you and anyone else trying this out know that I was only able to download 63-67 pages from the cache before I got:
urllib2.HTTPError: HTTP Error 503: Service Unavailable
When checking the cache manually I see:
"Our systems have detected unusual traffic from your computer network. Please try your request again later."
I used time.sleep(random_interval) and tested this on 2 different IP addresses.
I also tried changing to random_interval=random.randrange(10,20,1) on a third IP, but that didn't help.


eqhes commented Aug 18, 2012

I've added a "-u" param to pass the website URL from the command line:
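eqhes's actual patch isn't shown above; a minimal sketch of what a "-u" option could look like with argparse follows. The flag name comes from the comment, but the helper name, long option, and help text are assumptions:

```python
import argparse

def parse_args(argv):
    # Hypothetical reconstruction of a "-u" option; only the flag name
    # comes from the comment above, the rest is illustrative.
    parser = argparse.ArgumentParser(description="Fetch Google-cached pages for a site")
    parser.add_argument("-u", "--url", required=True,
                        help="site to search, e.g. example.com")
    return parser.parse_args(argv)
```

The parsed value would then replace the hard-coded search_site at the top of the script.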


This is my version, which works around the 503 error:
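That version isn't shown here. One common way to soften 503 throttling is exponential backoff with jitter around each fetch — a minimal sketch, where fetch_with_backoff and all its parameter names are illustrative, not from any commenter's actual code:

```python
import random
import time

def fetch_with_backoff(fetch, max_tries=5, base_delay=10.0, sleep=time.sleep):
    """Call fetch(); on failure, wait base_delay * 2**attempt plus random
    jitter, then retry. fetch is any zero-argument callable (e.g. a lambda
    wrapping urlopen); this helper is a sketch, not part of the gist."""
    for attempt in range(max_tries):
        try:
            return fetch()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries; let the caller see the error
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Backoff spreads requests out more and more as failures accumulate, which is usually gentler on rate limiters than a fixed random sleep.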
