@minhajuddin
Created December 21, 2011 03:29
Script to retrieve content from the Google cache
#!/usr/bin/python
import urllib
import urllib2
import re
import socket
import os
import time
import random
socket.setdefaulttimeout(30)
# adjust the site here
search_site = "minhajuddin.com"
search_term = "site:" + search_site

def main():
    headers = {'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.4) Gecko/20070515 Firefox/2.0.0.4'}
    url = "http://www.google.com/search?q=" + urllib.quote_plus(search_term)
    # the "Cached" links and the "Next" pager link in Google's result markup
    regex_cache = re.compile(r'<a href="([^"]*)"[^>]*>Cached</a>')
    regex_next = re.compile(r'<a href="([^"]*)"[^>]*><span[^>]*>[^<]*</span><span[^>]*>Next</span></a>')
    # this is the directory we will save files to
    try:
        os.mkdir(search_site)
    except OSError:
        pass
    counter = 0
    pagenum = 0
    more = True
    while more:
        pagenum += 1
        print "PAGE " + str(pagenum) + ": " + url
        req = urllib2.Request(url, None, headers)
        page = urllib2.urlopen(req).read()
        matches = regex_cache.findall(page)
        print matches
        for match in matches:
            counter += 1
            # scheme-relative cache links need the protocol prepended
            if not match.startswith("http"):
                match = "http:" + match
            tmp_req = urllib2.Request(match.replace('&amp;', '&'), None, headers)
            tmp_page = urllib2.urlopen(tmp_req).read()
            print counter, ": " + match
            f = open(search_site + "/" + str(counter) + '.html', 'w')
            f.write(tmp_page)
            f.close()
        # comment out the sleep below if you expect to crawl fewer than 50 pages
        random_interval = random.randrange(1, 10, 1)
        print "sleeping for: " + str(random_interval) + " seconds"
        time.sleep(random_interval)
        # now check whether there are more result pages
        match = regex_next.search(page)
        if match is None:
            more = False
        else:
            url = "http://www.google.com" + match.group(1).replace('&amp;', '&')

if __name__ == "__main__":
    main()
# vim: ai ts=4 sts=4 et sw=4
@egeozcan

I don't know how I can thank you and the original author enough. You saved my life (the part of it that was cached, at least) =))

@minhajuddin (Author)

Glad that it helped you :)

@DomeDan commented Mar 30, 2012

Yeah, thank you, and the original author of course!
Though I should inform you, and other people who try this out, that I was only able to download 63-67 pages from the cache before I got:
urllib2.HTTPError: HTTP Error 503: Service Unavailable
When checking the cache manually I see:
"Our systems have detected unusual traffic from your computer network. Please try your request again later."
I kept the time.sleep(random_interval) call and tested this on 2 different IP addresses.
I also tried changing it to random_interval = random.randrange(10, 20, 1) on a third IP, but that didn't help.
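The throttling described above can usually only be worked around by slowing down and retrying. The snippet below is not part of the original script or of any linked gist; it is a minimal sketch of a backoff wrapper around urllib2.urlopen, assuming the same headers dict as the script above, and the function name, retry count and delay values are invented here for illustration.

import time
import urllib2

def fetch_with_backoff(url, headers, retries=5, base_delay=60):
    # fetch a URL, sleeping progressively longer after each 503 response
    for attempt in range(retries):
        try:
            req = urllib2.Request(url, None, headers)
            return urllib2.urlopen(req).read()
        except urllib2.HTTPError as e:
            if e.code != 503:
                raise
            delay = base_delay * (2 ** attempt)  # 60s, 120s, 240s, ...
            print "got 503, sleeping for " + str(delay) + " seconds"
            time.sleep(delay)
    raise Exception("still throttled after %d retries: %s" % (retries, url))

Calls to urllib2.urlopen(req).read() in the main loop could then be swapped for fetch_with_backoff(url, headers), though a sufficiently aggressive block from Google may still outlast any reasonable backoff.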

@eqhes commented Aug 18, 2012

I've added a "-u" param to pass the website URL from the command line: https://gist.github.com/3388552
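The linked gist is not reproduced here, but for anyone reading along, a command-line option like that could be wired in with optparse (the standard option parser in Python 2). This is only a sketch under that assumption; the option and variable names are guesses, not the linked code.

from optparse import OptionParser

parser = OptionParser(usage="usage: %prog -u SITE")
parser.add_option("-u", "--url", dest="site",
                  help="site whose Google cache should be downloaded")
(options, args) = parser.parse_args()
if not options.site:
    parser.error("a site must be given with -u")

# these would replace the hard-coded values at the top of the script
search_site = options.site
search_term = "site:" + search_site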

@qtproduction

This is my version, which fixes the 503 error: https://gist.github.com/3787790
