@anandology
Created March 22, 2012 05:54
Liveweb Proxy Design
Liveweb is an HTTP proxy.
Request:
GET http://www.example.com/foo.html HTTP/1.1
Header1: Value1
Header2: Value2
Response:
HTTP/1.1 200 OK
Content-Type: application/arc-gz
Content-Length: 1234
<payload-of-arc.gz>
And for the non-200 status case:
404 Response:
HTTP/1.1 404 Not Found
Content-Type: application/arc-gz
Content-Length: 1234
<payload-of-arc.gz>
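The response framing above can be sketched as a small helper. This is only a minimal sketch, not the proxy's actual code: the function name `build_proxy_response` and its parameters are hypothetical, and it assumes the proxy mirrors the upstream status line while always sending the arc.gz payload as the body, as the two examples suggest.

```python
def build_proxy_response(status, reason, arc_gz_payload):
    """Frame an arc.gz payload as an HTTP/1.1 response (hypothetical helper)."""
    head = (
        "HTTP/1.1 %d %s\r\n"
        "Content-Type: application/arc-gz\r\n"
        "Content-Length: %d\r\n"
        "\r\n" % (status, reason, len(arc_gz_payload))
    )
    # Headers are ASCII text; the body is the raw arc.gz bytes.
    return head.encode("ascii") + arc_gz_payload


ok = build_proxy_response(200, "OK", b"<payload-of-arc.gz>")
missing = build_proxy_response(404, "Not Found", b"<payload-of-arc.gz>")
```

In both cases the content type stays `application/arc-gz`; only the status line differs.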
What if the server is not reachable?
??
"""Simple implementation of liveweb proxy.

This implementation doesn't store the downloaded content.
"""

def get(url):
    """This is called on every proxy request."""
    return fetch(url)

def fetch(url):
    # Download the URL from the web and return the contents of the arc.gz.
    pass
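The `fetch` stub above is expected to return the contents of an arc.gz. As a rough illustration of what that payload might look like, here is a minimal sketch that wraps already-downloaded HTTP response bytes in one gzipped ARC-style record. The function name `make_arc_gz_record`, its parameters, and the sample values are all hypothetical, and the header layout (URL, IP address, 14-digit date, content type, length) is an assumption based on the ARC version 1 URL-record format, not something the design note specifies.

```python
import gzip


def make_arc_gz_record(url, ip, timestamp, content_type, http_response):
    """Wrap raw HTTP response bytes in a single gzipped ARC-style record.

    Assumed header layout: URL, IP address, 14-digit date, content type, length.
    """
    header = "%s %s %s %s %d\n" % (
        url, ip, timestamp, content_type, len(http_response))
    # One header line, the response bytes, then a trailing newline.
    record = header.encode("utf-8") + http_response + b"\n"
    return gzip.compress(record)


payload = make_arc_gz_record(
    "http://www.example.com/foo.html",
    "192.0.2.1",          # documentation-only address, not a real lookup
    "20120322055400",
    "text/html",
    b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>",
)
```

A real `fetch` would also need to perform the download and resolve the server's IP address; both are left out here.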
"""The real implementation stores the downloaded content on disk
and uploads it to the IA cluster periodically.

Memcache is also used to handle duplicate requests.
"""
import hashlib
import os

# memcache_client is assumed to be an already-configured memcache client.

def get(url):
    """Fetches the URL from the web, saves it to disk and returns
    the content of the arc.gz.
    """
    # check in memcache whether this URL was downloaded recently
    h = hashlib.md5(url.encode("utf-8")).hexdigest()
    location = memcache_client.get(h)
    if location:
        # if so, just read that file
        filename, offset, size = location.split()
        # arc.gz is binary; offset and size come back as strings
        with open(filename, "rb") as f:
            f.seek(int(offset))
            content = f.read(int(size))
    else:
        content = fetch(url)
        location = write_arc_file(prefix=h, content=content)
        # update memcache so that subsequent requests for the same URL
        # can be handled with just a disk read
        memcache_client.set(h, location)
    return content

def write_arc_file(prefix, content):
    path = get_path(prefix)
    # write to a tmp file; content is binary arc.gz data
    with open(path + ".tmp", "wb") as f:
        f.write(content)
    # rename after the write is done, so readers never see a partial file
    os.rename(path + ".tmp", path)
    # Since the arc file has just a single record, the offset is always zero
    location = "%s 0 %s" % (path, len(content))
    return location

def get_path(prefix):
    # Returns the absolute path of the file to store the arc record.
    # This can use a single disk, cycle through multiple disks, or
    # hash the prefix to pick the disk.
    pass
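Of the three strategies mentioned for `get_path`, hashing the prefix is perhaps the easiest to sketch. The version below is one possible approach under stated assumptions, not the author's implementation: the `DISKS` mount points and the `.arc.gz` file extension are hypothetical, and `zlib.crc32` is used only because it gives the same result on every run.

```python
import os
import zlib

# Hypothetical mount points; the source does not name any disks.
DISKS = ["/disk1/liveweb", "/disk2/liveweb", "/disk3/liveweb"]


def get_path(prefix, disks=DISKS):
    """Pick a disk by hashing the prefix and return an absolute .arc.gz path."""
    # crc32 is stable across runs, unlike the built-in hash() for strings.
    index = zlib.crc32(prefix.encode("utf-8")) % len(disks)
    return os.path.join(disks[index], prefix + ".arc.gz")
```

Because the mapping is deterministic, the same prefix always lands on the same disk, so a location stored in memcache stays consistent with where a repeated write for the same URL would go.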