Created
March 22, 2012 05:54
Liveweb Proxy Design
Liveweb is an HTTP proxy.

Request:

    GET http://www.example.com/foo.html HTTP/1.1
    Header1: Value1
    Header2: Value2

Response:

    HTTP/1.1 200 OK
    Content-Type: application/arc-gz
    Content-Length: 1234

    <payload-of-arc.gz>

And for the non-200 status case:

404 Response:

    HTTP/1.1 404 Not Found
    Content-Type: application/arc-gz
    Content-Length: 1234

    <payload-of-arc.gz>

What if the server is not reachable?

    ??
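
The request above uses the absolute-URI form that HTTP proxies expect, and the response carries the arc.gz payload for both the 200 and the 404 case. A minimal sketch of how a client might build such a request and split a raw response is shown here. The names `build_proxy_request` and `split_response` are hypothetical and not part of the design. Sending the bytes over a socket is left out.

```python
def build_proxy_request(url, headers=None):
    """Return the raw bytes of a proxy-style GET request for `url`."""
    # The request line carries the full URL, not just the path.
    lines = ["GET %s HTTP/1.1" % url]
    # HTTP/1.1 also requires a Host header; pass it in `headers`.
    for name, value in (headers or {}).items():
        lines.append("%s: %s" % (name, value))
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def split_response(raw):
    """Split a raw HTTP response into (status_code, headers, payload)."""
    head, _, payload = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    status_code = int(status_line.split(" ", 2)[1])
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status_code, headers, payload
```

Since the body is an arc.gz payload in both cases, a client could treat the payload the same way either time and use the status code only to learn the upstream result.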
"""Simple implementation of liveweb proxy.

This implementation doesn't store the downloaded content.
"""

def get(url):
    """This is called on every proxy request."""
    return fetch(url)

def fetch(url):
    # Downloads the URL from the web and returns the contents of arc.gz.
    pass
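
The source leaves `fetch` unimplemented and does not say how the arc.gz payload is built. As a rough illustration only, the sketch below wraps an already-downloaded raw HTTP response in a single gzipped record. It assumes the ARC version 1 record layout: a one-line header with URL, IP address, archive date, content type, and length, followed by the response bytes. The function `make_arc_gz` and its parameters are hypothetical. A complete ARC file would also begin with a file-description record, which is omitted here.

```python
import gzip
import time


def make_arc_gz(url, ip_address, http_response, content_type="text/html",
                timestamp=None):
    """Wrap a raw HTTP response (bytes) in one gzipped ARC-style record."""
    if timestamp is None:
        timestamp = time.gmtime()
    # Archive date in YYYYMMDDhhmmss form (UTC).
    archive_date = time.strftime("%Y%m%d%H%M%S", timestamp)
    # Header line: URL, IP address, archive date, content type, length.
    header = "%s %s %s %s %d\n" % (
        url, ip_address, archive_date, content_type, len(http_response))
    # A trailing newline separates the record from any following record.
    record = header.encode("utf-8") + http_response + b"\n"
    return gzip.compress(record)
```

A `fetch` built on this would download the URL, resolve the server's IP address, and pass the raw response to `make_arc_gz`. Those steps are not shown because the source does not describe them.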
"""The real implementation stores the downloaded content on disk
and uploads to ia cluster periodically.

Also memcache is used to handle duplicate requests.
"""
import hashlib
import os

# `memcache_client` and `fetch` are assumed to be defined elsewhere.

def md5sum(text):
    # Assumed helper: hex MD5 digest of the URL, used as the cache key.
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def get(url):
    """Fetches the url from web, saves it on to disk and returns
    the content of arc.gz.
    """
    # check in memcache if this URL was recently downloaded
    h = md5sum(url)
    location = memcache_client.get(h)
    if location:
        # if so, just read that file
        filename, offset, size = location.split()
        # arc.gz is binary; offset and size come back as strings
        with open(filename, "rb") as f:
            f.seek(int(offset))
            content = f.read(int(size))
    else:
        content = fetch(url)
        location = write_arc_file(prefix=h, content=content)
        # update memcache so that subsequent requests to same URL
        # can be handled just by a disk read
        memcache_client.set(h, location)
    return content

def write_arc_file(prefix, content):
    path = get_path(prefix)
    # write to tmp file (binary mode, since the content is gzipped)
    with open(path + ".tmp", "wb") as f:
        f.write(content)
    # rename after write is done
    os.rename(path + ".tmp", path)
    # Since the arc file has just a single record, offset is always zero
    location = "%s 0 %s" % (path, len(content))
    return location

def get_path(prefix):
    # Returns absolute path of the file to store the arc record.
    # This can just use a single disk, or cycle through multiple disks or
    # hash the prefix to get the disk
    pass
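
The comments in `get_path` list three options without choosing one. A minimal sketch of the third option, hashing the prefix to pick a disk, might look like this. The `DISKS` list and the `.arc.gz` file extension are assumptions for illustration. The real disk layout is not given in the source.

```python
import os

# Hypothetical mount points; replace with the actual storage directories.
DISKS = ["/disk1/liveweb", "/disk2/liveweb", "/disk3/liveweb"]


def get_path(prefix, disks=DISKS):
    """Pick a disk from the hex prefix and return an absolute file path."""
    # The prefix is an MD5 hex digest, so it can be read as an integer.
    index = int(prefix, 16) % len(disks)
    return os.path.join(disks[index], prefix + ".arc.gz")
```

Because the path depends only on the prefix, a repeat download of the same URL (for example after the memcache entry expires) would overwrite the earlier file. Whether that is acceptable depends on how the periodic upload is scheduled, which the source does not describe.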