Created March 20, 2014 18:40
Content-Disposition for CiteSeerX
import SimpleHTTPServer
import SocketServer
import BaseHTTPServer
import re
import os
import os.path
import shutil
__doc__ = """
<pre>
This server reproduces a small problem with how PDFs are displayed and downloaded
from <a href="http://citeseerx.ist.psu.edu/index">CiteSeerX</a>. When a user saves a document, the filename is nearly always "download.pdf".

Setting the HTTP header

    Content-Disposition: inline; filename="10.1.1.16.2427.pdf"

allows the content to be displayed inline and saved with the correct filename.

For example,

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.2427

links to a PDF at

    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf

and the server currently supplies these headers on download:

    HTTP/1.1 200 OK
    Server: Apache-Coyote/1.1
    Content-Type: application/pdf
    Content-Length: 213454
    Date: Thu, 20 Mar 2014 18:21:58 GMT

The PDF displays inline in the current browsers as of 20 March 2014:

    Google Chrome 35.0.1897.3 dev
    Firefox Aurora 29.0a2
    Safari 7.0.2

When the user saves the document from the inline display, all of the above browsers fill
the file dialog with the filename "download.pdf", which forces the user to rename
the file manually or accumulate many download.pdf, download (1).pdf files on their machine. It would
be nicer if the file were saved under its DOI.

This server reproduces the bug and supplies a fix.

1) Download a PDF from CiteSeerX into the same directory as CiteSeerDownloadRepro.py

    wget -O 10.1.1.16.2427.pdf 'http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf'

2) Launch the server

    python CiteSeerDownloadRepro.py

3) Reproduce the existing download behavior

    <a href="http://localhost:4000/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf">http://localhost:4000/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf</a>

    The document views inline, but saves as download.pdf

4) Add the Content-Disposition header by appending FIXBUG to the URL

    <a href="http://localhost:4000/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf&FIXBUG">http://localhost:4000/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf&FIXBUG</a>

    The document still views inline, and now saves as 10.1.1.16.2427.pdf,
    with the repro server supplying the following headers:

    HTTP/1.0 200 OK
    Server: CiteSeerBugRepro/0.1.1 Python/2.7.6
    Date: Thu, 20 Mar 2014 18:26:54 GMT
    Content-Type: application/pdf
    Content-Length: 213454
    Content-Disposition: inline; filename="10.1.1.16.2427.pdf"

And the world rejoices.
</pre>
"""

PORT = 4000
__version__ = "0.1.1"


class ReuseTCPServer(SocketServer.TCPServer):
    # Allow quick restarts without waiting for the old socket to time out.
    allow_reuse_address = True


class CiteSeerHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    server_version = "CiteSeerBugRepro/" + __version__

    def send_header(self, keyword, value):
        # Echo each outgoing header to the console before sending it.
        print keyword, value
        BaseHTTPServer.BaseHTTPRequestHandler.send_header(self, keyword, value)

    def do_GET(self):
        print self.path
        match_doi = re.search(r"doi=(\d+)\.(\d+)\.(\d+)\.(\d+)\.(\d+)", self.path)
        if match_doi:
            doi = '.'.join(match_doi.groups())
            print doi
            pdf_path = doi + ".pdf"
            if os.path.exists(pdf_path):
                self.send_response(200)
                self.send_header("Content-Type", "application/pdf")
                f = open(pdf_path, 'rb')
                try:
                    fs = os.fstat(f.fileno())
                    self.send_header("Content-Length", str(fs.st_size))
                    if 'FIXBUG' in self.path:
                        # The fix: keep inline display, but suggest the DOI-based filename.
                        self.send_header("Content-Disposition", 'inline; filename="%s"' % (pdf_path,))
                    self.end_headers()
                    shutil.copyfileobj(f, self.wfile)
                finally:
                    f.close()
            else:
                self.send_response(404)
                self.end_headers()
                self.wfile.write("No pdf file found!")
        else:
            # No DOI in the request: serve the instructions page.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(__doc__)


httpd = ReuseTCPServer(("", PORT), CiteSeerHandler)
print "Open this URL and follow the instructions:"
print "http://localhost:" + str(PORT)
httpd.serve_forever()
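
The script above targets Python 2 (`BaseHTTPServer`, `print` statements). For readers on Python 3, the same idea can be sketched roughly as follows. This is not part of the original gist: the names `CiteSeerHandler3` and `make_server` are hypothetical, and unlike the original, this sketch answers 404 instead of serving the instructions page when the request contains no DOI.

```python
import os
import re
import shutil
from http.server import BaseHTTPRequestHandler, HTTPServer

# Same five-part identifier the gist matches, e.g. doi=10.1.1.16.2427
DOI_RE = re.compile(r"doi=(\d+(?:\.\d+){4})")


class CiteSeerHandler3(BaseHTTPRequestHandler):
    """Hypothetical Python 3 counterpart of the gist's CiteSeerHandler."""

    def do_GET(self):
        match = DOI_RE.search(self.path)
        pdf_path = match.group(1) + ".pdf" if match else None
        if pdf_path is None or not os.path.exists(pdf_path):
            body = b"No pdf file found!"
            self.send_response(404)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return
        with open(pdf_path, "rb") as f:
            self.send_response(200)
            self.send_header("Content-Type", "application/pdf")
            self.send_header("Content-Length", str(os.fstat(f.fileno()).st_size))
            if "FIXBUG" in self.path:
                # The fix: keep inline display, but suggest the DOI-based filename.
                self.send_header("Content-Disposition",
                                 'inline; filename="%s"' % pdf_path)
            self.end_headers()
            shutil.copyfileobj(f, self.wfile)

    def log_message(self, format, *args):
        # Silence the default per-request logging to stderr.
        pass


def make_server(port=4000, host=""):
    """Build (but do not start) a server that looks for PDFs in the current directory."""
    return HTTPServer((host, port), CiteSeerHandler3)
```

Calling `make_server().serve_forever()` from the directory holding `10.1.1.16.2427.pdf` should then behave like steps 3 and 4 of the instructions, on the same port 4000.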
This issue has been fixed, following an earlier mail I sent: SeerLabs/CiteSeerX@3555ff3
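
To see why the header matters, here is a rough Python 3 sketch of how a client might choose a save-as name. It is an assumption about browser behavior, not something the gist or any browser documents: when no `Content-Disposition` filename is present, the sketch falls back to the last URL path segment plus an extension guessed from the content type, which for `/viewdoc/download` gives the `download.pdf` the gist complains about. The names `filename_from_disposition`, `suggested_filename`, and `EXTENSIONS` are hypothetical.

```python
import posixpath
from email.message import Message
from urllib.parse import urlsplit

# Illustrative mapping only; real clients consult far larger tables.
EXTENSIONS = {"application/pdf": ".pdf"}


def filename_from_disposition(value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename()


def suggested_filename(url, content_type, disposition=None):
    """Prefer the header's filename; otherwise fall back to the URL path."""
    name = filename_from_disposition(disposition) if disposition else None
    if name:
        return name
    stem = posixpath.basename(urlsplit(url).path)
    if posixpath.splitext(stem)[1]:
        return stem  # the path segment already carries an extension
    return stem + EXTENSIONS.get(content_type, "")


url = ("http://citeseerx.ist.psu.edu/viewdoc/download"
       "?doi=10.1.1.16.2427&rep=rep1&type=pdf")
print(suggested_filename(url, "application/pdf"))
print(suggested_filename(url, "application/pdf",
                         'inline; filename="10.1.1.16.2427.pdf"'))
```

Under these assumptions the first call falls back to the path segment and the second uses the name from the header, mirroring steps 3 and 4 of the repro.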