@seanjensengrey
Created March 20, 2014 18:40
Content-Disposition for CiteSeerX
import SimpleHTTPServer
import SocketServer
import BaseHTTPServer
import re
import os
import os.path
import shutil
__doc__ = """
<pre>
This server reproduces a small problem with how pdfs are displayed and downloaded
from <a href="http://citeseerx.ist.psu.edu/index">citeseer</a>. When a user saves a document the filename is nearly always "download.pdf".
Setting the HTTP header
Content-Disposition: inline; filename="10.1.1.16.2427.pdf"
allows the content to be displayed inline and saved with the correct filename.
For example
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.2427
links to a pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf
with the server supplying the current headers on download
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/pdf
Content-Length: 213454
Date: Thu, 20 Mar 2014 18:21:58 GMT
which displays inline with current dev browsers as of 20 March 2014
Google Chrome 35.0.1897.3 dev
Firefox Aurora 29.0a2
Safari 7.0.2
When the user saves the document from the inline display, all of the above browsers prefill
the file dialog with the filename "download.pdf", which forces the user to rename the file
manually or accumulate download.pdf, download (1).pdf, and so on, on their machine. It would
be nice if the file were saved under its DOI.
This server reproduces the bug and supplies a fix.
1) Download a pdf from citeseer into the same directory as CiteSeerDownloadRepro.py
wget -O 10.1.1.16.2427.pdf 'http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf'
2) Launch server
python CiteSeerDownloadRepro.py
3) Repro existing download behavior
<a href="http://localhost:4000/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf">http://localhost:4000/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf</a>
The document views inline, but saves as download.pdf
4) Add the Content-Disposition header by appending FIXBUG to the URL
<a href="http://localhost:4000/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf&FIXBUG">http://localhost:4000/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf&FIXBUG</a>
The document views inline, and now saves as 10.1.1.16.2427.pdf
with the repro server supplying the following headers
HTTP/1.0 200 OK
Server: CiteSeerBugRepro/0.1.1 Python/2.7.6
Date: Thu, 20 Mar 2014 18:26:54 GMT
Content-Type: application/pdf
Content-Length: 213454
Content-Disposition: inline; filename="10.1.1.16.2427.pdf"
And the world rejoices.
</pre>
"""
PORT = 4000
__version__ = "0.1.1"


class ReuseTCPServer(SocketServer.TCPServer):
    # Allow the port to be rebound immediately after a restart.
    allow_reuse_address = True


class CiteSeerHandler(BaseHTTPServer.BaseHTTPRequestHandler):
    server_version = "CiteSeerBugRepro/" + __version__

    def send_header(self, keyword, value):
        # Log every header to the console before sending it.
        print keyword, value
        BaseHTTPServer.BaseHTTPRequestHandler.send_header(self, keyword, value)

    def do_GET(self):
        print self.path
        # Look for a five-part DOI such as doi=10.1.1.16.2427 in the request path.
        match_doi = re.search(r"doi=(\d+)\.(\d+)\.(\d+)\.(\d+)\.(\d+)", self.path)
        if match_doi:
            doi = '.'.join(match_doi.groups())
            print doi
            pdf_path = doi + ".pdf"
            if os.path.exists(pdf_path):
                self.send_response(200)
                self.send_header("Content-Type", "application/pdf")
                f = open(pdf_path, 'rb')
                fs = os.fstat(f.fileno())
                self.send_header("Content-Length", str(fs.st_size))
                # The fix: only sent when FIXBUG appears in the URL.
                if 'FIXBUG' in self.path:
                    self.send_header("Content-Disposition", r'''inline; filename="%s"''' % (pdf_path,))
                self.end_headers()
                shutil.copyfileobj(f, self.wfile)
                f.close()
            else:
                self.send_response(404)
                self.end_headers()
                self.wfile.write("No pdf file found!")
        else:
            # No DOI in the URL: serve the instructions from the module docstring.
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(__doc__)


httpd = ReuseTCPServer(("", PORT), CiteSeerHandler)
print "Open this URL and follow the instructions:"
print "http://localhost:" + str(PORT)
httpd.serve_forever()
@seanjensengrey
Author

This issue has been fixed from an earlier mail I sent, SeerLabs/CiteSeerX@3555ff3
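
The script above targets Python 2.7. For readers who want to try the header logic without running a server, here is a minimal sketch that isolates the two steps the handler performs: pulling the DOI out of the request path, and deciding which headers to send. The names `DOI_RE`, `doi_from_path`, and `pdf_headers` are hypothetical helpers introduced for illustration; they are not part of the gist or of CiteSeerX. The sketch assumes the same five-part DOI shape and the same `FIXBUG` toggle as the repro server.

```python
import re

# Same five-part DOI shape the repro server looks for, e.g. doi=10.1.1.16.2427
DOI_RE = re.compile(r"doi=(\d+(?:\.\d+){4})")


def doi_from_path(path):
    """Return the DOI found in a request path, or None if there is none."""
    match = DOI_RE.search(path)
    return match.group(1) if match else None


def pdf_headers(path, content_length):
    """Build the PDF response headers as a list of (name, value) pairs.

    Content-Disposition is added only when FIXBUG appears in the path,
    mirroring the toggle used by the repro server. Returns an empty list
    when the path carries no DOI.
    """
    doi = doi_from_path(path)
    if doi is None:
        return []
    headers = [
        ("Content-Type", "application/pdf"),
        ("Content-Length", str(content_length)),
    ]
    if "FIXBUG" in path:
        headers.append(("Content-Disposition", 'inline; filename="%s.pdf"' % doi))
    return headers


if __name__ == "__main__":
    url = "/viewdoc/download?doi=10.1.1.16.2427&rep=rep1&type=pdf"
    for name, value in pdf_headers(url + "&FIXBUG", 213454):
        print("%s: %s" % (name, value))
```

Keeping the header decision in a plain function makes the difference between the two URLs easy to inspect: the only change `FIXBUG` introduces is the extra `Content-Disposition` pair.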
