@yoavram
Last active September 12, 2022 12:09
A Python client for pdfx 1.0, a "Fully-automated PDF-to-XML conversion of scientific text" service (http://pdfx.cs.man.ac.uk/). Written for use in Markx, a science-oriented Markdown editor (https://github.com/yoavram/markx).
# pdfx usage: http://pdfx.cs.man.ac.uk/usage
# requests docs: http://docs.python-requests.org/en/latest/user/quickstart/
import requests  # get it from http://python-requests.org or do 'pip install requests'

url = "http://pdfx.cs.man.ac.uk"


def pypdfx(filename):
    '''
    filename is the name of a PDF file WITHOUT the extension.
    The function prints progress messages, including the HTTP status code,
    and writes the response body to <filename>.xml
    '''
    # Send the raw PDF bytes as the request body; passing files= would build a
    # multipart/form-data body that contradicts the application/pdf header.
    with open(filename + '.pdf', 'rb') as fin:
        print('Sending', filename, 'to', url)
        r = requests.post(url, data=fin, headers={'Content-Type': 'application/pdf'})
        print('Got status code', r.status_code)
    # r.content is bytes, so the output file is opened in binary mode
    with open(filename + '.xml', 'wb') as fout:
        fout.write(r.content)
    print('Written to', filename + '.xml')


if __name__ == '__main__':
    # self promotion - get the pdf file here: http://onlinelibrary.wiley.com/doi/10.1111/j.1558-5646.2012.01576.x/abstract
    filename = 'Ram and Hadany 2012'
    pypdfx(filename)
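
pypdfx expects the file name without its extension, which is easy to get wrong when the name comes from a file dialog or a directory listing. The sketch below shows one way a caller might normalize a name first. The helper pdf_and_xml_paths is hypothetical and not part of the original gist; it only uses os.path.splitext from the standard library.

```python
import os


def pdf_and_xml_paths(name):
    """Return (pdf_path, xml_path) for a name given with or without '.pdf'.

    Hypothetical helper, not part of the original gist.
    """
    base, ext = os.path.splitext(name)
    if ext.lower() == '.pdf':
        # The name already carries the extension: keep it untouched
        # and derive the XML path from the part before it.
        return name, base + '.xml'
    # No PDF extension: treat the whole name as the base,
    # even if it contains dots (e.g. 'v1.2').
    return name + '.pdf', name + '.xml'


if __name__ == '__main__':
    pdf_path, xml_path = pdf_and_xml_paths('Ram and Hadany 2012.pdf')
    print(pdf_path, '->', xml_path)
```

With such a helper, a caller could strip the '.pdf' suffix from pdf_path before handing the base name to pypdfx, or the function could be adapted to accept both paths directly.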