@zchen24
Created October 23, 2016 02:34
Script that extracts the titles of all PDF files in a directory (e.g. the papers from a conference) and writes them to a text file, which can then be used to generate a word cloud.
#!/usr/bin/env python
# Collect the /Title metadata of every PDF in the current dir into one text file.
import glob

from pdfrw import PdfReader

allpdf = glob.glob('./*.pdf')  # assuming all PDF files in current dir
with open('alltitles.txt', 'w') as fobj:  # output file
    for fname in allpdf:
        ipdf = PdfReader(fname)
        # a PDF may have no Info dictionary or no /Title entry
        title = ipdf.Info.get('/Title') if ipdf.Info else None
        print('file = ' + fname)
        print('title = ' + str(title))
        if title:
            fobj.write(str(title)[1:-1] + ' ')  # in my case title has "()"
        # import ipdb; ipdb.set_trace()
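
The description mentions using alltitles.txt to generate a word cloud. As a rough sketch of that next step, the snippet below counts word frequencies in the collected titles using only the standard library. The function name `word_frequencies` and the `STOPWORDS` set are assumptions made for illustration and are not part of the original script; a real word cloud tool would take these counts (or the raw text) as its input.

```python
import re
from collections import Counter

# Hypothetical stopword list; extend it to suit your own set of titles.
STOPWORDS = {'a', 'an', 'and', 'for', 'in', 'of', 'on', 'the', 'to', 'with'}


def word_frequencies(text, stopwords=STOPWORDS):
    """Count lowercase alphabetic words in text, skipping stopwords."""
    words = re.findall(r'[a-z]+', text.lower())
    return Counter(w for w in words if w not in stopwords)


if __name__ == '__main__':
    # Read the file written by the script above, if it exists.
    try:
        with open('alltitles.txt') as fobj:
            text = fobj.read()
    except FileNotFoundError:
        text = ''
    # Print the 20 most frequent words with their counts.
    for word, count in word_frequencies(text).most_common(20):
        print(word, count)
```

Counting first makes it easy to check that the extracted titles look sensible (for example, that no stray metadata dominates) before rendering anything.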