Extract TOC information from a PDF file using pdfminer
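#!/usr/bin/env python
# parse_toc.py
# Targets the older pdfminer API, where PDFDocument lives in pdfminer.pdfparser.
import sys

from pdfminer.pdfparser import PDFParser, PDFDocument


def parse(filename, maxlevel):
    """Print outline (TOC) titles down to maxlevel, indented by level."""
    with open(filename, 'rb') as fp:
        parser = PDFParser(fp)
        doc = PDFDocument()
        # Link parser and document in both directions (required by this API).
        parser.set_document(doc)
        doc.set_parser(parser)
        # Each outline entry is a (level, title, dest, action, se) tuple.
        for (level, title, dest, a, se) in doc.get_outlines():
            if level <= maxlevel:
                print('%s %s' % (' ' * level, title))


if __name__ == '__main__':
    if len(sys.argv) != 3:
        print('Usage: %s xxx.pdf level' % sys.argv[0])
        sys.exit(2)
    parse(sys.argv[1], int(sys.argv[2]))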
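
The script above relies on the older pdfminer layout. As a rough sketch only, the same idea might look like this on newer releases (PDFMiner 20140328 or pdfminer.six), assuming PDFDocument is imported from pdfminer.pdfdocument and takes the parser in its constructor. The names format_outline and read_outlines are hypothetical helpers introduced here for illustration, not part of the original gist or of pdfminer.

```python
def format_outline(outlines, maxlevel):
    """Return indented title lines for entries whose level is <= maxlevel.

    `outlines` is any iterable of (level, title, dest, action, se) tuples,
    the shape yielded by PDFDocument.get_outlines().
    """
    lines = []
    for level, title, dest, action, se in outlines:
        if level <= maxlevel:
            # Same layout as the original script: indent, a space, the title.
            lines.append('%s %s' % (' ' * level, title))
    return lines


def read_outlines(filename):
    """Read all outline entries from a PDF (hypothetical helper)."""
    # Imported lazily so format_outline() works without pdfminer installed.
    # Assumes the newer module layout (pdfminer.pdfdocument).
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines

    with open(filename, 'rb') as fp:
        doc = PDFDocument(PDFParser(fp))
        try:
            # Materialize inside the with-block, while the file is still open.
            return list(doc.get_outlines())
        except PDFNoOutlines:
            # The document has no outline (TOC) at all.
            return []
```

With pdfminer installed, something like print('\n'.join(format_outline(read_outlines('xxx.pdf'), 2))) should reproduce the behaviour of parse('xxx.pdf', 2). Keeping the formatting separate from the file access makes it possible to check the filtering logic without a PDF at hand.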
tilusnet commented May 16, 2014

Hi sakti,

I adapted your gist to PDFMiner 20140328 here:
https://gist.github.com/tilusnet/407cd845a6b1cb939b34

Feel free to merge back, cheers!
