@vinayak-mehta
Last active September 22, 2018 11:30
A Python 2 script that extracts tables from a PDF file using pdftables and saves each table as a CSV file in the current working directory.
#!/usr/bin/env python
"""
Usage: python pdftables_extract.py <filename>
"""
import os
import sys
import pandas as pd
from pdftables.pdf_document import PDFDocument
from pdftables.pdftables import page_to_tables
# Require exactly one argument: the path to the PDF file.
if len(sys.argv) != 2:
    sys.exit(__doc__)

root, ext = os.path.splitext(os.path.basename(sys.argv[1]))
if ext.lower() != '.pdf':
    raise ValueError('This script works only with PDF files.')

doc = PDFDocument.from_path(sys.argv[1])
for page_number, page in enumerate(doc.get_pages(), start=1):
    tables = page_to_tables(page)
    for i, table in enumerate(tables, start=1):
        df = pd.DataFrame(table.data)
        out = '{}-page-{}-table-{}.csv'.format(root, page_number, i)
        # quoting=1 is csv.QUOTE_ALL: every field is wrapped in quotes.
        df.to_csv(out, index=False, quoting=1, encoding='utf-8')
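
The output naming and the effect of `quoting=1` can be checked without pdftables or a PDF. The following is a minimal sketch for Python 3, assuming a table is just a list of rows and using the standard-library `csv` module in place of pandas; `output_name` and `rows_to_csv_text` are hypothetical helper names, not part of the script or of pdftables. Unlike `DataFrame.to_csv`, it writes no header row of column labels.

```python
import csv
import io


def output_name(root, page_number, table_number):
    """Build a filename in the script's pattern (1-based page and table numbers)."""
    return '{}-page-{}-table-{}.csv'.format(root, page_number, table_number)


def rows_to_csv_text(rows):
    """Serialize rows with every field quoted, as quoting=1 (csv.QUOTE_ALL) does."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerows(rows)
    return buf.getvalue()


# Made-up sample data standing in for one extracted table.
name = output_name('report', 1, 2)
text = rows_to_csv_text([['item', 'qty'], ['apple', 3]])
print(name)
print(text)
```

Quoting every field keeps cells that contain commas or line breaks intact, at the cost of numbers being read back as quoted strings by tools that do not infer types.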