@vinayak-mehta
Created September 22, 2018 11:54
A Python 2 script that extracts tables from a PDF file using pdf-table-extract and saves each table as a CSV file in the current working directory.
#!/usr/bin/env python
"""
Usage: python pdf_table_extract.py <filename>
"""
import os
import sys
import pandas as pd
import pdftableextract as pdf
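# Split the input file name into its base name and extension.
root, ext = os.path.splitext(os.path.basename(sys.argv[1]))
if ext.lower() != '.pdf':
    raise ValueError('This script works only with PDF files.')
# Pages to process, given as strings.
pages = ['1']
cells = [pdf.process_page(sys.argv[1], p) for p in pages]
# Flatten the per-page cell lists into a single list of cells.
cells = [cell for row in cells for cell in row]
tables = pdf.table_to_list(cells, pages)
# Skip the first entry of the list and write each remaining one to its own CSV file.
for i, table in enumerate(tables[1:]):
    df = pd.DataFrame(table)
    out = '{}-page-1-table-{}.csv'.format(root, i + 1)
    # quoting=1 is csv.QUOTE_ALL: every field is quoted.
    df.to_csv(out, index=False, quoting=1, encoding='utf-8')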
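
The file-naming and CSV-quoting steps of the script can be tried without a PDF or pdf-table-extract installed. The following is a minimal sketch for Python 3 using only the standard library; `output_name` and `rows_to_csv_text` are hypothetical helper names, not part of the gist or of pdf-table-extract. Unlike `DataFrame.to_csv`, this sketch does not write a header row of column indices.

```python
import csv
import io
import os


def output_name(pdf_path, page, table_index):
    """Build a CSV name of the form <root>-page-<page>-table-<n>.csv."""
    root, ext = os.path.splitext(os.path.basename(pdf_path))
    if ext.lower() != '.pdf':
        raise ValueError('This script works only with PDF files.')
    return '{}-page-{}-table-{}.csv'.format(root, page, table_index)


def rows_to_csv_text(rows):
    """Serialize rows with every field quoted (csv.QUOTE_ALL has the value 1)."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator='\n')
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == '__main__':
    # Hypothetical input path and table contents, for illustration only.
    print(output_name('/tmp/report.pdf', 1, 1))
    print(rows_to_csv_text([['name', 'qty'], ['bolt', '12']]))
```

Passing the page number as a parameter, rather than hard-coding `page-1` in the format string, keeps the file names correct if more pages are added to the `pages` list.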
@pidugusundeep

I don't see a pip module named 'pdftableextract'. I'm unable to install it with
pip install pdftableextract

@divya1md

Use pip install pdftabextract

@marmohamed

marmohamed commented Oct 6, 2019

pip install pdf-table-extract
