A Python 2 script that extracts tables from a PDF file using pdf-table-extract and saves each table as a CSV file in the current working directory.
#!/usr/bin/env python
"""
Usage: python pdf_table_extract.py <filename>
"""
import os
import sys
import pandas as pd
import pdftableextract as pdf
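
# Use the input file's base name (without extension) to name the output files.
root, ext = os.path.splitext(os.path.basename(sys.argv[1]))
if ext.lower() != '.pdf':
    raise ValueError('This script works only with PDF files.')

# Only page 1 is processed.
pages = ['1']
cells = [pdf.process_page(sys.argv[1], p) for p in pages]
# Flatten the per-page cell lists into a single list of cells.
cells = [cell for row in cells for cell in row]
tables = pdf.table_to_list(cells, pages)
# The first entry of the returned list is skipped.
for i, table in enumerate(tables[1:]):
    df = pd.DataFrame(table)
    out = '{}-page-1-table-{}.csv'.format(root, i + 1)
    # quoting=1 is csv.QUOTE_ALL: every field is wrapped in quotes.
    df.to_csv(out, index=False, quoting=1, encoding='utf-8')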
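
The CSV-writing step of the script can also be sketched without pandas, using only the standard library. This is a minimal illustration and not part of the original gist: the helper name `write_tables_as_csv` and its `out_dir` parameter are hypothetical, it assumes Python 3, and it expects `tables` to be a plain list of tables, each a list of rows. Unlike `DataFrame.to_csv`, it writes no header row.

```python
import csv
import os


def write_tables_as_csv(tables, root, out_dir="."):
    """Write each table (a list of rows) to its own fully quoted CSV file.

    Hypothetical helper for illustration; returns the paths it wrote.
    """
    written = []
    for i, table in enumerate(tables, start=1):
        out = os.path.join(out_dir, "{}-table-{}.csv".format(root, i))
        # newline="" lets the csv module control line endings (Python 3).
        with open(out, "w", newline="", encoding="utf-8") as handle:
            # csv.QUOTE_ALL is the named form of quoting=1 in the script.
            writer = csv.writer(handle, quoting=csv.QUOTE_ALL)
            writer.writerows(table)
        written.append(out)
    return written
```

Calling `write_tables_as_csv(tables[1:], root)` would mirror the loop in the script, under the assumptions stated above.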

@pidugusundeep commented Nov 30, 2018

I don't see a pip module named 'pdftableextract'. I'm unable to install it with
pip install pdftableextract

@divya1md commented Aug 13, 2019

Use pip install pdftabextract

@MariamMohamedFawzy commented Oct 6, 2019

pip install pdf-table-extract here
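
Since the comments above suggest different package names, it may help to check which module names actually resolve in the current environment before running the script. The following is a minimal sketch, not part of the original gist: `is_importable` is a hypothetical helper, it assumes Python 3.4 or later (the gist itself targets Python 2), and the names checked are simply the ones mentioned in this thread. Note that the script imports `pdftableextract`, so a package that installs under a different module name would not satisfy that import as written.

```python
import importlib.util


def is_importable(module_name):
    """Return True if `module_name` can be found on the current import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names taken from the script and the comments in this thread.
    for name in ("pdftableextract", "pdftabextract"):
        status = "found" if is_importable(name) else "not found"
        print("{}: {}".format(name, status))
```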