@mudssrali
Created November 23, 2022 04:13
Python script to convert Excel tables from a PDF file to CSV
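import pandas as pd
from tabula import read_pdf

# Specify file name
FILE_NAME = "sample.pdf"

# Total pages in the PDF
TOTAL_PAGES = 2

# Read the first page. read_pdf returns a list of DataFrames, one per detected table.
final_frame = read_pdf(FILE_NAME, pages="1")[0]

# Read the remaining pages (2..TOTAL_PAGES) and append the first table on each.
for page in range(2, TOTAL_PAGES + 1):
    data = read_pdf(FILE_NAME, pages=page)[0]
    # Reuse the first page's column names so the frames line up when concatenated.
    # Note: tabula treats the first row of each page as a header, so this assumes
    # every page repeats the header row.
    data.columns = final_frame.columns
    final_frame = pd.concat([final_frame, data], ignore_index=True)
    print("Page", page, "Size", len(final_frame))

# Write final frame (records) to CSV
final_frame.to_csv("output.csv")

# See the output
print(final_frame)

# Log the records length
print("Total Size (in Rows): ", len(final_frame))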
@mudssrali
Author

To run the script, install the following packages:

pandas
tabula-py (wraps tabula-java, so a Java runtime must also be installed)

Then, run

python pdf-to-csv.py
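
As a possible variant of the script, tabula-py can be asked for every page in one call with pages="all", which avoids hard-coding TOTAL_PAGES. This is only a sketch and not the author's method: combine_tables and pdf_to_csv are hypothetical helper names, and it assumes every extracted table has the same number of columns as the first one. As in the script, the first row of each table is treated as its header.

```python
import pandas as pd


def combine_tables(frames):
    """Stack tables that share a layout, using the first table's column names.

    Hypothetical helper: assumes every frame has the same number of columns.
    """
    if not frames:
        return pd.DataFrame()
    columns = frames[0].columns
    aligned = []
    for frame in frames:
        if len(frame.columns) != len(columns):
            raise ValueError("table layout differs from the first table")
        renamed = frame.copy()
        renamed.columns = columns
        aligned.append(renamed)
    return pd.concat(aligned, ignore_index=True)


def pdf_to_csv(pdf_path, csv_path):
    """Hypothetical wrapper: extract all tables from pdf_path and write one CSV."""
    # Imported here so combine_tables can be used without tabula-py or Java.
    from tabula import read_pdf

    # pages="all" returns a list with one DataFrame per detected table.
    frames = read_pdf(pdf_path, pages="all")
    combined = combine_tables(frames)
    # index=False leaves the pandas row index out of the CSV.
    combined.to_csv(csv_path, index=False)
    return combined
```

Calling pdf_to_csv("sample.pdf", "output.csv") would then stand in for the page loop. The column-count check is there so that a page with a different table layout raises an error instead of being merged silently.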
