@largocreatura
Created January 26, 2022 14:02
Bash script that batch-converts tables embedded in PDFs into CSV files using the Python module table-ocr
#!/usr/bin/env bash
# Requires bash 4.4+ (arrays, mapfile -d ''); file lists are read
# NUL-delimited so paths containing spaces survive intact.
# Convert each PDF to separate images per page
for pdf in ./*.pdf; do
    python3 -m table_ocr.pdf_to_images "$pdf"
done
# Extract tables from each page in .png output
mapfile -d '' -t images < <(find . -name "*.png" -print0)
for image in "${images[@]}"; do
    python3 -m table_ocr.extract_tables "$image"
done
# Extract cells from each table in .png outputs
mapfile -d '' -t tables < <(find . -name "table-*" -print0)
for table in "${tables[@]}"; do
    python3 -m table_ocr.extract_cells "$table"
done
# Apply OCR to each image of each cell, output as .txt
mapfile -d '' -t cells < <(find . -name "0*-*.png" -print0)
for cell in "${cells[@]}"; do
    python3 -m table_ocr.ocr_image "$cell"
done
# Build one CSV per folder from the OCR'd .txt files of each PDF analysed.
# CSVs are numbered in the order find returns the folders (0.csv, 1.csv, ...).
length=0
mapfile -d '' -t folders < <(find . -type d -iname "*-*" -print0)
for folder in "${folders[@]}"; do
    mapfile -d '' -t texts < <(find "$folder/cells/" -name "*.gt.txt" -print0)
    python3 -m table_ocr.ocr_to_csv "${texts[@]}" > "$length".csv
    ((++length))
done
@largocreatura
Author

Hi @eihli, I wrote this bash script using table-ocr. It runs in batch mode and exports each table it extracts from the PDFs to its own CSV file. I would like to improve the naming of the output CSV files so it is clear which PDF each one came from. I hope to be able to integrate that soon. Thanks for such a great tool!
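
For the naming, one possible direction is to derive each CSV name from the folder path instead of a counter. This is only a rough sketch: `csv_name_for` is a hypothetical helper of mine, not part of table-ocr, and the folder layout in the example is an assumption about what the earlier steps produce.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of table-ocr): turn a folder path into a
# CSV file name, so the output shows which PDF/page/table it came from.
csv_name_for() {
    local folder=${1#./}                  # drop a leading "./"
    folder=${folder%/}                    # drop a trailing "/"
    printf '%s.csv\n' "${folder//\//_}"   # replace remaining "/" with "_"
}

# Example with an assumed folder layout:
csv_name_for ./report-000/table-001       # prints report-000_table-001.csv
```

In the last loop of the script, the redirect could then target `"$(csv_name_for "$i")"` instead of `"$lenght".csv`.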
