Created
January 26, 2022 14:02
largocreatura/7304bb20e3cafaf2061064f375a3692f
Bash script that batch-converts tables embedded in PDFs into CSV files using the Python module table-ocr
#!/bin/bash
# Convert each PDF to separate images per page
for pdf in ./*.pdf; do
    python3 -m table_ocr.pdf_to_images "$pdf"
done
# Extract tables from each page in .png output
mapfile -d '' -t images < <(find . -name "*.png" -print0)
for image in "${images[@]}"; do
    python3 -m table_ocr.extract_tables "$image"
done
# Extract cells from each table in .png outputs
mapfile -d '' -t tables < <(find . -name "table-*" -print0)
for table in "${tables[@]}"; do
    python3 -m table_ocr.extract_cells "$table"
done
# Apply OCR to each image of each cell, output as .txt
mapfile -d '' -t cells < <(find . -name "0*-*.png" -print0)
for cell in "${cells[@]}"; do
    python3 -m table_ocr.ocr_image "$cell"
done
# Build one CSV per page folder from the .gt.txt files of its cells.
# Arrays require bash (not plain sh); mapfile -d requires bash 4.4 or later.
mapfile -d '' -t folders < <(find . -type d -iname "*-*" -print0)
length=0
for folder in "${folders[@]}"; do
    mapfile -d '' -t texts < <(find "$folder/cells/" -name "*.gt.txt" -print0)
    python3 -m table_ocr.ocr_to_csv "${texts[@]}" > "$length".csv
    ((++length))
done
Hi @eihli, I wrote this bash script using table-ocr. It batch-exports the tables it extracts from the PDFs to individual CSV files. I would like to improve the naming of the output CSV files so that each one shows which PDF it came from. I hope to be able to integrate it soon. Thanks for such a great tool!
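
One possible way to approach the naming, as a minimal sketch rather than a tested change: name each CSV after the folder it was built from instead of a running counter. This assumes the page folders found by the script keep the PDF's base name as a prefix (for example `report-000` for `report.pdf`), which I have not verified for every input. `csv_name_for` is a hypothetical helper, not part of table-ocr.

```bash
#!/bin/bash
# Hypothetical helper (not part of table-ocr): derive a CSV file name
# from a page folder path, e.g. ./report-000 -> report-000.csv
csv_name_for() {
    local folder="${1%/}"                # drop a trailing slash, if any
    printf '%s.csv\n' "${folder##*/}"    # keep only the last path component
}

# Possible use in the final loop, in place of the numeric counter
# ($folder is assumed to hold the page folder for the current iteration):
#   python3 -m table_ocr.ocr_to_csv ... > "$(csv_name_for "$folder")"
```

Because only the last path component is used, two folders with the same name in different directories would overwrite each other's CSV. With all PDFs in the current directory, as in the script, that should not happen.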