Last active August 15, 2022 04:42
Extract text contents of PDF files recursively
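from tika import parser
import os


def extract_text_from_pdfs_recursively(dir):
    # Walk the tree under `dir` and write a .txt file next to every PDF found.
    for root, dirs, files in os.walk(dir):
        for file in files:
            path_to_pdf = os.path.join(root, file)
            [stem, ext] = os.path.splitext(path_to_pdf)
            if ext.lower() == '.pdf':
                print("Processing " + path_to_pdf)
                # Tika returns a dict; the extracted text is under 'content'.
                pdf_contents = parser.from_file(path_to_pdf)
                path_to_txt = stem + '.txt'
                # Explicit UTF-8 avoids UnicodeEncodeError on some platforms.
                with open(path_to_txt, 'w', encoding='utf-8') as txt_file:
                    print("Writing contents to " + path_to_txt)
                    # 'content' may be None when no text could be extracted.
                    txt_file.write(pdf_contents['content'] or '')


if __name__ == "__main__":
    # Assumed entry point: process the current working directory.
    extract_text_from_pdfs_recursively(os.getcwd())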

Thank you so much!!!

Thank you.
It really helped me.

Thank you, this was very helpful.
I ran into a UnicodeEncodeError, but could resolve it by specifying the encoding:
with open(path_to_txt, 'w', encoding="utf-8") as txt_file:
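
For anyone hitting the same error, here is a minimal sketch of why the explicit encoding helps. Without it, `open` falls back to the platform's default encoding, which may not be able to represent every character Tika extracts. The helper name `write_text_utf8` and the sample string are made up for illustration and are not part of the gist.

```python
import os
import tempfile


def write_text_utf8(path, text):
    # Passing encoding="utf-8" makes the result independent of the
    # platform default (which can be a legacy code page on some systems).
    with open(path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)


# Sample text with characters outside ASCII and outside most legacy code pages.
sample = "na\u00efve caf\u00e9 \u2013 \u65e5\u672c\u8a9e"

with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "sample.txt")
    write_text_utf8(out_path, sample)
    # Read back with the same encoding to confirm nothing was lost.
    with open(out_path, encoding="utf-8") as f:
        round_trip = f.read()
```

The same applies when reading the `.txt` files back later: open them with `encoding="utf-8"` as well.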

How can I append the text from all the PDFs to a CSV?
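
One possible approach, as a rough sketch rather than a tested solution: walk the directory the same way the gist does, but write one row per PDF with the standard `csv` module instead of one `.txt` file per PDF. The function name `append_pdf_texts_to_csv`, the `path`/`text` column names, and the `extract_text` callback are assumptions made for this example. The callback keeps the sketch runnable without a Tika server; `tika_extract` shows how the gist's `parser.from_file` call could be plugged in.

```python
import csv
import os


def append_pdf_texts_to_csv(root_dir, csv_path, extract_text):
    """Append one (path, text) row per PDF found under root_dir.

    extract_text is any callable that takes a PDF path and returns its text.
    """
    # Only write the header when the CSV does not exist yet or is empty.
    is_new = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    rows_written = 0
    # newline="" is what the csv module expects; mode "a" appends.
    with open(csv_path, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        if is_new:
            writer.writerow(["path", "text"])
        for root, _dirs, files in os.walk(root_dir):
            for name in sorted(files):
                if os.path.splitext(name)[1].lower() != ".pdf":
                    continue
                path_to_pdf = os.path.join(root, name)
                # Fall back to an empty string if nothing was extracted.
                text = extract_text(path_to_pdf) or ""
                writer.writerow([path_to_pdf, text])
                rows_written += 1
    return rows_written


def tika_extract(path_to_pdf):
    # Imported lazily so the function above works without Tika installed.
    from tika import parser
    return parser.from_file(path_to_pdf).get("content")
```

Calling `append_pdf_texts_to_csv(os.getcwd(), "pdf_texts.csv", tika_extract)` would then add a row for each PDF under the current directory. The `csv` writer quotes embedded newlines and commas, so multi-page text stays in a single cell.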