@nadya-p
Last active August 15, 2022 04:42
Extract text contents of PDF files recursively
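from tika import parser
import os


def extract_text_from_pdfs_recursively(dir):
    # Walk the directory tree and write a .txt file next to every PDF found.
    for root, dirs, files in os.walk(dir):
        for file in files:
            path_to_pdf = os.path.join(root, file)
            [stem, ext] = os.path.splitext(path_to_pdf)
            # Compare case-insensitively so that files ending in .PDF are not skipped.
            if ext.lower() == '.pdf':
                print("Processing " + path_to_pdf)
                pdf_contents = parser.from_file(path_to_pdf)
                path_to_txt = stem + '.txt'
                # Explicit UTF-8 avoids a UnicodeEncodeError where the default encoding is narrower.
                with open(path_to_txt, 'w', encoding="utf-8") as txt_file:
                    print("Writing contents to " + path_to_txt)
                    # 'content' may be None when no text could be extracted.
                    txt_file.write(pdf_contents['content'] or '')


if __name__ == "__main__":
    extract_text_from_pdfs_recursively(os.getcwd())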
@yuripiffer
Thank you so much!!!

@adindarizky99
Thank you.
It really helped me.

@KTBL-JaschaJung
Thank you, this was very helpful.
I ran into a UnicodeEncodeError, but could resolve it by specifying the encoding:
with open(path_to_txt, 'w', encoding="utf-8") as txt_file:
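
The effect of that encoding argument can be checked without Tika. The following is only a minimal sketch: `write_text_utf8`, `read_text_utf8`, and the sample string are hypothetical stand-ins for the gist's write step and for text extracted from a PDF.

```python
import os
import tempfile


def write_text_utf8(path_to_txt, text):
    # Explicit UTF-8 so the result does not depend on the platform's default encoding.
    with open(path_to_txt, 'w', encoding="utf-8") as txt_file:
        txt_file.write(text)


def read_text_utf8(path_to_txt):
    # Read back with the same encoding that was used for writing.
    with open(path_to_txt, 'r', encoding="utf-8") as txt_file:
        return txt_file.read()


# Hypothetical sample: stands in for non-ASCII text extracted from a PDF.
sample = "Привет, naïve café ✓"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
write_text_utf8(demo_path, sample)
round_tripped = read_text_utf8(demo_path)
```

Without the `encoding` argument, `open` falls back to the platform's preferred encoding, which is likely why the error shows up on some machines and not on others.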

@systemsGit
How can I append the text from all
the PDFs to a CSV?
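
One possible approach, sketched under assumptions rather than taken from the gist: reuse the same `os.walk` loop, but write one CSV row per PDF with the standard `csv` module. The names `append_pdf_texts_to_csv` and `tika_extract` and the `path`/`text` column layout are hypothetical. The `extract` parameter exists only so the walk can be tried without Tika installed.

```python
import csv
import os


def tika_extract(path_to_pdf):
    # Lazy import so the rest of the sketch runs without tika installed.
    from tika import parser
    # 'content' may be None when no text could be extracted.
    return parser.from_file(path_to_pdf)['content'] or ''


def append_pdf_texts_to_csv(dir, path_to_csv, extract=tika_extract):
    # Hypothetical helper: appends one row (path, text) per PDF found under dir.
    write_header = not os.path.exists(path_to_csv)
    count = 0
    # newline='' lets the csv module handle line endings inside quoted fields.
    with open(path_to_csv, 'a', newline='', encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        if write_header:
            writer.writerow(['path', 'text'])
        for root, dirs, files in os.walk(dir):
            for file in sorted(files):
                if os.path.splitext(file)[1].lower() == '.pdf':
                    path_to_pdf = os.path.join(root, file)
                    writer.writerow([path_to_pdf, extract(path_to_pdf)])
                    count += 1
    return count
```

Calling `append_pdf_texts_to_csv(os.getcwd(), 'pdf_texts.csv')` would mirror the gist's entry point. The file is opened in append mode, so running it twice adds the same rows again.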