nadya-p / pdf_to_text.py
Last active August 15, 2022 04:42
Extract text contents of PDF files recursively
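from tika import parser
import os


def extract_text_from_pdfs_recursively(dir):
    # Walk the directory tree rooted at `dir`, visiting every file.
    for root, dirs, files in os.walk(dir):
        for file in files:
            path_to_pdf = os.path.join(root, file)
            # Split off the extension so that only PDF files are processed.
            stem, ext = os.path.splitext(path_to_pdf)
            if ext == '.pdf':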