"""
Import Libraries: The script begins by importing the Python libraries it needs to process each file type: os, fnmatch, docx, pptx, PyPDF2, and openpyxl.
Chunk Size Configuration: The constant CHUNK_SIZE_IN_WORDS sets the maximum number of words in each output text chunk. Change this value to control the chunk size.
File Content Extraction Functions: The script defines functions to extract text content from different file types:
read_word_file: Extracts text from .docx files.
read_pptx_file: Extracts text from .pptx files, including text from slides and shapes.
read_pdf_file: Extracts text from .pdf files using PyPDF2.