Created May 8, 2020 15:49
Using Tika to process PDFs
# This removes the manual work of copying the file names.
# You can run this in batches to process many files.
# Install the library with: pip install --user tika
# The first run downloads the Tika server jar (a Java runtime is required).
from tika import parser

filename = "path_to_file"

# Parse the PDF; the result is a dictionary.
raw_content = parser.from_file(filename)

# The PDFs could also be placed in a folder and processed in a loop over the
# filenames, storing the results in whatever final format is needed.
# raw_content.get('content') reads the 'content' key of that dictionary;
# split("\n") breaks the text at line breaks.
# filter(None, ...) drops all empty or otherwise falsy items such as ''.
# list() turns the filter object into a list.
# 'content' can be None when no text was extracted, hence the "or ''" fallback.
process_content = list(filter(None, (raw_content.get('content') or '').split("\n")))

for line in process_content:
    print(line)
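The comment about putting the PDFs in a folder and looping over the filenames could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the helper names `clean_lines` and `extract_folder`, the `parse` parameter, and the choice of a dictionary keyed by filename are all illustrative and not part of the original script. Unlike `filter(None, ...)`, this version also drops whitespace-only lines.

```python
from pathlib import Path


def clean_lines(content):
    """Split extracted text on line breaks and drop blank lines.

    Hypothetical helper; also drops whitespace-only lines, which the
    original filter(None, ...) call would keep.
    """
    if not content:  # 'content' can be None when no text was extracted
        return []
    return [line for line in content.split("\n") if line.strip()]


def extract_folder(folder, parse=None):
    """Return {pdf filename: list of non-empty lines} for each PDF in folder.

    `parse` is an assumed hook: any callable that takes a path string and
    returns a dict with a 'content' key. It defaults to tika's
    parser.from_file, imported lazily so clean_lines works without Tika.
    """
    if parse is None:
        from tika import parser
        parse = parser.from_file
    results = {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        parsed = parse(str(pdf_path))
        results[pdf_path.name] = clean_lines(parsed.get("content"))
    return results
```

A call such as `extract_folder("path_to_folder")` would then give one list of lines per PDF, which can be written out in whatever final format is needed.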