@salvadorgascon
Created February 23, 2024 12:48
Python transformer to convert a bytes array containing PDF data into a string
import datetime
import os

from PyPDF2 import PdfReader


def PdfTextTransformer(pdf_binary, tmp_path):
    """Convert a bytes object containing PDF data into a string of its text."""
    print("Reading PDF")
    # Timestamp-based name: one-second resolution, so concurrent calls may collide.
    filename = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    pdf_path = os.path.join(tmp_path, filename + '.pdf')
    pdf_string = ""  # must start as a string; None += str raises TypeError

    print("Saving PDF", pdf_path)
    with open(pdf_path, 'wb') as f:
        f.write(pdf_binary)

    try:
        print("Parsing PDF")
        pdf_reader = PdfReader(pdf_path)
        # Iterate over every page; range(0, n-1) would skip the last one.
        for page_object in pdf_reader.pages:
            # Guard against a falsy result, e.g. for pages with no text layer.
            pdf_string += page_object.extract_text() or ""
    finally:
        # Remove the temporary file even if parsing fails.
        print("Removing PDF", pdf_path)
        os.remove(pdf_path)

    return pdf_string
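
The temporary file is only there to give `PdfReader` something to open. Since `PdfReader` also accepts a file-like object, a possible variant wraps the bytes in `io.BytesIO` and skips the disk entirely. This is a minimal sketch, not part of the original gist: the name `pdf_bytes_to_text` and the `reader_cls` parameter are hypothetical, added here so the reader class can be swapped out (for example in a unit test).

```python
import io


def pdf_bytes_to_text(pdf_binary, reader_cls=None):
    """Extract text from PDF bytes without writing a temporary file.

    reader_cls is a hypothetical hook (not in the original gist); when
    omitted, PyPDF2's PdfReader is imported and used.
    """
    if reader_cls is None:
        # Imported lazily so the module loads even without PyPDF2 installed.
        from PyPDF2 import PdfReader
        reader_cls = PdfReader

    # BytesIO gives the reader a seekable, file-like view of the bytes.
    reader = reader_cls(io.BytesIO(pdf_binary))

    # Join the text of every page; treat a falsy result as an empty string.
    return "".join(page.extract_text() or "" for page in reader.pages)
```

Calling `pdf_bytes_to_text(pdf_binary)` should yield the same text as `PdfTextTransformer(pdf_binary, tmp_path)` for the same input, with no `tmp_path` needed and no filename collisions between concurrent calls.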