@dennisdv1
Last active July 22, 2021 14:33
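import textract
import PyPDF2


def extract_text_from_pdf(file):
    '''Opens and reads a PDF file from a path and returns its text.'''
    # Note: PyPDF2 3.0 removed these legacy names in favor of
    # PdfReader, reader.pages and page.extract_text().
    with open(file, 'rb') as f:  # close the file handle when done
        fileReader = PyPDF2.PdfFileReader(f)
        page_count = fileReader.getNumPages()
        text = [fileReader.getPage(i).extractText() for i in range(page_count)]
    # Join the pages into one string (not the list's repr), then drop newlines
    return ' '.join(text).replace('\n', '')


def extract_text_from_word(filepath):
    '''Opens and reads a .doc or .docx file from a path and returns its text.'''
    # textract.process returns bytes, so decode before cleaning up whitespace
    txt = textract.process(filepath).decode('utf-8')
    return txt.replace('\n', ' ').replace('\t', ' ')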
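The two functions above each handle one family of file types, so a caller has to choose between them. A minimal sketch of one way to do that by file extension follows. The helper name `extract_text` and its `extractors` mapping are hypothetical and not part of the original gist; the mapping is passed in so the helper itself does not depend on PyPDF2 or textract.

```python
from pathlib import Path


def extract_text(path, extractors):
    '''Dispatch to an extractor based on the file extension (hypothetical helper).

    `extractors` maps a lowercase suffix such as ".pdf" to a callable
    that takes a path and returns the extracted text.
    '''
    suffix = Path(path).suffix.lower()
    if suffix not in extractors:
        raise ValueError(f"Unsupported file type: {suffix or '(no extension)'}")
    return extractors[suffix](path)
```

With the gist's functions in scope, the mapping could be `{".pdf": extract_text_from_pdf, ".doc": extract_text_from_word, ".docx": extract_text_from_word}`, and a call would look like `extract_text("report.pdf", mapping)`.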