@dcragusa
Created September 19, 2018 15:39
Reads a PDF file into text
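import subprocess
from typing import List


def pdf_to_text(fp: str, strip: bool = True) -> List[str]:
    """Return the text of the PDF at `fp` as a list of lines, using pdftotext."""
    # '-layout' keeps the physical page layout; the trailing '-' writes to stdout.
    proc = subprocess.Popen(['pdftotext', '-layout', fp, '-'], stdout=subprocess.PIPE)
    raw_output = proc.communicate()[0].decode()
    if strip:
        # Strip surrounding whitespace and drop lines that end up empty.
        res = [i.strip() for i in raw_output.split('\n') if i.strip()]
    else:
        res = raw_output.split('\n')
    return res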
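
The function above does not check whether pdftotext succeeded, so a missing or unreadable PDF yields an empty result rather than an error. One possible variant, sketched here under assumptions, uses subprocess.run with check=True and moves the line cleanup into a helper so it can be tested without a PDF. The names clean_lines and pdf_to_text_checked are hypothetical and not part of the original gist.

```python
import subprocess
from typing import List


def clean_lines(raw_output: str, strip: bool = True) -> List[str]:
    """Split pdftotext output into lines (hypothetical helper)."""
    lines = raw_output.split('\n')
    if strip:
        # Drop lines that are empty or whitespace-only (this includes the
        # form feed pdftotext emits between pages) and trim the rest.
        return [line.strip() for line in lines if line.strip()]
    return lines


def pdf_to_text_checked(fp: str, strip: bool = True) -> List[str]:
    """Like pdf_to_text, but raise if pdftotext exits with an error."""
    # check=True raises subprocess.CalledProcessError on a non-zero exit code;
    # FileNotFoundError is raised if the pdftotext binary is not installed.
    proc = subprocess.run(
        ['pdftotext', '-layout', fp, '-'],
        stdout=subprocess.PIPE,
        check=True,
    )
    return clean_lines(proc.stdout.decode(), strip)
```

For example, clean_lines('Title  \n\n   body text\n\x0c') returns ['Title', 'body text'], while passing strip=False returns the raw lines unchanged.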