@achikin
Created August 30, 2016 22:29
Dockerfile for doc2text
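# Dockerfile
FROM ubuntu:16.04
WORKDIR /my/
# Update and install in a single layer so a stale cached package index is never reused.
RUN apt-get -qq -y update && apt-get -qq -y install \
    python \
    python-pip \
    tesseract-ocr \
    python-pythonmagick \
    libopencv-dev \
    python-opencv
RUN pip install doc2text
# COPY is enough for plain local files; ADD is only needed for URLs or archives.
COPY dtt.py image.png /my/
CMD ["/usr/bin/python", "/my/dtt.py"]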
# dtt.py
import doc2text
# Initialize the class.
doc = doc2text.Document()
# Read the file in. Currently accepts pdf, png, jpg, bmp, tiff.
# If reading a PDF, doc2text will split the PDF into its component pages.
doc.read('/my/image.png')
# Crop the pages down to estimated text regions, deskew, and optimize for OCR.
doc.process()
# Extract text from the pages.
doc.extract_text()
text = doc.get_text()
# The parenthesized form works with the Python 2 interpreter in this image and with Python 3.
print(text)
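
The script above hard-codes `/my/image.png`. A minimal sketch of a reusable variant follows, assuming the same `doc2text` calls shown above. The helper names `is_supported` and `ocr_file` are hypothetical. The extension list is copied from the script's comment and is an assumption about what doc2text accepts, not something checked against the library.

```python
import os

# Extensions copied from the comment in dtt.py; this list is an assumption,
# not a guarantee made by doc2text itself.
SUPPORTED_EXTENSIONS = {'.pdf', '.png', '.jpg', '.bmp', '.tiff'}


def is_supported(path):
    """Return True if the file extension is one the dtt.py comment lists."""
    return os.path.splitext(path)[1].lower() in SUPPORTED_EXTENSIONS


def ocr_file(path):
    """Run the same read/process/extract sequence as dtt.py on one file."""
    if not is_supported(path):
        raise ValueError('unsupported file type: ' + path)

    # Imported lazily so is_supported() can be used without doc2text installed.
    import doc2text

    doc = doc2text.Document()
    doc.read(path)        # a PDF is split into its component pages
    doc.process()         # crop, deskew, and optimize for OCR
    doc.extract_text()
    return doc.get_text()
```

With this in place, the body of `dtt.py` could shrink to `print(ocr_file('/my/image.png'))`, and the path could later come from `sys.argv` instead of being baked into the image.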