Sigmame / pdf_table_with Tesseract
Created December 31, 2016 14:48 — forked from jaganadhg/pdf_table_with Tesseract
Extract data from a PDF table using the Python Imaging Library, ImageMagick, and Tesseract
# Refer to http://craiget.com/extracting-table-data-from-pdfs-with-ocr/
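# Pillow exposes these modules under the PIL package; a bare `import Image` only works with the legacy PIL install
from PIL import Image, ImageOps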
import subprocess, sys, os, glob
# minimum run of adjacent pixels to call something a line
H_THRESH = 300
V_THRESH = 300
def get_hlines(pix, w, h):
    """Get start/end pixels of lines containing horizontal runs of at least H_THRESH black pixels."""