Created October 21, 2020 02:15
The get_info() function reads the image with OpenCV and applies thresholding, dilation, noise removal, and
contour detection to retrieve bounding boxes from the contours. Each box is then passed to Tesseract for OCR.
Below are some additional OpenCV functions available for preprocessing:
Median filter: a median filter suppresses noise by replacing each pixel with the median of its neighboring pixels
Dilation and erosion: dilation adds pixels to the boundaries of objects, erosion removes them
cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel) # Opening: an erosion followed by a dilation
import cv2
import numpy as np
import pytesseract
from skimage.filters import threshold_local


def get_info(path):
    """Return the text found in the image at `path`, region by region."""
    image = cv2.imread(path)
    if image is None:
        raise FileNotFoundError(f"Could not read image: {path}")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Adaptive threshold; method can be generic, mean, median, or gaussian
    T = threshold_local(gray, 15, offset=6, method="gaussian")
    thresh = (gray > T).astype("uint8") * 255
    thresh = ~thresh  # invert so that text is white on black

    # Erode, then dilate (a 1x1 kernel leaves the image unchanged; enlarge it to denoise)
    kernel = np.ones((1, 1), np.uint8)
    ero = cv2.erode(thresh, kernel, iterations=1)
    img_dilation = cv2.dilate(ero, kernel, iterations=1)

    # Remove noise: keep only connected components of at least 10 pixels
    nlabels, labels, stats, centroids = cv2.connectedComponentsWithStats(
        img_dilation, None, None, None, 8, cv2.CV_32S)
    sizes = stats[1:, -1]  # CC_STAT_AREA column, skipping background label 0
    final = np.zeros(labels.shape, np.uint8)
    for i in range(nlabels - 1):
        if sizes[i] >= 10:  # filter small dotted regions
            final[labels == i + 1] = 255

    # Find contours after merging neighboring characters into text blocks
    kern = np.ones((5, 15), np.uint8)
    img_dilation = cv2.dilate(final, kern, iterations=1)
    contours, hierarchy = cv2.findContours(
        img_dilation, cv2.RETR_LIST, cv2.CHAIN_APPROX_NONE)

    # Map contours to bounding rectangles (x, y, w, h)
    rects = [cv2.boundingRect(c) for c in contours]
    # Sort by x, then (stably) by y: top to bottom, ties broken left to right
    sorted_rects = sorted(rects, key=lambda r: r[0])
    sorted_rects = sorted(sorted_rects, key=lambda r: r[1])

    etfo = ""
    for x, y, w, h in sorted_rects:
        if w < 20 or h < 20:  # skip regions too small to hold text
            continue
        temp = image[y:y + h, x:x + w]
        temp = cv2.cvtColor(temp, cv2.COLOR_BGR2RGB)
        data = pytesseract.image_to_data(
            temp, config=r'--psm 6', output_type=pytesseract.Output.DICT)
        # Keep recognized words only; layout rows carry empty text
        words = [t for t in data["text"] if t.strip()]
        etfo = etfo + " ".join(words) + " "
    return etfo

ajeet28 commented Oct 30, 2020

This is a great solution. I have images saved in multi-page PDF files. Can you suggest how to apply this to PDFs that are basically scanned documents saved as PDF?

pil_images = pdf2image.convert_from_path(your_multipage_pdf_file_here, dpi=200, output_folder='/tmp', first_page=None, last_page=None, fmt='JPG', thread_count=1, userpw=None, use_cropbox=False, strict=False)

This is just the meat; I'll let you figure out the rest.
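For anyone landing here later, one way the rest could look is sketched below. It is only an assumption about how to glue the pieces together: the helper name ocr_pdf, the page file names, and the injectable convert argument are all hypothetical, and get_info() is expected to be the function from the gist, which takes a file path. Each page is rendered to a temporary PNG so it can be passed in by path.

```python
import os
import tempfile


def ocr_pdf(pdf_path, ocr_page, dpi=200, convert=None):
    """Render each PDF page to a PNG and run ocr_page(image_path) on it.

    ocr_page is any callable taking an image path (e.g. get_info).
    convert defaults to pdf2image.convert_from_path; it is a parameter
    only so the helper can be tried without pdf2image/poppler installed.
    """
    if convert is None:
        # Lazy import: pdf2image also needs the poppler utilities on the system
        from pdf2image import convert_from_path as convert

    texts = []
    with tempfile.TemporaryDirectory() as tmp_dir:
        pages = convert(pdf_path, dpi=dpi)  # list of PIL images, one per page
        for number, page in enumerate(pages, start=1):
            image_path = os.path.join(tmp_dir, f"page_{number}.png")
            page.save(image_path, "PNG")
            texts.append(ocr_page(image_path))
    return texts
```

With the gist's function in scope, a call would look like `texts = ocr_pdf("scanned.pdf", get_info)`, giving one string per page. PNG is used here instead of JPG to avoid compression artifacts around the text.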

Where is the image path here? You're not assigning any image or PDF to the path, so how do you pass in the image or PDF file you want to test?
