@RobinXL
Last active February 9, 2023 10:11
Convert image with correct DPI for tesseract OCR purpose
from subprocess import Popen, PIPE, check_output
import os
import tempfile
import uuid

import cv2
import pytesseract as pt
from PIL import Image


def convert_dpi(input_image):
    """Write input_image to a temp JPEG and re-save it with explicit DPI metadata."""
    path = tempfile.gettempdir()
    filename = str(uuid.uuid4())[:8] + ".jpg"
    filename_out = str(uuid.uuid4())[:8] + "_out.jpg"
    image_path = os.path.join(path, filename)
    image_path_out = os.path.join(path, filename_out)
    cv2.imwrite(image_path, input_image)

    # Step 1: read the vertical resolution with ImageMagick's identify.
    # check_output returns bytes, so decode before stripping the "dpi:" prefix.
    proc_identify = check_output(['identify', '-format', "dpi:%y", image_path], stderr=PIPE)
    dpi = proc_identify.decode().replace('dpi:', '').strip()
    print("image identify finished, DPI: " + dpi)

    # Step 2: rewrite the image with the density expressed in pixels per inch.
    proc_convert = Popen(['convert', '-units', 'PixelsPerInch', image_path, '-density', dpi, image_path_out])
    out = proc_convert.communicate()
    p_status = proc_convert.wait()
    print("image convert finished: ", out, p_status, image_path_out)
    return image_path_out


imagepath = '.......'
img = cv2.imread(imagepath)
# Baseline: OCR on the original image loaded as a numpy array.
print("first attempt: ", pt.image_to_string(img))

converted_img = convert_dpi(img)
# Converted image passed as a numpy array, as a file path, and as a PIL image.
print("second attempt: ", pt.image_to_string(cv2.imread(converted_img)))
print("third attempt: ", pt.image_to_string(converted_img))
print("fourth attempt: ", pt.image_to_string(Image.open(converted_img)))

RobinXL commented May 30, 2019

Tesseract works best when the image's metadata carries correct DPI information.
This may be helpful for images with incorrect DPI info.

  1. Get the DPI info using ImageMagick.
  2. Use ImageMagick to convert the image with the correct unit (pixels per inch).
  3. Run pytesseract on the converted image with three input types: 1. the new image path, 2. a numpy array, 3. a PIL image.
    The tests show that passing the image path directly or passing a PIL image gives correct results.
    (Don't forget to delete the temp files.)
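
On the temp-file reminder above: one way to avoid manual cleanup is to create both files inside a throwaway directory. This is only a minimal sketch using the standard library; `temp_image_paths` is a hypothetical helper name, not something the gist defines, and it assumes the OCR call happens before the `with` block exits.

```python
import os
import tempfile
import uuid
from contextlib import contextmanager


@contextmanager
def temp_image_paths(suffix=".jpg"):
    """Yield (input_path, output_path) inside a temporary directory.

    Hypothetical helper, not part of the gist. The directory and every
    file written into it are deleted when the `with` block exits.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        stem = uuid.uuid4().hex[:8]
        image_path = os.path.join(tmpdir, stem + suffix)
        image_path_out = os.path.join(tmpdir, stem + "_out" + suffix)
        yield image_path, image_path_out
```

Inside `convert_dpi`, the two `os.path.join` lines could be replaced by `with temp_image_paths() as (image_path, image_path_out):`. Because the output file disappears on exit, the function would then need to run pytesseract (or load the image into memory) inside the block rather than return the path.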

@AbhishekYashSingh

I am getting this error: "[Errno 2] No such file or directory: 'identify'".
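
An `[Errno 2]` from `subprocess` for `'identify'` generally means the executable could not be found on `PATH`. `identify` and `convert` are ImageMagick command-line tools, so the likely cause (an assumption, since the comment gives no details about the setup) is that ImageMagick is not installed or not on `PATH`. A small check like the one below can confirm that before calling `convert_dpi`; `missing_tools` is a hypothetical helper name.

```python
import shutil


def missing_tools(tools=("identify", "convert")):
    """Return the names in `tools` that cannot be found on PATH.

    Hypothetical helper, not part of the gist. shutil.which returns None
    when no matching executable is found.
    """
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("ImageMagick tools found.")
```

If either name is reported missing, installing ImageMagick through the system's package manager and re-running the check is the usual next step.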
