Last active April 18, 2020 01:58
Parsing number of COVID-19 cases from MN Department of Corrections website, which provides data as an image of a table.
Used image from 4/17/2020 at
Output looks like:
>> python
Full string is:
Total 59 16 40 3 37 10 2 o
Parsed 16 positive cases from mn_prison_covid_table_from_website.jpeg
import cv2
import pytesseract


def parse_num_cases_from_image(image_path):
    full_img = cv2.imread(image_path)
    height, width, channels = full_img.shape

    # Thanks to
    # The last row, which contains totals, is ~30px from the bottom
    bottom_30px = full_img[height - 30:height, 0:width]

    # trial and error to get this combination working.
    # --psm 6 - "Assume a single uniform block of text."
    config = '--psm 6'
    img_as_string = pytesseract.image_to_string(bottom_30px, config=config)
    print('Full string is:')
    print(img_as_string)

    # looks like "Total 59 16 ..."
    # and "Confirmed Positive" is the third token (index 2)
    num_confirmed_positive = img_as_string.split()[2]
    print('Parsed {} confirmed positive cases from {}'.format(num_confirmed_positive, image_path))


if __name__ == '__main__':
    # Filename taken from the sample output above
    parse_num_cases_from_image('mn_prison_covid_table_from_website.jpeg')
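
The sample output ends in "o" where the table presumably has a 0, so the OCR string may be worth cleaning before trusting any column. A minimal sketch of one way to do that, in plain Python with no OCR dependency. The helper name parse_totals_row and the character substitutions (o/O for 0, l/I for 1) are assumptions for illustration and are not part of the original script.

```python
# Hypothetical helper, not part of the original script.
# Assumed OCR confusions: letter o/O read for 0, letter l/I read for 1.
OCR_DIGIT_FIXES = str.maketrans({'o': '0', 'O': '0', 'l': '1', 'I': '1'})


def parse_totals_row(ocr_text):
    """Return the numeric columns of the first line that starts with 'Total'."""
    for line in ocr_text.splitlines():
        tokens = line.split()
        if not tokens or tokens[0].lower() != 'total':
            continue
        values = []
        for token in tokens[1:]:
            cleaned = token.translate(OCR_DIGIT_FIXES)
            if not cleaned.isdigit():
                # Fail loudly instead of reporting a misread number
                raise ValueError('Unexpected token in totals row: {!r}'.format(token))
            values.append(int(cleaned))
        return values
    raise ValueError('No totals row found in OCR output')


# Sample string copied from the output shown earlier
totals = parse_totals_row('Total 59 16 40 3 37 10 2 o')
# "Confirmed Positive" is the second numeric column, i.e. index 1 here
print(totals[1])
```

Raising on an unexpected token means a change in the table layout or a bad OCR pass shows up as an error rather than as a wrong case count.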