$ brew install imagemagick
$ brew install tesseract
$ ls -l1
Basic Infrastructure.pdf
extract.rb
$ ruby extract.rb 2> /dev/null
extracting page 1
.....................
extracting page 2
.....
$ ls -l1
Basic Infrastructure.pdf
extract.rb
data.csv
The output is written to data.csv.
The script extracts an image for each page (extract_page).
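A minimal sketch of how a per-page extraction like this could look, assuming ImageMagick's convert is on the PATH. The helper names page_command and extract_page_sketch, the 300 dpi density, and the output file name are illustrative assumptions, not taken from extract.rb.

```ruby
require "open3"

# Hypothetical helper: builds the ImageMagick command that renders
# one PDF page to a PNG. Density and file name are assumptions.
def page_command(pdf, page, out = "page-#{page}.png")
  # ImageMagick indexes PDF pages from 0, so page 1 is "[0]".
  ["convert", "-density", "300", "#{pdf}[#{page - 1}]", out]
end

# Runs the command and returns the path of the generated image.
def extract_page_sketch(pdf, page)
  cmd = page_command(pdf, page)
  _out, err, status = Open3.capture3(*cmd)
  raise "convert failed: #{err}" unless status.success?
  cmd.last
end
```

Passing the command as an array avoids shell quoting problems with file names that contain spaces, such as Basic Infrastructure.pdf.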
For each row, it extracts an image of the id value using convert from ImageMagick, then runs tesseract OCR on that image to read the actual value (extract_id).
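The crop-then-OCR step could be sketched roughly as follows. The names geometry and extract_id_sketch, the crop hash layout, and the temporary file name are assumptions made for illustration; only the convert and tesseract command-line tools come from the answer.

```ruby
require "open3"

# Hypothetical helper: ImageMagick geometry string "WxH+X+Y".
def geometry(x:, y:, width:, height:)
  "#{width}x#{height}+#{x}+#{y}"
end

# Crops the id cell out of a page image and runs OCR on the crop.
def extract_id_sketch(page_png, crop, out = "id.png")
  _o, err, status = Open3.capture3(
    "convert", page_png, "-crop", geometry(**crop), "+repage", out
  )
  raise "convert failed: #{err}" unless status.success?

  # "tesseract <image> stdout" prints the recognized text to stdout.
  text, err, status = Open3.capture3("tesseract", out, "stdout")
  raise "tesseract failed: #{err}" unless status.success?
  text.strip
end
```

A call would look like extract_id_sketch("page-1.png", { x: 50, y: 400, width: 200, height: 40 }), where the numbers are placeholders for the real cell position.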
To extract the checkbox value, a subimage of the checkbox is generated and the average value of all its colors is computed. An unchecked checkbox is mostly white, while a checked one contains some black, so its average decreases. This is done in extract_chk.
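One way to express the averaging idea, as a sketch only: convert the crop to grayscale and ask ImageMagick for its mean brightness (0.0 is black, 1.0 is white), then compare against a cut-off. The 0.9 threshold, the grayscale conversion, and the names mean_brightness and checked? are assumptions; the real cut-off would need tuning against actual scans.

```ruby
require "open3"

# Assumed cut-off: a mean below this counts as "checked".
CHECKED_THRESHOLD = 0.9

# A darker (lower) mean implies ink inside the box.
def checked?(mean, threshold = CHECKED_THRESHOLD)
  mean < threshold
end

# Mean brightness of a cropped region, in the range 0.0..1.0.
# crop_geometry is an ImageMagick "WxH+X+Y" string.
def mean_brightness(image, crop_geometry)
  out, err, status = Open3.capture3(
    "convert", image, "-crop", crop_geometry, "+repage",
    "-colorspace", "Gray", "-format", "%[fx:mean]", "info:"
  )
  raise "convert failed: #{err}" unless status.success?
  out.to_f
end
```

Keeping the threshold comparison separate from the ImageMagick call makes the decision rule easy to adjust without touching the image handling.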
get_id_crop_coordinates and get_check_crop_coordinates give the cropping coordinates for the id value and the checkbox. These values differ depending on whether the page is the first one or not.
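A sketch of how page-dependent crop coordinates might be computed. Every pixel value below is a made-up placeholder, and the function names are hypothetical. The assumption that the first page differs only in where its rows start (for example because of a header) is a guess, not something the answer states.

```ruby
# Placeholder layout values in pixels; real ones depend on the PDF.
FIRST_PAGE_TOP = 400 # assumed: rows start lower on page 1
OTHER_PAGE_TOP = 120
ROW_HEIGHT     = 40

# Vertical offset of a zero-based row on the given page.
def row_top(page, row)
  top = page == 1 ? FIRST_PAGE_TOP : OTHER_PAGE_TOP
  top + row * ROW_HEIGHT
end

# Crop rectangle for the id cell of a row.
def id_crop_sketch(page, row)
  { x: 50, y: row_top(page, row), width: 200, height: ROW_HEIGHT }
end

# Crop rectangle for the checkbox of the same row.
def check_crop_sketch(page, row)
  { x: 600, y: row_top(page, row), width: 30, height: ROW_HEIGHT }
end
```

Deriving both rectangles from the same row_top keeps the id and the checkbox aligned on the same row.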