GSoC Project Proposal: Improve the OCR Subsystem
Email: saurabhshah.0410@gmail.com
The goal of my project was to make hard subtitle extraction user-friendly by making the subsystem independent of arbitrary user input parameters such as sub_color, conf_thresh, luminance, whiteness, etc. This would also extend CCExtractor's usage to extracting burned-in subtitles from video files containing multi-color captions. The idea was to implement Neumann and Matas' text detection algorithm, which would meet the above objectives while keeping time complexity and memory requirements reasonable.
All my commits to the mainstream master branch can be seen here.
All of my work related to the GSoC project can be viewed here.
The compilation instructions will remain the same as before:
make ENABLE_HARDSUBX=yes ENABLE_OCR=yes
This command needs to be run from the ccextractor/linux directory.
The usage does not change much, except that the user now only has to specify the input video and the optional parameters described below. The -hardsubx flag must be passed to the ccextractor executable to enable burned-in subtitle extraction.
-ocr_mode: Set the OCR mode to either word-wise or textline-wise. e.g. -ocr_mode word
-min_sub_duration: Specify the minimum duration (in seconds) that a subtitle line must exist on the screen. Lower values give better-timed results, but increase processing time. The default value is 0.5. e.g. -min_sub_duration 1.0 (for a duration of 1 sec)
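The trade-off behind -min_sub_duration can be illustrated with a little arithmetic. The sketch below assumes that the extractor samples roughly one frame per minimum-duration interval, which would explain why lower values cost more processing time; that sampling rule and the helper names sampling_step and frames_to_ocr are hypothetical and are not taken from the CCExtractor source.

```c
#include <assert.h>

/* Hypothetical helper: how many frames to advance between two OCR passes,
 * assuming one sample per min_sub_duration seconds (an assumption, not
 * necessarily what CCExtractor does). Never returns less than 1. */
long sampling_step(double fps, double min_sub_duration)
{
    assert(fps > 0.0 && min_sub_duration > 0.0);
    long step = (long)(fps * min_sub_duration); /* truncates toward zero */
    return step < 1 ? 1 : step;
}

/* Hypothetical helper: number of frames that would be sent to OCR for a
 * video of total_frames frames (ceiling division by the sampling step). */
long frames_to_ocr(long total_frames, double fps, double min_sub_duration)
{
    long step = sampling_step(fps, min_sub_duration);
    return (total_frames + step - 1) / step;
}
```

Under this assumption, a 60 second clip at 25 fps (1500 frames) would need 125 OCR passes with the default of 0.5, and 60 passes with -min_sub_duration 1.0.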
Mat.c, Mat.h: contains initializers and other basic operators for the basic struct Mat
math.h: handles all the basic mathematical operations on Points, Rectangles, sequences etc.
erfilter.c, erfilter.h: consists of the functions required for the extraction of the text-containing extremal regions
color.c, color.h: converts images from RGB to the HSV, LAB and GRAY formats
floodfill.c, contours.c: contains functions to identify the contours around the text regions
types.h, storage.c, MemStorage.c: general functions to optimize memory requirements
trained_classifierNM1.xml, trained_classifierNM2.xml: trained classifiers for identifying character regions in the image
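To give a flavour of the kind of conversion that color.c is described as handling, here is a minimal sketch of an RGB to GRAY conversion. It is not the code from the project: the function name rgb_to_gray and the packed 8-bit RGB layout are assumptions made for illustration, and the weights are the standard BT.601 luma coefficients (0.299, 0.587, 0.114) approximated in 8-bit fixed point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative only (hypothetical name and signature).
 * Converts n_pixels packed RGB pixels (3 bytes each, R first) to 8-bit gray.
 * Weights 77/256, 150/256 and 29/256 approximate 0.299, 0.587 and 0.114;
 * adding 128 before the shift rounds to the nearest integer. */
void rgb_to_gray(const uint8_t *rgb, uint8_t *gray, size_t n_pixels)
{
    assert(rgb != NULL && gray != NULL);
    for (size_t i = 0; i < n_pixels; i++) {
        unsigned r = rgb[3 * i];
        unsigned g = rgb[3 * i + 1];
        unsigned b = rgb[3 * i + 2];
        gray[i] = (uint8_t)((77u * r + 150u * g + 29u * b + 128u) >> 8);
    }
}
```

The three weights sum to 256, so a pure white pixel maps to 255 and a pure black pixel to 0, and the result always fits in one byte.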
The last 3 to 4 months that I have worked on this project with CCExtractor have been a huge boost to my coding and communication skills. This project gave me a great opportunity to learn about traditional as well as state-of-the-art methods of text processing. I am also now very familiar with the source code, API and usage of opencv, because I had to read and understand the functions of opencv, whose text module contains the same algorithm that I had proposed in my proposal. During the course of this project, I have also become comfortable with C++, which has helped me a lot in my campus interviews. Overall, this project was fun and a good learning experience for me.
I loved the working environment of CCExtractor and I will keep contributing in the future on my own time. There is much scope for improvement in the code that I've implemented, and I'll keep improving and updating it. I'll try to use more robust trained models to boost the accuracy and quality of the extracted subtitles. I'll also try implementing a CNN-based approach and making it work on an average computer.