Week 1

30 May, 2020

Hello!

This is the start of a blog that documents my experience and progress as a GSoC 2020 student with NV Access while working on my project titled Image captioning and Object recognition modules for NVDA! The purpose of these blog posts is not only to keep track of the work done each week but also to serve as a guide for future GSoC students! (By the way, feel free to reach out to me about it!)

The community bonding period was a blast! My mentors, Michael Curran and Reef Turner, along with the rest of the NVDA community, were incredibly warm and helpful! The community welcomed me and we discussed my project at great length. Some of the community members gave me helpful pointers (how did I not think of adding OCR alongside the object recognition module 🤦‍♂️) and referred me to some previous work that aimed to accomplish the same goals! I spent most of the community bonding period banging my head against what seemed like a deceptively simple bug. A ZeroDivisionError, of all things! The bug led me down a deep rabbit hole where I tested different solutions and explored the codebase. Although exhausting, the exercise helped me understand the codebase, the development workflow and the concept of writing code for the future! To all the new developers and future GSoC students: the community bonding period is invaluable, so please make the best use of it! Remember to rubber-duck 🦆, and I hope you feel the same joy I felt when my first Pull Request got accepted 🎉🎉🎉.

The first week of the coding period was a roller-coaster ride: lots of ups and a few downs. We decided that the best course of action would be to look for pre-trained image captioning models, test them and choose the best one to implement. Sounds easy, right? Nope. All of the models available had dependency issues (aka the bane of all developers 😓)! They were either written in Python 2 or used an ancient version of TensorFlow (OK, not that ancient in the grand scheme of things, but ancient enough to be 50% deprecated). With a lot of patience, perseverance and, mostly, Stack Overflow-ing, I finally got 3 models to work! You can find them here:

  1. https://github.com/DeepRNN/image_captioning
  2. https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Image-Captioning
  3. https://github.com/yunjey/pytorch-tutorial/tree/master/tutorials/03-advanced/image_captioning

These models work amazingly well (almost like magic) but their size is a concern. Looks like I'll be making add-ons for NVDA and not integrated sub-modules as planned 😕. (Also, I ended up favoring PyTorch-based models because PyTorch >>>>> TensorFlow.)
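In case you're curious what running one of these PyTorch captioning models actually looks like, here's a minimal sketch of the usual encoder-decoder inference loop. Note that `EncoderCNN`, `DecoderRNN`, the checkpoint paths and the vocabulary pickle are illustrative placeholders, not the actual names from the repos above; each repo has its own equivalents:

```python
# Minimal sketch of caption generation with a PyTorch encoder-decoder model.
# EncoderCNN/DecoderRNN, the checkpoint files and vocab.pkl are hypothetical
# stand-ins for whatever the chosen repo provides.
import pickle

import torch
from PIL import Image
from torchvision import transforms

from model import EncoderCNN, DecoderRNN  # placeholder module/classes

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Standard ImageNet preprocessing, since these encoders are CNNs
# pre-trained on ImageNet.
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

with open("vocab.pkl", "rb") as f:  # word<->index mapping used in training
    vocab = pickle.load(f)

encoder = EncoderCNN().to(device).eval()
decoder = DecoderRNN(vocab_size=len(vocab)).to(device).eval()
encoder.load_state_dict(torch.load("encoder.pth", map_location=device))
decoder.load_state_dict(torch.load("decoder.pth", map_location=device))

image = transform(Image.open("photo.jpg").convert("RGB")).unsqueeze(0).to(device)

with torch.no_grad():
    features = encoder(image)            # image -> feature vector
    word_ids = decoder.sample(features)  # greedy decoding -> token ids

words = [vocab.idx2word[i] for i in word_ids]
caption = " ".join(w for w in words if w not in ("<start>", "<end>", "<pad>"))
print(caption)
```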

The object detection part was extremely easy! Having worked with darknet/YOLOv3 🔥🔥🔥 before, I didn't need to do much to get it working again. Reef had the incredible idea of using the objects detected by darknet to form a basic description of what the image contains (see the sketch below). Then, if the user wishes for a more detailed description, an actual image captioning model could be used.
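Here's a rough sketch of that idea, assuming darknet has already returned a flat list of class labels for the image (the label list is hard-coded here for illustration, and the pluralization is deliberately naive):

```python
# Sketch of Reef's idea: turn a flat list of detected object labels into a
# short, human-readable description. In practice the labels would come from
# running darknet/YOLOv3 on the image.
from collections import Counter

def describe(labels):
    """Build a basic description like 'The image contains 2 persons and 1 dog.'"""
    if not labels:
        return "No objects were detected in the image."
    counts = Counter(labels)
    # Naive pluralization: just tack on an 's' when the count is > 1.
    parts = [f"{n} {label}{'s' if n > 1 else ''}" for label, n in counts.items()]
    if len(parts) > 1:
        body = ", ".join(parts[:-1]) + " and " + parts[-1]
    else:
        body = parts[0]
    return f"The image contains {body}."

# Example with labels YOLOv3 might return for a street scene:
print(describe(["person", "person", "dog", "bicycle"]))
# -> The image contains 2 persons, 1 dog and 1 bicycle.
```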
