Image captioning and Object detection add-ons for NVDA

GSoC 2020 | NV ACCESS | Shubham Dilip Jain

Final Report

Sample output produced using the object detection add-on, featuring the result "The image contains people, a dog, and a backpack." as embedded text and coloured boxes drawn around each of these objects in the image.


The internet today is rich in image content, from websites such as Instagram and Pinterest that are dedicated to curating and displaying images, to Facebook and Reddit, which carry large amounts of content in image form. Non-visual users find it challenging to navigate and use these websites for their intended purpose. The information in images, whether on the internet or stored locally, is also inaccessible to non-visual users.

NVDA (NonVisual Desktop Access) is a free, open-source, portable screen reader for Microsoft Windows. It allows blind and vision-impaired people to access and interact with the Windows operating system and many third-party applications. There is also an extensive ecosystem of NVDA add-ons: additional packages that can be downloaded and installed into a user's copy of NVDA to enhance existing functionality or add new features.

NVDA already includes OCR that can recognize text within images. However, it currently lacks any functionality to describe image content or to let users understand and interact with the various objects within images. This project aims to fill that gap with two add-ons:

  1. An image captioning add-on that generates descriptive captions for images on the user’s screen or for images supplied manually by the user.
  2. An object detection add-on that draws bounding boxes around recognized objects and outputs the object label when the user’s pointer enters a bounding box.

The output of both add-ons can be presented to the user through NVDA’s existing mechanisms, such as speech or braille.

Work Done

The work done during this internship culminated in two add-ons. Their development, along with stable releases, can be followed at the following GitHub repositories:

  1. Image captioning Addon: This add-on allows users to perform image captioning on image elements present on their screen and get a caption that describes the image in English.
  2. Object Detection Addon: This add-on allows users to perform object detection on image elements present on their screen and get results in the form of a sentence and bounding boxes drawn around the detected objects. Users can move their mouse pointer, or their finger on touch screens, inside a bounding box to hear the object label.
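
To illustrate the pointer behaviour described above, here is a minimal Python sketch of how a pointer position could be matched against detected bounding boxes. It is not the add-on's actual code: the `Detection` class, the `label_at` function, and the rule of preferring the smallest box when boxes overlap are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one detected object (not the add-on's real type)."""
    label: str
    x: int       # left edge, in screen pixels
    y: int       # top edge, in screen pixels
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        # The box includes its top-left edge and excludes its bottom-right edge.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def label_at(detections: List[Detection], px: int, py: int) -> Optional[str]:
    """Return the label of the box under the pointer, or None if there is none.

    Assumption: when boxes overlap, the smallest one wins, so that a backpack
    worn by a person is not hidden behind the person's larger box.
    """
    hits = [d for d in detections if d.contains(px, py)]
    if not hits:
        return None
    return min(hits, key=lambda d: d.width * d.height).label


if __name__ == "__main__":
    boxes = [
        Detection("person", 0, 0, 100, 200),
        Detection("backpack", 20, 50, 30, 40),
    ]
    print(label_at(boxes, 25, 60))    # pointer is inside both boxes
    print(label_at(boxes, 5, 5))      # pointer is inside the person box only
    print(label_at(boxes, 500, 500))  # pointer is outside every box
```

In a real add-on, the returned label would be handed to NVDA for speech or braille output instead of being printed.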

The add-ons can be tested manually by following this testing checklist.

The add-ons depend on DLLs that use ONNX Runtime to run inference on their respective ML models, which were converted from PyTorch to ONNX format. The DLLs can be found in the following GitHub repositories:

  1. Uses ONNX Runtime to run inference on an image using a model based on the Say-Look-Tell Image captioning model and returns a caption describing the input image.
  2. Uses ONNX Runtime to run inference on an image using the YOLOv3-Darknet model for object detection and returns object locations along with a sentence form of the result.
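
As a rough illustration of the "sentence form of the result", the sketch below turns a list of detected labels into a sentence like the one in the sample output at the top of this report. The real logic lives inside the DLL and may differ; the function names, the naive pluralization rules, and the message used when nothing is detected are assumptions made for this example.

```python
from collections import Counter
from typing import List

# Assumption: only a tiny table of irregular plurals; a real implementation
# would need to cover every label the model can emit.
IRREGULAR_PLURALS = {"person": "people"}


def pluralize(label: str) -> str:
    """Naive English pluralization, good enough for a sketch."""
    if label in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[label]
    if label.endswith(("s", "x", "ch", "sh")):
        return label + "es"
    return label + "s"


def with_article(label: str) -> str:
    """Prefix a singular label with 'a' or 'an' based on its first letter."""
    article = "an" if label[0].lower() in "aeiou" else "a"
    return f"{article} {label}"


def describe(labels: List[str]) -> str:
    """Build a sentence from detected labels, keeping first-seen order."""
    counts = Counter(labels)  # insertion-ordered on Python 3.7+
    if not counts:
        # Hypothetical wording for the empty case.
        return "No objects were detected in the image."
    phrases = [with_article(label) if n == 1 else pluralize(label)
               for label, n in counts.items()]
    if len(phrases) == 1:
        joined = phrases[0]
    elif len(phrases) == 2:
        joined = " and ".join(phrases)
    else:
        joined = ", ".join(phrases[:-1]) + ", and " + phrases[-1]
    return f"The image contains {joined}."


if __name__ == "__main__":
    print(describe(["person", "person", "dog", "backpack"]))
    # prints: The image contains people, a dog, and a backpack.
```

Grouping repeated labels before building the sentence keeps the spoken output short, which matters when the result is read aloud by a screen reader.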

Along with the above work, I also made some contributions to the NVDA codebase. These can be found below:

Weekly Blogs

During the course of this internship, I also wrote weekly blog posts. These posts helped me keep track of what I did each week and gave my mentors an overview of the week's progress. For anyone else reading them, they offer insight into the thinking behind the decisions made and the challenges faced while working on the project. The blog posts were written as public GitHub Gists that can be found at the following links:

Future Scope

  • Provide support for more languages or translations.
  • Perform detection on all image elements in the browser automatically and cache results.
  • Store detection results in image metadata for images on the user’s filesystem.
  • Detect and present any text found in the images being processed.