
Image captioning and Object detection add-ons for NVDA

GSoC 2020 | NV ACCESS | Shubham Dilip Jain

Final Report

Sample output produced using the object detection add-on, featuring the result "The image contains people, a dog, and a backpack." as embedded text and coloured boxes drawn around each of these objects in the image.

Introduction

The internet today is rich in image content, from entire websites like Instagram and Pinterest dedicated to curating and displaying images, to sites like Facebook and Reddit that host large amounts of content in image form. Non-visual users find it challenging to navigate these websites and use them for their intended purpose. The information in images, whether on the internet or stored locally, is also inaccessible to them.

Week 12

22 August, 2020

Hello again! I couldn't write a blog the previous week or even get much work done because I had exams. The most significant change to the add-on was boiling two gestures down into one. Formerly, there were separate gestures for getting results as a spoken message and in a virtual result window. Since both present the same thing, it's much easier to have a single gesture that you press a different number of times than two separate ones. Getting this working was frustratingly hard, especially since I had worked with it before: single vs. double gesture presses were used for filtering/not filtering non-graphic elements before that option was moved into the settings. NVDA's scriptHandler.py makes the mechanism itself rather simple. All you have to do is call scriptHandler.getLastScriptRepeatCount() and do different things based on the value returned. So the code would look something like this:

scriptCount = scriptHandler.getLastScriptRepeatCount()
if scriptCount == 0:
    speakResult()  # pressed once: speak the result (hypothetical helper)
else:
    showResultWindow()  # pressed again quickly: virtual result window (hypothetical helper)

Testing

Installation

  • Add-on installs without any errors
    • Add-on can be installed from the Add-ons Manager menu
    • Add-on can be directly installed from the .nvda-addon file

Normal function

  • Add-on responds to the gesture set by the user

Week 11

8 August, 2020

Hello!😄 This blog is going to be a little different as I won't just talk about the things I did this week, but also some of the things I've learnt during this experience.

The biggest and most significant change (or rather, feature) I made to the add-ons was caching results. I decided that the YOLOv3 416 model was the best model to ship with the object detection add-on, given its accuracy, size, and latency. However, it is significantly slower than the tiny-YOLOv3 model, which meant that both the object detection and image captioning add-ons took a minimum of 5 seconds to produce results. It made no sense to wait that long for the same result again, so I needed to cache results for the session. Initially, I tried to determine whether the detection process was started on the same image by using the navigatorObject. However, this was a dead end since navigator objects are not uniquely identifiable. So I tried creating a hash out of the image, and that worked like a charm! Here's the code for it:

rowHashe
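The gist preview is cut off above, so here is a minimal sketch of the caching idea, assuming a PIL-style image object and a plain dict as the session cache; the names imageCache and getDetections are placeholders of mine, not the add-on's actual code:

import hashlib

imageCache = {}  # session-level cache: image hash -> detection result

def getCachedResult(image):
    # Hash the raw pixel bytes so the same image always maps to the same key
    key = hashlib.md5(image.tobytes()).hexdigest()
    if key not in imageCache:
        imageCache[key] = getDetections(image)  # hypothetical detection call
    return imageCache[key]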

Week 10

2 August, 2020

Hello!😄 I spent this week working on adding the “bounding box” feature to the object detection add-on. The idea is to draw boxes around the detected objects and, when users move the mouse pointer or their finger (in the case of touchscreen computers) over a box, announce the object's label. This allows users to understand not just the objects in the image but also their relative positions. For example, "a man above a bicycle" and "a man beside a bicycle" paint very different pictures.

I first attempted to achieve this feature using the wxPython library. I came up with three solutions that you can find here. I won't explain these solutions in detail because I didn't end up using them. However, the reasons for not implementing them are relevant: all three solutions had drawbacks and, I felt, didn't make for a good user experience. The biggest reason though is that NV

screenDC.py (created August 2, 2020): wxPython bounding boxes
import wx

class Frame(wx.Frame):
    def __init__(self, boxes):
        super(Frame, self).__init__(None, title="Bounding boxes")
        self.boxes = boxes
        self.boundingBoxes = []
        self.status = []
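For a flavour of what drawing on the screen with wxPython looks like, here is a minimal sketch (not the add-on's actual code) that outlines a single box with wx.ScreenDC; the coordinates are made up:

import wx

app = wx.App()
dc = wx.ScreenDC()
dc.SetPen(wx.Pen(wx.Colour(255, 0, 0), 2))  # 2px red outline
dc.SetBrush(wx.TRANSPARENT_BRUSH)           # don't fill the rectangle
dc.DrawRectangle(100, 100, 200, 150)        # x, y, width, height
del dc                                      # release the screen DC

One known drawback of this approach is that anything drawn this way is wiped out whenever the screen underneath repaints.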

Week 9

25 July, 2020

Hello again! Let's start with a simple but dangerous mistake I made last week. It turns out that copying a std::string object character by character into a char* array isn't a good thing to do. The technique may seem simple and innocent enough, but it is prone to security risks like buffer overflow attacks. Using a standard library function to accomplish such tasks is always safer! Well, almost always: strcpy() itself was superseded by the bounds-checked strcpy_s() precisely because the latter is safer. So I switched over to using strcpy_s() and we were ready to go!

I spent most of this week working around the restrictions of NVDA's contentRecog module. The module seems to have been written with just OCR in mind, so it doesn't lend itself too well to any other kind of add-on/feature. For example, it is hard-coded to present recognition results in the form of a virtual window. Another issue is that the recognition result itself isn't very accessible, so it cannot be stored for processing or any other use. For my add-
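For context, here is a rough sketch of how a recognizer plugs into the module, based on my reading of NVDA's 2020-era source; the class and method names reflect that reading and may not match the add-on's code exactly:

from contentRecog import ContentRecognizer
from contentRecog.recogUi import recognizeNavigatorObject

class ObjectDetectionRecognizer(ContentRecognizer):
    def recognize(self, pixels, imgInfo, onResult):
        # Run detection on the captured pixels, then hand the result
        # to the onResult callback. recogUi then presents it in a
        # virtual result window: the hard-coded behaviour noted above.
        ...

# Kicked off with: recognizeNavigatorObject(ObjectDetectionRecognizer())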

Week 8

18 July, 2020

Lots of coding this past week! I started out trying to fix most of the issues with the add-on release, the biggest of which was that users were getting the “Cannot identify any objects in the image” message more often than useful results. After looking into it a little deeper, I discovered three potential problems that might be contributing to this issue.

  1. The model really couldn't identify any objects in the image. This was the most obvious problem but also one over which I had no control. The release was shipped with the tiniest (lol) of the 3 models, Tiny-YOLOv3. This was by choice, since we didn't want anyone to be turned away from testing the add-on because of the download size. Of course, choosing a small model means the results won't be too good.
  2. Users were trying to run object detection on non-image elements on their screen. This seems a little unlikely, but it was a case that needed to be handled anyway (see the sketch below). Unfortunately, contentRecog.recogUi.recognizeNavigatorObject di
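A minimal sketch of how such a guard might look, using NVDA's navigator object API (the exact check in the add-on may differ; ROLE_GRAPHIC is the 2020-era name for the graphic role):

import api
import controlTypes
import ui

def isDetectableImage():
    # Only allow detection when the navigator object is a graphic element
    obj = api.getNavigatorObject()
    if obj.role != controlTypes.ROLE_GRAPHIC:
        ui.message("Not an image element")  # hypothetical message text
        return False
    return True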

Week 7

11 July, 2020

This was a very slow week: more thinking and decision-making, less coding. There were a few issues with the file structure of the add-on. It turns out NVDA expects every Python file in the globalPlugins directory to define a GlobalPlugin class derived from globalPluginHandler.GlobalPlugin. After spending an embarrassingly long time on it (read: two days), I was finally able to solve the problem by packaging all the code as a Python package. You can see the code here.
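A sketch of the resulting layout, assuming a package name like objectDetection (the actual name in the repository may differ): a package directory whose __init__.py exposes the GlobalPlugin class, with the rest of the code living in submodules that NVDA never scans directly.

# globalPlugins/objectDetection/__init__.py
import globalPluginHandler

class GlobalPlugin(globalPluginHandler.GlobalPlugin):
    # Gesture bindings and scripts live here; helper code can be
    # imported from sibling modules inside the package.
    ...

Because only the package's __init__.py is treated as a plugin entry point, the helper files no longer trip NVDA's expectations.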

After discussions with my mentor Reef, I came to realize that I may have been focused on the wrong things. With such projects, it is quite easy to lose track of your initial goals and fly off on a tangent. I had started worrying about things like the size and speed of the object detection models and lost focus on the real goal: to make an add-on that is useful and user-friendly for non-visual users. I wished to release the add-on and get feedback on which model the users thin

Week 6

4 July, 2020

Last week, we created a DLL for the YOLOv3 darknet models and a client that could use it in Python. I started this week by using the outputs of the model, which are of the form:

struct Detection {
    int classId;        // index of the detected object's class
    float probability;  // detection confidence
    int x1;             // bounding-box corners (x1, y1) and (x2, y2);
    int y1;             // the fields after x1 are inferred from the
    int x2;             // truncated original listing
    int y2;
};
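Since the post is cut off here, a minimal sketch of how such a struct could be consumed from Python with ctypes; the DLL path and the field list after probability are my assumptions, not confirmed by the post:

import ctypes

class Detection(ctypes.Structure):
    # Mirrors the C struct above; fields after probability are assumed
    _fields_ = [
        ("classId", ctypes.c_int),
        ("probability", ctypes.c_float),
        ("x1", ctypes.c_int),
        ("y1", ctypes.c_int),
        ("x2", ctypes.c_int),
        ("y2", ctypes.c_int),
    ]

# lib = ctypes.CDLL("path/to/yolov3.dll")  # hypothetical DLL path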