
Image captioning and Object detection add-ons for NVDA

GSoC 2020 | NV ACCESS | Shubham Dilip Jain

Final Report

Sample output produced using the object detection add-on, featuring the result "The image contains people, a dog, and a backpack." as embedded text and coloured boxes drawn around each of these objects in the image.

Introduction

The internet today is rich in image content, from entire websites like Instagram and Pinterest dedicated to curating and displaying images, to sites like Facebook and Reddit that host large amounts of content in image form. Non-visual users find it challenging to navigate these websites and use them for their intended purpose. The information in images, whether on the internet or stored locally, is also inaccessible to them.

Week 12

22 August, 2020

Hello again! I couldn't write a blog the previous week or even get much work done because I had exams. The most significant change to the add-on was boiling two gestures down into one. Formerly, there were separate gestures for getting results as a spoken message and in a virtual result window. Since both present the same thing, it's much easier to have a single gesture that you press a different number of times than two separate ones. Getting this working was frustratingly hard, especially since I had worked with it before: single vs. double gesture presses were used for filtering/not filtering non-graphic elements before that option was moved into the settings. NVDA's scriptHandler.py makes the mechanism itself rather simple. All you have to do is call scriptHandler.getLastScriptRepeatCount() and do different things based on the value returned. So the code would look something like this:

scriptCount = scriptHandler.getLastScriptRepeatCount()
if scriptCount == 0:
    speakResult()  # pressed once: speak the result (hypothetical helper)
else:
    showResultWindow()  # pressed again quickly: virtual result window (hypothetical helper)

Testing

Installation

  • Add-on installs without any errors
    • Add-on can be installed from the Add-ons Manager menu
    • Add-on can be directly installed from the .nvda-addon file

Normal function

  • Add-on responds to the gesture set by the user

Week 11

8 August, 2020

Hello!😄 This blog is going to be a little different as I won't just talk about the things I did this week, but also some of the things I've learnt during this experience.

The biggest and most significant change (or rather, feature) I made to the add-ons was caching results. I decided that the YOLOv3 416 model was the best model to ship with the object detection add-on, given its accuracy, size, and latency. However, it is significantly slower than the tiny-YOLOv3 model, which meant that both the object detection and image captioning add-ons took a minimum of 5 seconds to produce results. It made no sense to wait that long for the same result again, so I needed to cache results for the session. Initially, I tried to determine whether the detection process was started on the same image by using the navigatorObject. However, this was a dead end since navigator objects are not uniquely identifiable. So I tried creating a hash out of the image, and that worked like a charm! Here's the code for it:

rowHashe
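The gist preview is cut off above, so here is a minimal sketch of the caching idea, assuming a PIL-style image object and a plain dict as the session cache; the names imageCache and getDetections are placeholders of mine, not the add-on's actual code:

import hashlib

imageCache = {}  # session-level cache: image hash -> detection result

def getCachedResult(image):
    # Hash the raw pixel bytes so the same image always maps to the same key
    key = hashlib.md5(image.tobytes()).hexdigest()
    if key not in imageCache:
        imageCache[key] = getDetections(image)  # hypothetical detection call
    return imageCache[key]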

Week 10

2 August, 2020

Hello!😄 I spent this week working on adding the “bounding box” feature to the object detection add-on. The idea is to draw boxes around the detected objects and, when users move the mouse pointer or their finger (in the case of touchscreen computers) over a box, announce the object's label. This allows users to understand not just the objects in the image but also their relative positions. For example, "a man above a bicycle" and "a man beside a bicycle" paint very different pictures.

I first attempted to achieve this feature using the wxPython library. I came up with three solutions that you can find here. I won't explain these solutions in detail because I didn't end up using them. However, the reasons for not implementing them are relevant: all three solutions had drawbacks and, I felt, didn't make for a good user experience. The biggest reason though is that NV

screenDC.py (created August 2, 2020): wxPython bounding boxes
import wx

class Frame(wx.Frame):
    def __init__(self, boxes):
        super(Frame, self).__init__(None, title="Bounding boxes")
        self.boxes = boxes
        self.boundingBoxes = []
        self.status = []
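For a flavour of what drawing on the screen with wxPython looks like, here is a minimal sketch (not the add-on's actual code) that outlines a single box with wx.ScreenDC; the coordinates are made up:

import wx

app = wx.App()
dc = wx.ScreenDC()
dc.SetPen(wx.Pen(wx.Colour(255, 0, 0), 2))  # 2px red outline
dc.SetBrush(wx.TRANSPARENT_BRUSH)           # don't fill the rectangle
dc.DrawRectangle(100, 100, 200, 150)        # x, y, width, height
del dc                                      # release the screen DC

One known drawback of this approach is that anything drawn this way is wiped out whenever the screen underneath repaints.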

Week 9

25 July, 2020

Hello again! Let's start with a simple but dangerous mistake I made last week. It turns out that copying a std::string object character by character into a char* array isn't a good thing to do. The technique may seem simple and innocent enough, but it is prone to security risks like buffer overflow attacks. Using a standard library function to accomplish such tasks is always safer! Well, almost always: strcpy() itself was superseded by the bounds-checked strcpy_s() precisely because the latter is safer. So I switched over to using strcpy_s() and we were ready to go!

I spent most of this week working around the restrictions of NVDA's contentRecog module. The module seems to have been written with just OCR in mind, so it doesn't lend itself too well to any other kind of add-on/feature. For example, it is hard-coded to present recognition results in the form of a virtual window. Another issue is that the recognition result itself isn't very accessible, so it cannot be stored for processing or any other use. For my add-
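For context, here is a rough sketch of how a recognizer plugs into the module, based on my reading of NVDA's 2020-era source; the class and method names reflect that reading and may not match the add-on's code exactly:

from contentRecog import ContentRecognizer
from contentRecog.recogUi import recognizeNavigatorObject

class ObjectDetectionRecognizer(ContentRecognizer):
    def recognize(self, pixels, imgInfo, onResult):
        # Run detection on the captured pixels, then hand the result
        # to the onResult callback. recogUi then presents it in a
        # virtual result window: the hard-coded behaviour noted above.
        ...

# Kicked off with: recognizeNavigatorObject(ObjectDetectionRecognizer())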

Week 8

18 July, 2020

Lots of coding this past week! I started out trying to fix most of the issues with the add-on release, the biggest of which was that users were getting the “Cannot identify any objects in the image” message more often than useful results. After looking into it a little deeper, I discovered three potential problems that might be contributing to this issue.

  1. The model really couldn't identify any objects in the image. This was the most obvious problem but also one over which I had no control. The release was shipped with the tiniest (lol) of the 3 models, Tiny-YOLOv3. This was by choice, since we didn't want anyone to be turned away from testing the add-on because of the download size. Of course, choosing a small model means the results won't be too good.
  2. Users were trying to run object detection on non-image elements on their screen. This seems a little unlikely, but it was a case that needed to be handled anyway (see the sketch below). Unfortunately, contentRecog.recogUi.recognizeNavigatorObject di
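A minimal sketch of how such a guard might look, using NVDA's navigator object API (the exact check in the add-on may differ; ROLE_GRAPHIC is the 2020-era name for the graphic role):

import api
import controlTypes
import ui

def isDetectableImage():
    # Only allow detection when the navigator object is a graphic element
    obj = api.getNavigatorObject()
    if obj.role != controlTypes.ROLE_GRAPHIC:
        ui.message("Not an image element")  # hypothetical message text
        return False
    return True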

Week 7

11 July, 2020

This was a very slow week: more thinking and decision-making, less coding. There were a few issues with the file structure of the add-on. It turns out NVDA expects every Python file in the globalPlugins directory to define a GlobalPlugin class derived from globalPluginHandler.GlobalPlugin. After spending an embarrassingly long time on it (read: two days), I was finally able to solve the problem by packaging all the code as a Python package. You can see the code here.
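A sketch of the resulting layout, assuming a package name like objectDetection (the actual name in the repository may differ): a package directory whose __init__.py exposes the GlobalPlugin class, with the rest of the code living in submodules that NVDA never scans directly.

# globalPlugins/objectDetection/__init__.py
import globalPluginHandler

class GlobalPlugin(globalPluginHandler.GlobalPlugin):
    # Gesture bindings and scripts live here; helper code can be
    # imported from sibling modules inside the package.
    ...

Because only the package's __init__.py is treated as a plugin entry point, the helper files no longer trip NVDA's expectations.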

After discussions with my mentor Reef, I came to realize that I may have been focused on the wrong things. With such projects, it is quite easy to lose track of your initial goals and fly off on a tangent. I had started worrying about things like the size and speed of the object detection models and lost focus on the real goal: to make an add-on that is useful and user-friendly for non-visual users. I wished to release the add-on and get feedback on which model the users thin

Week 6

4 July, 2020

Last week, we created a DLL for the YOLOv3 darknet models and a client that could use it in Python. I started this week by using the outputs of the model, which are of the form:

struct Detection {
    int classId;        // index of the detected object's class
    float probability;  // detection confidence
    int x1;             // bounding-box corners (x1, y1) and (x2, y2);
    int y1;             // the fields after x1 are inferred from the
    int x2;             // truncated original listing
    int y2;
};
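Since the post is cut off here, a minimal sketch of how such a struct could be consumed from Python with ctypes; the DLL path and the field list after probability are my assumptions, not confirmed by the post:

import ctypes

class Detection(ctypes.Structure):
    # Mirrors the C struct above; fields after probability are assumed
    _fields_ = [
        ("classId", ctypes.c_int),
        ("probability", ctypes.c_float),
        ("x1", ctypes.c_int),
        ("y1", ctypes.c_int),
        ("x2", ctypes.c_int),
        ("y2", ctypes.c_int),
    ]

# lib = ctypes.CDLL("path/to/yolov3.dll")  # hypothetical DLL path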