@ShubhamJain7
Last active August 2, 2020 17:16

Week 10

2 August, 2020

Hello!😄 I spent this week working on adding the “bounding box” feature to the Object Detection add-on. The idea is to draw boxes around the detected objects so that when users move the mouse pointer or their finger (in the case of touchscreen computers) over a box, the object's label is announced. This allows users to understand not just the objects in the image but also their relative positions. For example, "a man above a bicycle" and "a man beside a bicycle" paint very different pictures.

I first attempted to achieve this feature using the wxPython library. I came up with three solutions that you can find here. I will not get into explaining these solutions because I didn't end up using them. However, the reasons for not implementing them are relevant. All three solutions had drawbacks, and I felt they didn't make for a good user experience. The biggest reason, though, is that NVDA ships with the Focus Highlight feature, which does something very similar. Why re-invent the wheel, right? As the name suggests, Focus Highlight draws highlights around the currently focused elements on the screen. You can read more about it here. In NVDA's terminology, it is what's called a visionEnhancementProvider, which is a small part of NVDA's vision enhancement framework. I ended up copying the code for Focus Highlight, removing all the unnecessary parts and modifying the rest to fit my needs, and voilà! We can now draw boxes on the screen. There was, however, one small problem.
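The core of such a highlighter is just a list of boxes that a paint loop keeps redrawing. Here is a minimal, hypothetical sketch of that idea (the class and method names are mine, not NVDA's, and the actual drawing call is abstracted away):

```python
# Hypothetical sketch of a screen highlighter's box bookkeeping.
# In the real add-on, drawing happens via NVDA's vision framework;
# here draw_rect stands in for whatever paints a rectangle on screen.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in screen coords


@dataclass
class ScreenHighlighter:
    boxes: List[Box] = field(default_factory=list)

    def add_box(self, box: Box) -> None:
        self.boxes.append(box)

    def clear(self) -> None:
        # Called when focus changes: the next refresh paints nothing,
        # so the boxes disappear.
        self.boxes.clear()

    def refresh(self, draw_rect: Callable[[int, int, int, int], None]) -> None:
        # Called continuously as the screen refreshes.
        for box in self.boxes:
            draw_rect(*box)
```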

The object detection model returns the "in-image" coordinates (in [left, top, width, height] format), which I then need to convert to "in-screen" coordinates. Even after doing this conversion, my boxes were being drawn at the wrong location. The width and height seemed right, but the position on the screen was wrong. Initially, we thought it might have been due to screen scaling or the coordinate system used. In the end, I discovered that the fault was much simpler and sillier than I could've imagined. I had ended up storing the detected coordinates as [top, left, width, height] instead of [left, top, width, height] way back in the DLL files😓. See here for more detail.
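To illustrate, the conversion itself is just an offset by where the image sits on the screen, and the bug was the first two values arriving swapped. A small sketch (function names are mine, for illustration only):

```python
# Illustrative sketch of the coordinate conversion and the swapped-order bug.
# The model returns in-image boxes as [left, top, width, height]; to draw them
# on screen, they must be offset by the image's position on the screen.

def image_box_to_screen(box, image_left, image_top):
    """Convert an in-image [left, top, width, height] box to screen coords."""
    left, top, width, height = box
    return [image_left + left, image_top + top, width, height]


def fix_swapped_box(box):
    """Undo the bug: a box stored as [top, left, w, h] back to [left, top, w, h].

    Width and height are untouched, which is why they looked correct
    while the position was wrong.
    """
    top, left, width, height = box
    return [left, top, width, height]
```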

Making the boxes disappear when the user changed focus to a different element was the next goal. I needed to convert my highlighter script into a visionEnhancementProvider. The advantage of this is that NVDA picks it up automatically and initializes the highlighter on start-up, and also that I can receive event notifications for things like focus changes and mouse movement. Exactly what I needed. After a bit of effort, I managed to get NVDA to identify my highlighter as a visionEnhancementProvider and also receive event notifications from it. Just clear the list storing the boxes to draw (drawn continuously in a loop, as the screen keeps refreshing even though you can't see it happen) and we're done! Making NVDA announce the object label when the user's mouse pointer or finger enters a box was fairly simple. I used the same logic I used in the three wxPython solutions I talked about earlier (not a completely futile effort, huh?). I still need to make the boxes disappear on pressing ESC and finalize the UX part of the add-on.
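The announce-on-hover logic boils down to a point-in-rectangle test against the stored boxes. A minimal sketch of that check, under my own illustrative names (the real add-on gets the pointer position from NVDA's mouse events):

```python
# Hypothetical hit-test: given the pointer's screen position, find which
# detected object's box (if any) it is inside, so its label can be announced.

def label_at_point(x, y, boxes):
    """Return the label of the first box containing screen point (x, y).

    Each box is (left, top, width, height, label) in screen coordinates.
    Returns None when the pointer is outside every box.
    """
    for left, top, width, height, label in boxes:
        if left <= x < left + width and top <= y < top + height:
            return label
    return None
```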

On the other hand, I spent all of yesterday stringing together and testing the Image Captioning add-on. I also managed to create a release of it for user feedback (which is also why this blog post is a day late). The code for the two add-ons is pretty similar except for the bounding-boxes part, so it was pretty easy.
