Week 8

18 July, 2020

Lots of coding this past week! I started out trying to fix most of the issues found in the add-on release, the biggest of which was that users were getting the “Cannot identify any objects in the image” message more often than useful results. After looking into it a little deeper, I discovered three potential problems that might be contributing to this issue.

  1. The model really couldn't identify any objects in the image. This was the most obvious one, but also the one over which I had no control. The release was shipped with the tiniest (lol) of the 3 models, Tiny-YOLOv3. This was a deliberate choice since we didn't want anyone to be turned away from testing the add-on because of the download size. Of course, choosing a small model means the results won't be very good.
  2. Users were trying to run object detection on non-image elements on their screen. This seems a little unlikely, but it was a case that needed to be handled anyway. Unfortunately, contentRecog.recogUi.recognizeNavigatorObject didn't have any mechanism to perform validation on the navigator object being run through the model (more on this later). Running the model on non-image elements, even by mistake, only wastes the user's time.
  3. The gesture used to trigger the model, Alt+NVDA+D, clashed with the gesture browsers use to focus the address bar. The only time this didn't happen was when no element was in focus, which meant that a screenshot of the entire browser window, including the targeted image, was run through the model. This produced results, but not in the intended way. I discovered this by adding a breakpoint before the temporary screenshot was deleted and looking at what was actually being passed to the model. (Always use the debugger, kids.) This one was easy to solve by adding the gesture under the Vision category and allowing users to edit it if it clashed with any other gesture (see the sketch after this list).
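
As a rough sketch of the gesture fix (not the add-on's actual code): NVDA's script decorator lets a script declare a default gesture and a category, and anything registered this way shows up in the Input Gestures dialog, where users can rebind it if it clashes with something else. The script body and messages below are placeholders.

```python
import globalPluginHandler
import ui
from scriptHandler import script


class GlobalPlugin(globalPluginHandler.GlobalPlugin):

    @script(
        # Description and category shown in NVDA's Input Gestures dialog
        description="Runs object detection on the current navigator object",
        category="Vision",
        # Default binding; users can change it if it conflicts with a browser shortcut
        gesture="kb:NVDA+alt+d",
    )
    def script_detectObjects(self, gesture):
        # Placeholder: the real script validates the navigator object and
        # runs it through the model.
        ui.message("Running object detection...")
```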

I figured that images smaller than 128x128 pixels wouldn't be large enough to produce good results, so they needed to be filtered out before being passed to the model. An added benefit of this filtering is that most non-image UI elements, like buttons and address bars, get filtered out too. Again, this had to be done within contentRecog.recogUi.recognizeNavigatorObject; since that function lacked any such validation, I ended up writing my own implementation and creating PRs that add the same checks in the NVDA repo. Next, I needed a definitive way of filtering out all non-image elements. It turns out NVDA objects have a role attribute that takes the value ROLE_GRAPHIC for image elements. Unfortunately, it isn't very reliable: images in the browser can sometimes have ROLE_LINK, and images opened in the default Windows 10 photo viewer app have ROLE_STATICTEXT. The check isn't entirely useless though. Since blind users can't always tell whether an element on their screen is actually an image, filtering out elements whose role isn't ROLE_GRAPHIC could save them some time when they trigger the model by mistake. Of course, it would still be possible to run the model on any element through a separate command. A rough sketch of this validation is below.
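
Below is a minimal sketch of the kind of validation I mean, using NVDA's api, controlTypes and ui modules. The 128x128 threshold comes from the paragraph above; the function name and messages are my own, not NVDA's.

```python
import api
import controlTypes
import ui

MIN_SIZE = 128  # below this, the model rarely produces useful results


def getValidatedNavigatorObject():
    obj = api.getNavigatorObject()
    # obj.location is a rectangle (left, top, width, height) in screen pixels
    location = obj.location
    if not location or location.width < MIN_SIZE or location.height < MIN_SIZE:
        ui.message("Image too small to produce useful results")
        return None
    # ROLE_GRAPHIC is the usual role for images, but browsers sometimes report
    # ROLE_LINK and the Windows 10 Photos app reports ROLE_STATICTEXT, so this
    # check is best-effort rather than definitive.
    if obj.role != controlTypes.ROLE_GRAPHIC:
        ui.message("The current element doesn't appear to be an image")
        return None
    return obj
```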

Another issue was that some users use a screen curtain, which essentially blanks out the entire screen. Since the add-on relies on taking screenshots of the relevant part of the screen, a screen curtain renders it useless. Seeing no way around the curtain (:p), I decided to simply prompt the user when they try to use the add-on with the screen curtain enabled (sketched below).
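
A small sketch of that behaviour, assuming a hypothetical isScreenCurtainEnabled() helper (NVDA's vision framework can report which enhancement providers are active, but I've omitted the exact lookup here):

```python
import ui


def isScreenCurtainEnabled():
    # Hypothetical check: ask NVDA's vision handler whether the
    # screen curtain provider is currently active.
    raise NotImplementedError


def runDetection(detect):
    if isScreenCurtainEnabled():
        # A screenshot taken now would just be a black rectangle, so warn the
        # user instead of wasting their time running the model on it.
        ui.message("Object detection cannot be used while the screen curtain is enabled")
        return
    detect()
```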

Among other things, contentRecog.recogUi.recognizeNavigatorObject is also very rigid in how it presents results. I still need to modify it so that users are free to have the result presented in multiple ways, not just the hardcoded "virtual window"; something like the sketch below.
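
Roughly, the change I have in mind looks like this: let the caller pass in a presenter instead of always building the browseable window. The function and handler names below are illustrative, not NVDA's actual signature.

```python
import ui


def speakResult(text):
    # Just speak (and braille) the result text.
    ui.message(text)


def browseResult(text):
    # Present the result in a browseable window the user can review and copy from.
    ui.browseableMessage(text, "Image description")


def recognizeNavigatorObject(recognizer, resultHandler=browseResult):
    # ...capture the navigator object and run it through the recognizer...
    result = "a dog sitting on a couch"  # placeholder for the model's output
    # Hand the text to whichever presenter the caller chose, instead of
    # always building the hardcoded virtual window.
    resultHandler(result)
```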

In between all this, I also converted an image captioning model into ONNX format and wrote the C++ code to run it! This went much more smoothly, partly because the model lent itself to ONNX conversion more easily and partly because experience helps. You can see that code here. Converting this to a DLL was... not easy. I struggled with how C++ and C handle strings and pointers. I eventually figured out that returning a pointer to something created inside a function wouldn't work, since the function's local variables live on its stack frame and that memory is reclaimed as soon as the function returns. Next, I tried having the caller pass in a string buffer for the callee to fill. For some odd reason, no matter what I tried, the memory wasn't being written to. I suspected I wasn't converting std::string to char * properly and that that was the issue. After experimenting with simpler code in an IDE, I figured out the correct way, but that didn't work in the DLL either. In the end, I copied the std::string character by character into the char * buffer passed by the calling function, and it finally worked! I hope to get this up and running as a separate add-on by the end of next week.
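
For reference, here is what the caller-allocated buffer pattern looks like from the Python side with ctypes. The DLL and function names are hypothetical; the C++ export is assumed to copy its std::string result into the supplied char * buffer, as described above, rather than returning a pointer to stack memory.

```python
import ctypes

dll = ctypes.CDLL("captioner.dll")  # hypothetical DLL name

# Hypothetical export: int getCaption(const char* imagePath, char* out, int outSize)
dll.getCaption.argtypes = [ctypes.c_char_p, ctypes.POINTER(ctypes.c_char), ctypes.c_int]
dll.getCaption.restype = ctypes.c_int

# The caller owns this buffer, so it is still valid after the DLL call returns.
out = ctypes.create_string_buffer(1024)
dll.getCaption(b"image.jpg", out, ctypes.sizeof(out))
print(out.value.decode("utf-8"))  # caption copied into the buffer by the C++ side
```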
