
Week 6

4 July, 2020

Last week, we created a DLL for the YOLOv3 darknet models and a client that could use it in Python. I started this week by taking the outputs of the model, which are of the form

struct Detection {
    int classId;
    float probability;
    int x1;
    int y1;
    int x2;
    int y2;
};

and converting this data into results users can understand. More importantly, into information users actually care about. While discussing this project, my mentors and I realized that a sentence that simply lists out the results of the object detection model is enough to paint a picture in the user's mind, which is exactly what this project aims to do. For example, the sentence "This image contains a person and bicycle" should be enough to make you imagine a person riding a bicycle. The person could just be standing next to a bicycle in the image and the model would still produce the same result, but it's a start. Later, we will move on to implementing more sophisticated Image Captioning models that can understand the relationships between objects and create more complex sentences. For now, however, we just use the singular and plural forms of each object that could be detected and construct a grammatically correct sentence out of them. Here's the code I used to achieve this:

from collections import Counter

# map each detection to the singular name of its class
classLabels = [CLASSES_SINGULAR[d.classId] for d in objects]
counts = Counter(classLabels)
number_of_items = len(counts)

output_string = "The image contains "
for i, key in enumerate(counts.keys()):
	if counts[key] == 1:
		output_string = output_string + key
	# if there are multiple instances of the same object in the image, use the plural form
	else:
		output_string = output_string + CLASSES_PLURAL[CLASSES_SINGULAR.index(key)]

	# rules for listing out the identified objects
	if i < (number_of_items - 2):
		# add commas only if there are more than two objects
		if number_of_items > 2:
			output_string = output_string + ", "
		else:
			output_string = output_string + " "
	# add 'and' after the second-last object
	elif i == (number_of_items - 2):
		output_string = output_string + " and "
	# end the sentence with a full stop
	else:
		output_string = output_string + "."

In essence, we first use collections.Counter to determine how many instances of each object were detected in the image. If there is more than one, we use the plural form of the object's name, and the singular form otherwise. The rest of the code just adds commas between the listed items and an 'and' after the second-last item.
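To make that concrete, here's a tiny, made-up example. The class lists and the SimpleNamespace detections below are hypothetical stand-ins; the real lists cover every class the YOLOv3 model can detect, and the real detections come from the DLL.

from types import SimpleNamespace

# Hypothetical class lists; the real ones cover every class the model knows about.
CLASSES_SINGULAR = ["person", "bicycle", "dog"]
CLASSES_PLURAL = ["people", "bicycles", "dogs"]

# Fake detections standing in for the DLL output: two people and one bicycle.
objects = [
	SimpleNamespace(classId=0),
	SimpleNamespace(classId=0),
	SimpleNamespace(classId=1),
]

# Running the snippet above on these objects produces:
# "The image contains people and bicycle."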

After doing the same with the DETR model, I finally moved on to incorporating these models into NVDA! Since the models are quite large (33–250 MB), we decided to package the project as add-ons that users can download separately from NVDA. I started by doing some research to understand how NVDA add-ons work and to see how others have achieved similar goals. The NVDA Add-on Development Guide does an amazing job of explaining everything you might need to know, with examples! I also referred to this PR, which contained pretty much all the code I needed to write.

The first step was being able to detect 'gestures' (keyboard shortcuts) from the user, irrespective of what's currently on the screen. For this, we use a globalPlugin. Once we can detect gestures, we tie a gesture to a script that calls one of our object detection models. This is as easy as:

import globalPluginHandler
from contentRecog import recogUi

class GlobalPlugin(globalPluginHandler.GlobalPlugin):

	def script_detectObjectsTinyYOLOv3(self, gesture):
		recognizer = doDetectionTinyYOLOv3()
		recogUi.recognizeNavigatorObject(recognizer)

	__gestures = {
		"kb:NVDA+A": "detectObjectsTinyYOLOv3"
	}

With this, whenever the user presses NVDA+A, the script_detectObjectsTinyYOLOv3 method is called. But what is this doDetectionTinyYOLOv3 class, you ask? It's just a class I created that inherits from the contentRecog.ContentRecognizer class, which can be found here. This class is responsible for finding the image element on the screen, working out its coordinates, taking a 'screenshot' of the image, saving it in the Temp folder, starting the detection on a background thread and then doing whatever you want with the result. It may sound complicated, but it's just a little bit of code:

import os
import tempfile
import threading

import wx
import contentRecog

class doDetectionTinyYOLOv3(contentRecog.ContentRecognizer):

	def recognize(self, pixels, imgInfo, onResult):
		# save the captured pixels as a temporary JPEG for the model to read
		bmp = wx.EmptyBitmap(imgInfo.recogWidth, imgInfo.recogHeight, 32)
		bmp.CopyFromBuffer(pixels, wx.BitmapBufferFormat_RGB32)
		self._imagePath = tempfile.mktemp(prefix="nvda_ObjectDetect_", suffix=".jpg")
		bmp.SaveFile(self._imagePath, wx.BITMAP_TYPE_JPEG)
		self._onResult = onResult
		# run the detection on a background thread so NVDA stays responsive
		t = threading.Thread(target=self._bgRecog)
		t.daemon = True
		t.start()

	def _bgRecog(self):
		try:
			result = self.detect(self._imagePath)
		except Exception as e:
			result = e
		finally:
			os.remove(self._imagePath)
		if self._onResult:
			self._onResult(result)

	def cancel(self):
		self._onResult = None

	def detect(self, imagePath):
		result = YOLOv3Detection(imagePath, tiny=True).getSentence()
		return contentRecog.SimpleTextResult(result)

The YOLOv3Detection(imagePath, tiny=True).getSentence() call in the detect method above calls our Python client for the YOLOv3 DLL and returns the result in the form of a sentence. Now, we need to pass a doDetectionTinyYOLOv3 object to recogUi.recognizeNavigatorObject(), which can be found here. This function is responsible for managing the detection thread: it calls the recognize method, manages multiple detection sessions (by not allowing the user to start more than one) and then presents the result to the user. To present the result, an invisible editText-like element is created and given focus so that the user can use ctrl+arrow keys to navigate the result. This makes it easy for the user to hear the result multiple times, listen to specific parts of it and even copy it.
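For the curious, here's a rough, conceptual sketch of what that client looks like. The DLL path and the doDetection/getDetections export names are made up for illustration and are not the real exports; the point is just that the Detection struct from the top of this post is mirrored with ctypes, and the returned detections are then turned into a sentence as shown earlier.

import ctypes

# A rough ctypes mirror of the C Detection struct shown at the top of this post.
class Detection(ctypes.Structure):
	_fields_ = [
		("classId", ctypes.c_int),
		("probability", ctypes.c_float),
		("x1", ctypes.c_int),
		("y1", ctypes.c_int),
		("x2", ctypes.c_int),
		("y2", ctypes.c_int),
	]

# The DLL path and exported function names below are placeholders for illustration.
lib = ctypes.CDLL("YOLOv3-tiny.dll")
lib.doDetection.argtypes = [ctypes.c_char_p]
lib.doDetection.restype = ctypes.c_int

def detectObjects(imagePath):
	# Run detection and copy the results into an array of Detection structs;
	# the sentence is then built from these exactly as shown earlier.
	count = lib.doDetection(imagePath.encode("utf-8"))
	detections = (Detection * count)()
	lib.getDetections(detections, count)
	return list(detections)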

Next week, we tie all this code together and package it into a pre-release add-on that users can download, test and provide feedback on. We still don't know which model would be best, or if users want the option to have all of them! We also need to highlight the detected objects on the screen with bounding boxes so users know where the objects are in the image.
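One possible way to do that highlighting, sketched below with made-up names and no claim that this is how it will actually be implemented, is to draw rectangles on a wx.ScreenDC using each Detection's x1/y1/x2/y2 coordinates, offset by the image element's position on the screen:

import wx

def highlightDetections(detections, imageLeft, imageTop):
	# Draw a red rectangle around each detected object, directly on the screen.
	# The detection coordinates are relative to the captured image, so they are
	# offset by the image element's top-left corner on the screen.
	dc = wx.ScreenDC()
	dc.SetPen(wx.Pen(wx.RED, 3))
	dc.SetBrush(wx.TRANSPARENT_BRUSH)
	for d in detections:
		dc.DrawRectangle(
			imageLeft + d.x1,
			imageTop + d.y1,
			d.x2 - d.x1,
			d.y2 - d.y1,
		)

Anything drawn this way disappears the next time the screen repaints, so a real implementation would probably need a transparent overlay window instead.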
