Welp, thankfully I was wrong about the model being wrong. It was my code that was wrong😥
The ONNX model only returned a pointer to the first value of the output, which was stored contiguously in memory. To make processing the output easier, I decided to copy the values into a 2D array. I declared my 2D array as float probs[92][100]; while my output dimension was actually 100 x 92 :P. That meant that even though the raw output from the model looked right, the results after processing were wildly different.
Having fixed that, I had to reverse-engineer the postprocessing steps and implement them in C++. The first step is to apply the softmax function to each of the 100 rows. Softmax is just a function that normalizes a vector of values into probabilities: divide the exponential of each value in the vector by the sum of exponentials taken over the entire vector. This gives us the probability of each class (column) in each detection (row). Next, we ignore the last column, and then find the maximum probability and its index in each row. If the maximum probability exceeds the confidence threshold we set (0.75 in my case), we can use the index to find the corresponding class label, and that's our detected object!
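The steps above can be sketched in C++ like this. The sizes (100 detections x 92 classes) and the 0.75 threshold come from the post; the function names and the `output` pointer (standing in for the raw ONNX Runtime output) are illustrative, not the repo's actual code:

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

constexpr int kQueries = 100;          // detections (rows)
constexpr int kClasses = 92;           // class logits per detection (columns)
constexpr float kConfThreshold = 0.75f;

// Softmax over one row of logits, written into probs.
void softmaxRow(const float* logits, float* probs) {
    // Subtracting the row max keeps exp() from overflowing; it cancels out.
    float maxLogit = *std::max_element(logits, logits + kClasses);
    float sum = 0.0f;
    for (int c = 0; c < kClasses; ++c) {
        probs[c] = std::exp(logits[c] - maxLogit);
        sum += probs[c];
    }
    for (int c = 0; c < kClasses; ++c) probs[c] /= sum;
}

// Returns (detection index, class index) pairs above the threshold.
std::vector<std::pair<int, int>> postprocess(const float* output) {
    std::vector<std::pair<int, int>> detections;
    float probs[kClasses];
    for (int q = 0; q < kQueries; ++q) {
        // Row-major layout: row q starts at output + q * kClasses (100 x 92,
        // not 92 x 100 -- exactly the indexing bug described above).
        softmaxRow(output + q * kClasses, probs);
        // Ignore the last column (the "no object" class) when taking the max.
        int best = static_cast<int>(
            std::max_element(probs, probs + kClasses - 1) - probs);
        if (probs[best] > kConfThreshold) detections.emplace_back(q, best);
    }
    return detections;
}
```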
You can have a look at my work here.
Next, I did the same with YOLOv3-Darknet here. This was much easier, since there is barely any pre/post-processing to do for this model, and it doesn't depend on ONNX Runtime! The Darknet model's config file and weights can be loaded directly using OpenCV's DNN module, which is a godsend!
There was, however, one small issue. The bounding boxes generated by the DETR model I used were shifted to the left of the objects they were supposed to enclose. I first assumed that this was because I was resizing my images to 256 x 256 before passing them through the model (ONNX Runtime works better with fixed-size inputs), unlike the original DETR pipeline, which resizes the smallest dimension to 800 and scales the other dimension to match. With some testing, I found out that the DETR model is highly sensitive to the image size. Not only were the bounding boxes off when working with images of lower resolution, but even the identified objects were very different. One obvious fix to this issue is to use an ONNX model that accepts dynamically sized inputs. That way, I could use images of any size and get the best possible output. Unfortunately, there are some unresolved issues with exporting/importing an ONNX model with dynamic input/output shapes from PyTorch. It wasn't a really big issue, so we decided to let it be.
The next phase was to convert these "console applications" to DLLs that could be used from Python code. The file structure of a DLL project in Visual Studio was very confusing to me, but I eventually got the hang of it after a lot of help from my mentor! The idea is to separate the C and C++ code and export the functions you want to expose. I was able to do this without so much as a hitch and had my DLL ready! You can check out the repo here.
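The export pattern looks roughly like this. The function name and the hard-coded scores are made up purely to keep the example self-contained; the real repo exposes its own API:

```cpp
// The C++ implementation stays internal to the DLL...
namespace {
// Stand-in for the real detection logic.
const float kScores[] = {0.9f, 0.5f, 0.8f};

int countAboveImpl(float threshold) {
    int n = 0;
    for (float s : kScores)
        if (s > threshold) ++n;
    return n;
}
}  // namespace

// ...while the exported surface is plain C: extern "C" disables C++ name
// mangling so ctypes can find the symbol by name, and __declspec(dllexport)
// puts it in the DLL's export table on Windows.
#ifdef _WIN32
  #define DLL_EXPORT extern "C" __declspec(dllexport)
#else
  #define DLL_EXPORT extern "C"
#endif

DLL_EXPORT int count_detections(float threshold) {
    return countAboveImpl(threshold);
}
```

Keeping the exported signatures to plain C types (int, float, pointers) is what makes them callable from Python without any extra glue.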
It was fairly easy to load and use the DLL from a Python script with ctypes, using the command:
lib = CDLL("C:/Users/dell/source/repos/YOLOv3-DLL/Release/YOLOv3-DLL.dll")
However, you must make sure to load all the DLLs your DLL depends on first. In my case, I had to load the OpenCV DLLs opencv_core430.dll, opencv_imgproc430.dll, opencv_imgcodecs430.dll, and opencv_dnn430.dll, in that particular order. (Note: 430 is just the OpenCV version.) After this, you register your DLL function's return type and argument types, and then you can call the functions!