YashasSamaga/A0_NOTICE.md

## A0_NOTICE.md

      
### DISCLAIMER

This gist documents the Google Summer of Code project. It is no longer updated and does not reflect the current status of the CUDA backend.
For updated details, please see this gist.

  
## D0_Summary.md

      
### Allow OpenCV's DNN module to work with GPUs

- Student: Yashas Samaga B L
- Mentor: Davis King
- Project Link: https://summerofcode.withgoogle.com/projects/#6021087400296448
- Relevant PRs:
  - add cuDNN dependency and setup build for cuda4dnn
  - add CUDA backend to the DNN module

### Introduction

OpenCV's DNN module offers fast inference on CPUs. It can also run inference on GPUs through OpenCL, but it lacks a CUDA backend. NVIDIA GPUs support OpenCL, but OpenCL does not expose their full capabilities.

This project adds a CUDA backend that performs fast inference on NVIDIA GPUs.

### How to use?

#### Build

The CUDA backend requires the CUDA Toolkit and cuDNN (minimum version 7.5.0) to be installed on the system. The CMake scripts detect these dependencies automatically when the following options are set:

- `WITH_CUDA`
- `WITH_CUDNN`

The CUDA backend is enabled by setting the following option:

- `OPENCV_DNN_CUDA`

After building, run `[build dir]/bin/opencv_test_dnn` and `[build dir]/bin/opencv_perf_dnn`.

#### Usage

The project adds the following new backends and targets to the existing list.


| Backend | Target |
|---|---|
| `DNN_BACKEND_CUDA` | `DNN_TARGET_CUDA` |
| `DNN_BACKEND_CUDA` | `DNN_TARGET_CUDA_FP16` |
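
Selecting one of these backend and target pairs from Python might look like the sketch below. It assumes an OpenCV build with the CUDA backend enabled. The helper name `load_cuda_net` and its parameters are hypothetical; `readNet`, `setPreferableBackend`, and `setPreferableTarget` are the DNN module's existing calls.

```python
def load_cuda_net(model_path, config_path="", fp16=False):
    """Load a network and request the CUDA backend (hypothetical helper)."""
    # Imported lazily so the helper can be defined where OpenCV is absent.
    import cv2

    net = cv2.dnn.readNet(model_path, config_path)
    net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
    # Choose between the two targets listed above.
    target = cv2.dnn.DNN_TARGET_CUDA_FP16 if fp16 else cv2.dnn.DNN_TARGET_CUDA
    net.setPreferableTarget(target)
    return net
```

Passing `fp16=True` requests the half-precision target. Whether that helps depends on the GPU, so it is worth benchmarking both targets on the hardware in question.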


## D1_SupportMatrix.md

      
### Support Matrix

The CUDA backend uses OpenCV's CPU backend as a fallback for unsupported layers and partially supported layers with unsupported configurations.


| Symbol | Meaning |
|---|---|
| ✔️ | fully supported |
| 🔵 | partially supported |
| ❌ | unsupported |


| Layer | Status |
|---|---|
| Activations | ✔️ |
| Batch Normalization | ✔️ |
| Blank Layer | ✔️ |
| Concat Layer | ✔️ |
| Const Layer | ✔️ |
| Convolution 2d | ✔️ |
| Convolution 3d | ✔️ |
| Crop and resize | ❌ |
| Crop Layer | ✔️ |
| Detection Output Layer | ❌ |
| Deconvolution 2d | 🔵 (most configurations supported) |
| Deconvolution 3d | 🔵 (most configurations supported) |
| Elementwise Layers | ✔️ |
| Eltwise Layer | ✔️ |
| Flatten Layer | ✔️ |
| Fully Connected Layer | ✔️ |
| Input Layer | ❌ |
| Interp Layer | ✔️ |
| Local Response Normalization | ✔️ |
| Max Unpooling 2d | ✔️ |
| Max Unpooling 3d | ✔️ |
| MVN Layer | ❌ |
| Normalize Layer | 🔵 (L1 and L2 supported) |
| Padding Layer | ✔️ |
| Permute Layer | ✔️ |
| Pooling 2d | 🔵 (max and average supported) |
| Pooling 3d | 🔵 (max and average supported) |
| Prior Box Layer | ✔️ |
| Proposal Layer | ❌ |
| Region Layer | ✔️ |
| Reorg Layer | ✔️ |
| Reshape Layer | ✔️ |
| Resize Layer | ✔️ |
| Scale Layer | ✔️ |
| Shift Layer | ✔️ |
| Shuffle Channel Layer | ✔️ |
| Slice Layer | ✔️ |
| Softmax Layer | ✔️ |
| Split Layer | ✔️ |
| LSTM Layer | ❌ |


## D2_Backend_Comparision.md

      
### OpenCV CPU vs Inference Engine CPU vs CUDA

- CPU: i7 7700HQ
- GPU: NVIDIA GTX 1050 Mobile
- CPU BLAS Library: MKL 2019.0.4
- CUDA Version: 10.1
- cuDNN: 7.6.2
- Warmup Runs: 3 (forward pass is performed three times before benchmarks)
- Benchmark Runs: 10 (the average of ten forward passes is reported)
- Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596
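
The warm-up and averaging protocol described above can be sketched as follows. This is not the linked test code, only a minimal illustration: `benchmark` and its parameter names are hypothetical, and `forward` stands for any zero-argument callable that performs one forward pass.

```python
import time


def benchmark(forward, warmup_runs=3, benchmark_runs=10):
    """Return the mean wall-clock time of forward() in milliseconds."""
    # Warm-up passes are executed but not timed.
    for _ in range(warmup_runs):
        forward()

    timings_ms = []
    for _ in range(benchmark_runs):
        start = time.perf_counter()
        forward()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    # Report the average over the timed passes only.
    return sum(timings_ms) / len(timings_ms)
```

With an OpenCV network, `forward` could be something like `lambda: net.forward()` after the input blob has been set.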
#### Batch Size = 1

| Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
|---|---|---|---|
| GoogLeNet | 7.2447ms | 10.4981ms | 17.9176ms |
| DenseNet121 | 12.6324ms | 19.1823ms | 48.0628ms |
| EAST Text Detection | 18.8281ms | 49.0508ms | 88.9429ms |
| ENet | 11.5014ms | Exception | 62.5854ms |
| FastNeuralStyle StaryNight | 27.498ms | 178.309ms | 160.359ms |
| Inception 5h | 7.8546ms | 22.2789ms | 20.3255ms |
| Inception v2 FasterRCNN | 112.736ms | Exception | 374.26ms |
| MobileNet SSD | 58.4751ms | 9.2896ms | 27.3061ms |
| OpenCV Face Detector | 6.9831ms | 8.3981ms | 17.6683ms |
| OpenPose Pose MPI | 160.561ms | 509.446ms | 838.161ms |
| Resnet 50 | 11.3603ms | 28.1529ms | 50.2752ms |
| SqueezeNet | 2.4084ms | 3.2918ms | 5.476ms |
| VGG16 SSD | 70.4117ms | 249.725ms | 360.207ms |
| Yolo v3 | 57.9822ms | 214.629ms | 296.806ms |
| Yolo v2 | 51.5784ms | 193.453ms | 260.19ms |

#### Batch Size = 10

| Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
|---|---|---|---|
| GoogLeNet | 35.7556ms | 108.946ms | 225.928ms |
| DenseNet121 | 74.9241ms | 295.105ms | 650.924ms |
| EAST Text Detection | 149.58ms | 536.946ms | 1273.93ms |
| FastNeuralStyle StaryNight | 283.173ms | 1966.5ms | 2175.3ms |
| Inception 5h | 36.6225ms | 180.429ms | 233.276ms |
| MobileNet SSD | 277.753ms | 111.872ms | 316.063ms |
| OpenCV Face Detector | 52.4366ms | 95.7866ms | 202.657ms |
| OpenPose Pose MPI | 628.617ms | 5650.05ms | 10683.5ms |
| Resnet 50 | 74.283ms | 230.817ms | 541.308ms |
| SqueezeNet | 15.8144ms | 35.4915ms | 69.4122ms |
| VGG16 SSD | 594.286ms | 2796.23ms | 4661.51ms |
| Yolo v3 | 488.704ms | 2419.8ms | 4209.74ms |
| Yolo v2 | 491.414ms | 2185.47ms | 3788.34ms |

## D3_CUDA_vs_CPU.md

      
### OpenCV CUDA vs OpenCV CPU

- CPU: 2x Intel Xeon E5-2640 v4
- GPU: 1x NVIDIA GTX 1080 Ti (11 GB)
- CPU BLAS Library: OpenBLAS 0.2.20
- CUDA Version: 10.0
- cuDNN: 7.6.2
- Warmup Runs: 3 (forward pass is performed three times before benchmarks)
- Benchmark Runs: 10 (the average of ten forward passes is reported)
- Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596

#### Backend Comparison

##### Batch Size = 1

| Model | CUDA FP32 | OpenCV CPU |
|---|---|---|
| GoogLeNet | 4.8824ms | 14.2981ms |
| DenseNet121 | 6.4555ms | 57.8244ms |
| EAST Text Detection | 5.901ms | 67.4301ms |
| ENet | 4.5979ms | 30.2767ms |
| FastNeuralStyle StaryNight | 5.3193ms | 51.3313ms |
| Inception 5h | 4.9487ms | 16.0048ms |
| Inception v2 FasterRCNN | 82.0298ms | 179.245ms |
| MobileNet SSD | 70.9177ms | 23.9348ms |
| OpenCV Face Detector | 4.9288ms | 15.4205ms |
| OpenPose Pose MPI | 30.5954ms | 246.747ms |
| Resnet 50 | 4.5968ms | 45.1153ms |
| SqueezeNet | 1.0888ms | 3.6492ms |
| VGG16 SSD | 23.5926ms | 194.976ms |
| Yolo v3 | 18.0002ms | 141.861ms |
| Yolo v2 | 12.1279ms | 111.642ms |

##### Batch Size = 10

| Model | CUDA FP32 | OpenCV CPU |
|---|---|---|
| GoogLeNet | 10.149ms | 75.9591ms |
| DenseNet121 | 20.269ms | 312.426ms |
| EAST Text Detection | 32.1556ms | 402.16ms |
| FastNeuralStyle StaryNight | 49.1025ms | 461.095ms |
| Inception 5h | 9.9721ms | 67.9308ms |
| MobileNet SSD | 96.2898ms | 110.783ms |
| OpenCV Face Detector | 22.7501ms | 77.8742ms |
| OpenPose Pose MPI | 118.858ms | 2321.89ms |
| Resnet 50 | 18.4139ms | 229.599ms |
| SqueezeNet | 4.4893ms | 22.3049ms |
| VGG16 SSD | 194.181ms | 1319.67ms |
| Yolo v3 | 122.603ms | 1044.11ms |
| Yolo v2 | 104.072ms | 819.177ms |

##### Batch Size = 128

| Model | CUDA FP32 | OpenCV CPU |
|---|---|---|
| GoogLeNet | 90.3755ms | 775.769ms |
| DenseNet121 | 199.516ms | 3536.38ms |
| EAST Text Detection | 376.458ms | 7685.72ms |
| FastNeuralStyle StaryNight | 801.778ms | 6607.15ms |
| Inception 5h | 93.4188ms | 771.575ms |
| MobileNet SSD | 1028.93ms | 1110.37ms |
| OpenCV Face Detector | 276.992ms | 977.997ms |
| OpenPose Pose MPI | 1279.26ms | 32159.3ms |
| Resnet 50 | 200.789ms | 1719.92ms |
| SqueezeNet | 55.6244ms | 255.397ms |
| VGG16 SSD | 2969.05ms | 17201ms |
| Yolo v3 | 1564.78ms | 13699.2ms |
| Yolo v2 | 1362.84ms | 11254.9ms |

#### Images processed per second (CUDA FP32)

| Model | batch size = 1 | batch size = 10 | batch size = 128 |
|---|---|---|---|
| GoogLeNet | 204 | 985 | 1416 |
| DenseNet121 | 154 | 493 | 641 |
| EAST Text Detection | 169 | 311 | 340 |
| ENet | 217 | Not Applicable | Not Applicable |
| FastNeuralStyle StaryNight | 188 | 204 | 160 |
| Inception 5h | 202 | 1002 | 1370 |
| Inception v2 FasterRCNN | 12 | Not Applicable | Not Applicable |
| MobileNet SSD | 14 | 104 | 124 |
| OpenCV Face Detector | 202 | 440 | 462 |
| OpenPose Pose MPI | 33 | 84 | 100 |
| Resnet 50 | 217 | 540 | 637 |
| SqueezeNet | 918 | 2228 | 2301 |
| VGG16 SSD | 42 | 52 | 43 |
| Yolo v3 | 55 | 82 | 81 |
| Yolo v2 | 82 | 96 | 93 |

## D4_OpenCV_vs_TensorFlow.md

      
### OpenCV CUDA vs TensorFlow

GPU: NVIDIA GTX 1080 Ti (11 GB)
#### Batch of 1

| Model | OpenCV CUDA | TensorFlow |
|---|---|---|
| ResNet-50 | 4.5968ms | 7.1163ms |
| EAST Text Detection | 5.901ms | 8.6890ms |

#### Batch of 10

| Model | OpenCV CUDA | TensorFlow |
|---|---|---|
| ResNet-50 | 18.4139ms | 22.3665ms |
| EAST Text Detection | 32.1556ms | 39.4857ms |

#### Batch of 128

| Model | OpenCV CUDA | TensorFlow |
|---|---|---|
| ResNet-50 | 200.789ms | 216.3923ms |
| EAST Text Detection | 376.458ms | 421.8292ms |