
@YashasSamaga
Last active September 22, 2024 12:22
GSoC 2019 | OpenCV | Adding a CUDA backend to the DNN module

DISCLAIMER

This gist documents the Google Summer of Code project. It is no longer updated and hence does not reflect the current status of the CUDA backend.

For updated details, please see this gist.

Allow OpenCV's DNN module to work with GPUs

Student: Yashas Samaga B L

Mentor: Davis King

Project Link: https://summerofcode.withgoogle.com/projects/#6021087400296448

Relevant PRs:

Introduction

OpenCV's DNN module offers blazing fast inference on CPUs. It supports inference on GPUs through OpenCL but lacks a CUDA backend. NVIDIA GPUs do support OpenCL, but the OpenCL backend cannot make full use of their capabilities.

This project adds a new CUDA backend that can perform lightning fast inference on NVIDIA GPUs.

How to use?

Build

The CUDA backend requires the CUDA Toolkit and cuDNN (minimum version 7.5.0) to be installed on the system. The CMake scripts automatically detect the dependencies when the following options are set:

  • WITH_CUDA
  • WITH_CUDNN

The CUDA backend is enabled by setting the following option:

  • OPENCV_DNN_CUDA

After building, run [build dir]/bin/opencv_test_dnn and [build dir]/bin/opencv_perf_dnn.
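Putting the options above together, a typical configure-and-build invocation looks like the following sketch (the source and build directory layout is illustrative; adjust paths and the job count for your setup):

```shell
# Configure an out-of-tree Release build with the CUDA backend enabled.
# Run from an empty build directory next to the opencv source checkout.
cmake -D CMAKE_BUILD_TYPE=Release \
      -D WITH_CUDA=ON \
      -D WITH_CUDNN=ON \
      -D OPENCV_DNN_CUDA=ON \
      ../opencv

# Build using 8 parallel jobs.
cmake --build . -j8
```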

Usage

The project adds the following backend and targets to the existing list.

| Backend | Target |
| ------- | ------ |
| DNN_BACKEND_CUDA | DNN_TARGET_CUDA |
| DNN_BACKEND_CUDA | DNN_TARGET_CUDA_FP16 |

Support Matrix

The CUDA backend uses OpenCV's CPU backend as a fallback for unsupported layers and partially supported layers with unsupported configurations.

| Symbol | Meaning |
| ------ | ------- |
| ✔️ | fully supported |
| 🔵 | partially supported |
| *(blank)* | unsupported |

| Layer | Status |
| ----- | ------ |
| Activations | ✔️ |
| Batch Normalization | ✔️ |
| Blank Layer | ✔️ |
| Concat Layer | ✔️ |
| Const Layer | ✔️ |
| Convolution 2d | ✔️ |
| Convolution 3d | ✔️ |
| Crop and resize | |
| Crop Layer | ✔️ |
| Detection Output Layer | |
| Deconvolution 2d | 🔵 (most configurations supported) |
| Deconvolution 3d | 🔵 (most configurations supported) |
| Elementwise Layers | ✔️ |
| Eltwise Layer | ✔️ |
| Flatten Layer | ✔️ |
| Fully Connected Layer | ✔️ |
| Input Layer | |
| Interp Layer | ✔️ |
| Local Response Normalization | ✔️ |
| Max Unpooling 2d | ✔️ |
| Max Unpooling 3d | ✔️ |
| MVN Layer | |
| Normalize Layer | 🔵 (L1 and L2 supported) |
| Padding Layer | ✔️ |
| Permute Layer | ✔️ |
| Pooling 2d | 🔵 (max and average supported) |
| Pooling 3d | 🔵 (max and average supported) |
| Prior Box Layer | ✔️ |
| Proposal Layer | |
| Region Layer | ✔️ |
| Reorg Layer | ✔️ |
| Reshape Layer | ✔️ |
| Resize Layer | ✔️ |
| Scale Layer | ✔️ |
| Shift Layer | ✔️ |
| Shuffle Channel Layer | ✔️ |
| Slice Layer | ✔️ |
| Softmax Layer | ✔️ |
| Split Layer | ✔️ |
| LSTM Layer | |

OpenCV CPU vs Inference Engine CPU vs CUDA

CPU: i7 7700HQ

GPU: NVIDIA GTX 1050 Mobile

CPU BLAS Library: MKL 2019.0.4

CUDA Version: 10.1

cuDNN: 7.6.2

Warmup Runs: 3 (forward pass is performed three times before benchmarks)

Benchmark Runs: 10 (the average of ten forward passes is reported)

Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596
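The warmup/averaging methodology described above can be sketched as follows; `forward` here is a hypothetical stand-in for a call such as `net.forward()`, not the gist's actual test code:

```python
import time

def benchmark(forward, warmup_runs=3, benchmark_runs=10):
    """Time a forward-pass callable: discard warmup runs, report the mean."""
    # Warmup passes are excluded from timing (caches, cuDNN autotuning, etc.).
    for _ in range(warmup_runs):
        forward()
    start = time.perf_counter()
    for _ in range(benchmark_runs):
        forward()
    # Average time per forward pass, in milliseconds.
    return (time.perf_counter() - start) / benchmark_runs * 1000.0
```

For example, `benchmark(lambda: net.forward())` would report the mean latency of ten timed forward passes after three warmup passes.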

Batch Size = 1

| Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
| ----- | --------- | -------------------- | ---------- |
| GoogLeNet | 7.2447ms | 10.4981ms | 17.9176ms |
| DenseNet121 | 12.6324ms | 19.1823ms | 48.0628ms |
| EAST Text Detection | 18.8281ms | 49.0508ms | 88.9429ms |
| ENet | 11.5014ms | Exception | 62.5854ms |
| FastNeuralStyle StaryNight | 27.498ms | 178.309ms | 160.359ms |
| Inception 5h | 7.8546ms | 22.2789ms | 20.3255ms |
| Inception v2 FasterRCNN | 112.736ms | Exception | 374.26ms |
| MobileNet SSD | 58.4751ms | 9.2896ms | 27.3061ms |
| OpenCV Face Detector | 6.9831ms | 8.3981ms | 17.6683ms |
| OpenPose Pose MPI | 160.561ms | 509.446ms | 838.161ms |
| Resnet 50 | 11.3603ms | 28.1529ms | 50.2752ms |
| SqueezeNet | 2.4084ms | 3.2918ms | 5.476ms |
| VGG16 SSD | 70.4117ms | 249.725ms | 360.207ms |
| Yolo v3 | 57.9822ms | 214.629ms | 296.806ms |
| Yolo v2 | 51.5784ms | 193.453ms | 260.19ms |

Batch Size = 10

| Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
| ----- | --------- | -------------------- | ---------- |
| GoogLeNet | 35.7556ms | 108.946ms | 225.928ms |
| DenseNet121 | 74.9241ms | 295.105ms | 650.924ms |
| EAST Text Detection | 149.58ms | 536.946ms | 1273.93ms |
| FastNeuralStyle StaryNight | 283.173ms | 1966.5ms | 2175.3ms |
| Inception 5h | 36.6225ms | 180.429ms | 233.276ms |
| MobileNet SSD | 277.753ms | 111.872ms | 316.063ms |
| OpenCV Face Detector | 52.4366ms | 95.7866ms | 202.657ms |
| OpenPose Pose MPI | 628.617ms | 5650.05ms | 10683.5ms |
| Resnet 50 | 74.283ms | 230.817ms | 541.308ms |
| SqueezeNet | 15.8144ms | 35.4915ms | 69.4122ms |
| VGG16 SSD | 594.286ms | 2796.23ms | 4661.51ms |
| Yolo v3 | 488.704ms | 2419.8ms | 4209.74ms |
| Yolo v2 | 491.414ms | 2185.47ms | 3788.34ms |

OpenCV CUDA vs OpenCV CPU

CPU: 2x Intel Xeon E5-2640 v4

GPU: 1x NVIDIA GTX 1080 Ti (11 GB)

CPU BLAS Library: OpenBLAS 0.2.20

CUDA Version: 10.0

cuDNN: 7.6.2

Warmup Runs: 3 (forward pass is performed three times before benchmarks)

Benchmark Runs: 10 (the average of ten forward passes is reported)

Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596

Backend Comparison

Batch Size = 1

| Model | CUDA FP32 | OpenCV CPU |
| ----- | --------- | ---------- |
| GoogLeNet | 4.8824ms | 14.2981ms |
| DenseNet121 | 6.4555ms | 57.8244ms |
| EAST Text Detection | 5.901ms | 67.4301ms |
| ENet | 4.5979ms | 30.2767ms |
| FastNeuralStyle StaryNight | 5.3193ms | 51.3313ms |
| Inception 5h | 4.9487ms | 16.0048ms |
| Inception v2 FasterRCNN | 82.0298ms | 179.245ms |
| MobileNet SSD | 70.9177ms | 23.9348ms |
| OpenCV Face Detector | 4.9288ms | 15.4205ms |
| OpenPose Pose MPI | 30.5954ms | 246.747ms |
| Resnet 50 | 4.5968ms | 45.1153ms |
| SqueezeNet | 1.0888ms | 3.6492ms |
| VGG16 SSD | 23.5926ms | 194.976ms |
| Yolo v3 | 18.0002ms | 141.861ms |
| Yolo v2 | 12.1279ms | 111.642ms |

Batch Size = 10

| Model | CUDA FP32 | OpenCV CPU |
| ----- | --------- | ---------- |
| GoogLeNet | 10.149ms | 75.9591ms |
| DenseNet121 | 20.269ms | 312.426ms |
| EAST Text Detection | 32.1556ms | 402.16ms |
| FastNeuralStyle StaryNight | 49.1025ms | 461.095ms |
| Inception 5h | 9.9721ms | 67.9308ms |
| MobileNet SSD | 96.2898ms | 110.783ms |
| OpenCV Face Detector | 22.7501ms | 77.8742ms |
| OpenPose Pose MPI | 118.858ms | 2321.89ms |
| Resnet 50 | 18.4139ms | 229.599ms |
| SqueezeNet | 4.4893ms | 22.3049ms |
| VGG16 SSD | 194.181ms | 1319.67ms |
| Yolo v3 | 122.603ms | 1044.11ms |
| Yolo v2 | 104.072ms | 819.177ms |

Batch Size = 128

| Model | CUDA FP32 | OpenCV CPU |
| ----- | --------- | ---------- |
| GoogLeNet | 90.3755ms | 775.769ms |
| DenseNet121 | 199.516ms | 3536.38ms |
| EAST Text Detection | 376.458ms | 7685.72ms |
| FastNeuralStyle StaryNight | 801.778ms | 6607.15ms |
| Inception 5h | 93.4188ms | 771.575ms |
| MobileNet SSD | 1028.93ms | 1110.37ms |
| OpenCV Face Detector | 276.992ms | 977.997ms |
| OpenPose Pose MPI | 1279.26ms | 32159.3ms |
| Resnet 50 | 200.789ms | 1719.92ms |
| SqueezeNet | 55.6244ms | 255.397ms |
| VGG16 SSD | 2969.05ms | 17201ms |
| Yolo v3 | 1564.78ms | 13699.2ms |
| Yolo v2 | 1362.84ms | 11254.9ms |

Images processed per second (CUDA FP32)

| Model | batch size = 1 | batch size = 10 | batch size = 128 |
| ----- | -------------- | --------------- | ---------------- |
| GoogLeNet | 204 | 985 | 1416 |
| DenseNet121 | 154 | 493 | 641 |
| EAST Text Detection | 169 | 311 | 340 |
| ENet | 217 | Not Applicable | Not Applicable |
| FastNeuralStyle StaryNight | 188 | 204 | 160 |
| Inception 5h | 202 | 1002 | 1370 |
| Inception v2 FasterRCNN | 12 | Not Applicable | Not Applicable |
| MobileNet SSD | 14 | 104 | 124 |
| OpenCV Face Detector | 202 | 440 | 462 |
| OpenPose Pose MPI | 33 | 84 | 100 |
| Resnet 50 | 217 | 540 | 637 |
| SqueezeNet | 918 | 2228 | 2301 |
| VGG16 SSD | 42 | 52 | 43 |
| Yolo v3 | 55 | 82 | 81 |
| Yolo v2 | 82 | 96 | 93 |

OpenCV CUDA vs TensorFlow

GPU: NVIDIA GTX 1080 Ti (11 GB)

Batch of 1

| Model | OpenCV CUDA | TensorFlow |
| ----- | ----------- | ---------- |
| ResNet-50 | 4.5968ms | 7.1163ms |
| EAST Text Detection | 5.901ms | 8.6890ms |

Batch of 10

| Model | OpenCV CUDA | TensorFlow |
| ----- | ----------- | ---------- |
| ResNet-50 | 18.4139ms | 22.3665ms |
| EAST Text Detection | 32.1556ms | 39.4857ms |

Batch of 128

| Model | OpenCV CUDA | TensorFlow |
| ----- | ----------- | ---------- |
| ResNet-50 | 200.789ms | 216.3923ms |
| EAST Text Detection | 376.458ms | 421.8292ms |
@fengyuentau

> The benchmarks I posted are for MobileNetSSD_deploy.prototxt/MobileNetSSD_deploy.caffemodel which you can find here.
>
> MobileNet is slow with the CUDA backend because of depthwise convolutions. The CUDA backend fully relies on cuDNN for convolutions, and cuDNN is very bad at depthwise convolutions.

Has the poor depthwise convolution performance in cuDNN been fixed now? I just tested a DNN model with depthwise convolutions on a Jetson Nano B01 with CUDA 10.2 and cuDNN 8.2.1. The results turned out to be slower than on the Jetson Nano's SoC (ARM-based, 4-core, 1.5 GHz), and much slower than on the Raspberry Pi 4B's SoC (4-core, 1.5 GHz).

I also tested a model without depthwise convolutions, which seems to meet expectations (roughly 10x faster than on the Jetson Nano SoC).

@Algabri

Algabri commented Mar 10, 2022

Do you have an academic paper for these results? I would like to cite it.

@YashasSamaga

@Algabri No, there is no paper that is specific to this project.

@Algabri

Algabri commented Mar 11, 2022

@YashasSamaga, thanks for your reply.