This gist documents the Google Summer of Code project. It is no longer updated and therefore does not reflect the current status of the CUDA backend.
For updated details, please see this gist.
Student: Yashas Samaga B L
Mentor: Davis King
Project Link: https://summerofcode.withgoogle.com/projects/#6021087400296448
Relevant PRs:
OpenCV’s DNN module offers fast inference on CPUs. It supports inference on GPUs using OpenCL but lacks a CUDA backend. NVIDIA GPUs support OpenCL, but OpenCL does not expose their full capabilities.
This project adds a new CUDA backend that can perform lightning fast inference on NVIDIA GPUs.
The CUDA backend requires the CUDA Toolkit and cuDNN (7.5.0 or newer) to be installed on the system. The CMake scripts will automatically detect the dependencies when the following options are set:
- `WITH_CUDA`
- `WITH_CUDNN`
The CUDA backend is enabled by setting the following option:
- `OPENCV_DNN_CUDA`
After building, run `[build dir]/bin/opencv_test_dnn` and `[build dir]/bin/opencv_perf_dnn`.
The project adds the following new backends and targets to the existing list.
Backend | Target |
---|---|
DNN_BACKEND_CUDA | DNN_TARGET_CUDA |
DNN_BACKEND_CUDA | DNN_TARGET_CUDA_FP16 |
The CUDA backend uses OpenCV's CPU backend as a fallback for unsupported layers and partially supported layers with unsupported configurations.
Symbol | Meaning |
---|---|
✔️ | fully supported |
🔵 | partially supported |
❌ | unsupported |
Layer | Status |
---|---|
Activations | ✔️ |
Batch Normalization | ✔️ |
Blank Layer | ✔️ |
Concat Layer | ✔️ |
Const Layer | ✔️ |
Convolution 2d | ✔️ |
Convolution 3d | ✔️ |
Crop and resize | ❌ |
Crop Layer | ✔️ |
Detection Output Layer | ❌ |
Deconvolution 2d | 🔵 (most configurations supported) |
Deconvolution 3d | 🔵 (most configurations supported) |
Elementwise Layers | ✔️ |
Eltwise Layer | ✔️ |
Flatten Layer | ✔️ |
Fully Connected Layer | ✔️ |
Input Layer | ❌ |
Interp Layer | ✔️ |
Local Response Normalization | ✔️ |
Max Unpooling 2d | ✔️ |
Max Unpooling 3d | ✔️ |
MVN Layer | ❌ |
Normalize Layer | 🔵 (L1 and L2 supported) |
Padding Layer | ✔️ |
Permute Layer | ✔️ |
Pooling 2d | 🔵 (max and average supported) |
Pooling 3d | 🔵 (max and average supported) |
Prior Box Layer | ✔️ |
Proposal Layer | ❌ |
Region Layer | ✔️ |
Reorg Layer | ✔️ |
Reshape Layer | ✔️ |
Resize Layer | ✔️ |
Scale Layer | ✔️ |
Shift Layer | ✔️ |
Shuffle Channel Layer | ✔️ |
Slice Layer | ✔️ |
Softmax Layer | ✔️ |
Split Layer | ✔️ |
LSTM Layer | ❌ |
CPU: i7 7700HQ
GPU: NVIDIA GTX 1050 Mobile
CPU BLAS Library: MKL 2019.0.4
CUDA Version: 10.1
cuDNN: 7.6.2
Warmup Runs: 3 (forward pass is performed three times before benchmarks)
Benchmark Runs: 10 (the average of ten forward passes is reported)
Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596
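The warmup and averaging procedure described above can be sketched roughly as follows. This is not the linked test code; the function name `benchmark` and its parameters are assumptions made for illustration, and `forward` stands for any callable that performs one forward pass.

```python
import time


def benchmark(forward, warmup_runs=3, benchmark_runs=10):
    """Return the mean wall-clock time of forward() in milliseconds."""
    # Warmup passes are executed but not timed.
    for _ in range(warmup_runs):
        forward()
    # Time the benchmark passes together and report their average.
    start = time.perf_counter()
    for _ in range(benchmark_runs):
        forward()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / benchmark_runs
```

With a `cv2.dnn.Net` named `net`, a call might look like `benchmark(lambda: net.forward())`.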
Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
---|---|---|---|
GoogLeNet | 7.2447ms | 10.4981ms | 17.9176ms |
DenseNet121 | 12.6324ms | 19.1823ms | 48.0628ms |
EAST Text Detection | 18.8281ms | 49.0508ms | 88.9429ms |
ENet | 11.5014ms | Exception | 62.5854ms |
FastNeuralStyle StaryNight | 27.498ms | 178.309ms | 160.359ms |
Inception 5h | 7.8546ms | 22.2789ms | 20.3255ms |
Inception v2 FasterRCNN | 112.736ms | Exception | 374.26ms |
MobileNet SSD | 58.4751ms | 9.2896ms | 27.3061ms |
OpenCV Face Detector | 6.9831ms | 8.3981ms | 17.6683ms |
OpenPose Pose MPI | 160.561ms | 509.446ms | 838.161ms |
Resnet 50 | 11.3603ms | 28.1529ms | 50.2752ms |
SqueezeNet | 2.4084ms | 3.2918ms | 5.476ms |
VGG16 SSD | 70.4117ms | 249.725ms | 360.207ms |
Yolo v3 | 57.9822ms | 214.629ms | 296.806ms |
Yolo v2 | 51.5784ms | 193.453ms | 260.19ms |
Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
---|---|---|---|
GoogLeNet | 35.7556ms | 108.946ms | 225.928ms |
DenseNet121 | 74.9241ms | 295.105ms | 650.924ms |
EAST Text Detection | 149.58ms | 536.946ms | 1273.93ms |
FastNeuralStyle StaryNight | 283.173ms | 1966.5ms | 2175.3ms |
Inception 5h | 36.6225ms | 180.429ms | 233.276ms |
MobileNet SSD | 277.753ms | 111.872ms | 316.063ms |
OpenCV Face Detector | 52.4366ms | 95.7866ms | 202.657ms |
OpenPose Pose MPI | 628.617ms | 5650.05ms | 10683.5ms |
Resnet 50 | 74.283ms | 230.817ms | 541.308ms |
SqueezeNet | 15.8144ms | 35.4915ms | 69.4122ms |
VGG16 SSD | 594.286ms | 2796.23ms | 4661.51ms |
Yolo v3 | 488.704ms | 2419.8ms | 4209.74ms |
Yolo v2 | 491.414ms | 2185.47ms | 3788.34ms |
CPU: 2x Intel Xeon E5-2640 v4
GPU: 1x NVIDIA GTX 1080 Ti (11 GB)
CPU BLAS Library: OpenBLAS 0.2.20
CUDA Version: 10.0
cuDNN: 7.6.2
Warmup Runs: 3 (forward pass is performed three times before benchmarks)
Benchmark Runs: 10 (the average of ten forward passes is reported)
Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596
Model | CUDA FP32 | OpenCV CPU |
---|---|---|
GoogLeNet | 4.8824ms | 14.2981ms |
DenseNet121 | 6.4555ms | 57.8244ms |
EAST Text Detection | 5.901ms | 67.4301ms |
ENet | 4.5979ms | 30.2767ms |
FastNeuralStyle StaryNight | 5.3193ms | 51.3313ms |
Inception 5h | 4.9487ms | 16.0048ms |
Inception v2 FasterRCNN | 82.0298ms | 179.245ms |
MobileNet SSD | 70.9177ms | 23.9348ms |
OpenCV Face Detector | 4.9288ms | 15.4205ms |
OpenPose Pose MPI | 30.5954ms | 246.747ms |
Resnet 50 | 4.5968ms | 45.1153ms |
SqueezeNet | 1.0888ms | 3.6492ms |
VGG16 SSD | 23.5926ms | 194.976ms |
Yolo v3 | 18.0002ms | 141.861ms |
Yolo v2 | 12.1279ms | 111.642ms |
Model | CUDA FP32 | OpenCV CPU |
---|---|---|
GoogLeNet | 10.149ms | 75.9591ms |
DenseNet121 | 20.269ms | 312.426ms |
EAST Text Detection | 32.1556ms | 402.16ms |
FastNeuralStyle StaryNight | 49.1025ms | 461.095ms |
Inception 5h | 9.9721ms | 67.9308ms |
MobileNet SSD | 96.2898ms | 110.783ms |
OpenCV Face Detector | 22.7501ms | 77.8742ms |
OpenPose Pose MPI | 118.858ms | 2321.89ms |
Resnet 50 | 18.4139ms | 229.599ms |
SqueezeNet | 4.4893ms | 22.3049ms |
VGG16 SSD | 194.181ms | 1319.67ms |
Yolo v3 | 122.603ms | 1044.11ms |
Yolo v2 | 104.072ms | 819.177ms |
Model | CUDA FP32 | OpenCV CPU |
---|---|---|
GoogLeNet | 90.3755ms | 775.769ms |
DenseNet121 | 199.516ms | 3536.38ms |
EAST Text Detection | 376.458ms | 7685.72ms |
FastNeuralStyle StaryNight | 801.778ms | 6607.15ms |
Inception 5h | 93.4188ms | 771.575ms |
MobileNet SSD | 1028.93ms | 1110.37ms |
OpenCV Face Detector | 276.992ms | 977.997ms |
OpenPose Pose MPI | 1279.26ms | 32159.3ms |
Resnet 50 | 200.789ms | 1719.92ms |
SqueezeNet | 55.6244ms | 255.397ms |
VGG16 SSD | 2969.05ms | 17201ms |
Yolo v3 | 1564.78ms | 13699.2ms |
Yolo v2 | 1362.84ms | 11254.9ms |
Model | batch size = 1 | batch size = 10 | batch size = 128 |
---|---|---|---|
GoogLeNet | 204 | 985 | 1416 |
DenseNet121 | 154 | 493 | 641 |
EAST Text Detection | 169 | 311 | 340 |
ENet | 217 | Not Applicable | Not Applicable |
FastNeuralStyle StaryNight | 188 | 204 | 160 |
Inception 5h | 202 | 1002 | 1370 |
Inception v2 FasterRCNN | 12 | Not Applicable | Not Applicable |
MobileNet SSD | 14 | 104 | 124 |
OpenCV Face Detector | 202 | 440 | 462 |
OpenPose Pose MPI | 33 | 84 | 100 |
Resnet 50 | 217 | 540 | 637 |
SqueezeNet | 918 | 2228 | 2301 |
VGG16 SSD | 42 | 52 | 43 |
Yolo v3 | 55 | 82 | 81 |
Yolo v2 | 82 | 96 | 93 |
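The figures in this table appear to be inputs processed per second, derived from the forward-pass times in the GTX 1080 Ti tables. The sketch below checks that reading for GoogLeNet. It assumes the three timing tables correspond to batch sizes 1, 10 and 128 in order, which this text does not state explicitly; the function name `throughput` is hypothetical.

```python
def throughput(batch_size, latency_ms):
    """Inputs processed per second for one forward pass over a batch."""
    return batch_size * 1000.0 / latency_ms


# GoogLeNet CUDA FP32 timings from the GTX 1080 Ti tables, keyed by the
# assumed batch size of each table.
googlenet_ms = {1: 4.8824, 10: 10.149, 128: 90.3755}

rates = {batch: int(throughput(batch, ms)) for batch, ms in googlenet_ms.items()}
print(rates)  # {1: 204, 10: 985, 128: 1416}
```

The truncated values match the GoogLeNet row of the table, which supports the assumed pairing of tables and batch sizes.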
GPU: NVIDIA GTX 1080 Ti (11 GB)
Model | OpenCV CUDA | TensorFlow |
---|---|---|
ResNet-50 | 4.5968ms | 7.1163ms |
EAST Text Detection | 5.901ms | 8.6890ms |
Model | OpenCV CUDA | TensorFlow |
---|---|---|
ResNet-50 | 18.4139ms | 22.3665ms |
EAST Text Detection | 32.1556ms | 39.4857ms |
Model | OpenCV CUDA | TensorFlow |
---|---|---|
ResNet-50 | 200.789ms | 216.3923ms |
EAST Text Detection | 376.458ms | 421.8292ms |
The benchmarks I posted are for `MobileNetSSD_deploy.prototxt`/`MobileNetSSD_deploy.caffemodel`, which you can find here.
MobileNet is slow with the CUDA backend because of depthwise convolutions. The CUDA backend relies entirely on cuDNN for convolutions, and cuDNN performs very poorly on depthwise convolutions.
Has this poor depthwise convolution performance been fixed in cuDNN by now? I just tested a DNN model with depthwise convolutions on a Jetson Nano B01 with CUDA 10.2 and cuDNN 8.2.1. The results were slower than on the Jetson Nano's SoC (ARM-based, 4-core, 1.5 GHz) and much slower than on the Raspberry Pi 4B's SoC (4-core, 1.5 GHz).
I also tested a model without depthwise convolutions, which seems to meet expectations (~10x faster than the Jetson Nano SoC).
Do you have an academic paper for these results? I would like to cite it.
@Algabri No, there is no paper that is specific to this project.
@YashasSamaga, thanks for your reply.
@matt-sharp https://gist.github.com/YashasSamaga/e2b19a6807a13046e399f4bc3cca3a49