GSoC 2019 | OpenCV | Adding a CUDA backend to the DNN module

DISCLAIMER

This gist documents the Google Summer of Code project. It is no longer updated and hence does not reflect the current status of the CUDA backend.

For updated details, please see this gist.

Allow OpenCV's DNN module to work with GPUs

Student: Yashas Samaga B L

Mentor: Davis King

Project Link: https://summerofcode.withgoogle.com/projects/#6021087400296448

Relevant PRs: opencv/opencv#14827

Introduction

OpenCV's DNN module offers blazing-fast inference on CPUs. It supports inference on GPUs through OpenCL but lacks a CUDA backend. NVIDIA GPUs do support OpenCL, but OpenCL cannot exploit their full capabilities.

This project adds a new CUDA backend that can perform lightning-fast inference on NVIDIA GPUs.

How to use?

Build

The CUDA backend requires the CUDA Toolkit and cuDNN (minimum version: 7.5.0) to be installed on the system. The CMake scripts will automatically detect the dependencies when the following options are set:

  • WITH_CUDA
  • WITH_CUDNN

The CUDA backend is enabled by setting the following option:

  • OPENCV_DNN_CUDA

After building, run [build dir]/bin/opencv_test_dnn and [build dir]/bin/opencv_perf_dnn.

Usage

The project adds the following new backends and targets to the existing list.

| Backend | Target |
| --- | --- |
| DNN_BACKEND_CUDA | DNN_TARGET_CUDA |
| DNN_BACKEND_CUDA | DNN_TARGET_CUDA_FP16 |
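
For illustration, a minimal usage sketch (the model file names below are placeholders; any network that OpenCV DNN can import works the same way):

```cpp
// Minimal sketch: import a network and select the CUDA backend/target.
// "model.weights"/"model.cfg" are placeholder file names.
#include <opencv2/dnn.hpp>
#include <iostream>

int main() {
    auto net = cv::dnn::readNet("model.weights", "model.cfg");

    // request the CUDA backend; on GPUs with fast half-precision
    // support, DNN_TARGET_CUDA_FP16 can be used instead
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    // dummy input, just to demonstrate a forward pass
    cv::Mat image(cv::Size(224, 224), CV_8UC3, cv::Scalar::all(127));
    cv::Mat blob = cv::dnn::blobFromImage(image, 1 / 255.0, cv::Size(224, 224));

    net.setInput(blob);
    cv::Mat out = net.forward();
    std::cout << "output dims: " << out.dims << std::endl;
    return 0;
}
```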

Support Matrix

The CUDA backend uses OpenCV's CPU backend as a fallback for unsupported layers and partially supported layers with unsupported configurations.

| Symbol | Meaning |
| --- | --- |
| ✔️ | fully supported |
| 🔵 | partially supported |
| (blank) | unsupported |

| Layer | Status |
| --- | --- |
| Activations | ✔️ |
| Batch Normalization | ✔️ |
| Blank Layer | ✔️ |
| Concat Layer | ✔️ |
| Const Layer | ✔️ |
| Convolution 2d | ✔️ |
| Convolution 3d | ✔️ |
| Crop and resize | |
| Crop Layer | ✔️ |
| Detection Output Layer | |
| Deconvolution 2d | 🔵 (most configurations supported) |
| Deconvolution 3d | 🔵 (most configurations supported) |
| Elementwise Layers | ✔️ |
| Eltwise Layer | ✔️ |
| Flatten Layer | ✔️ |
| Fully Connected Layer | ✔️ |
| Input Layer | |
| Interp Layer | ✔️ |
| Local Response Normalization | ✔️ |
| Max Unpooling 2d | ✔️ |
| Max Unpooling 3d | ✔️ |
| MVN Layer | |
| Normalize Layer | 🔵 (L1 and L2 supported) |
| Padding Layer | ✔️ |
| Permute Layer | ✔️ |
| Pooling 2d | 🔵 (max and average supported) |
| Pooling 3d | 🔵 (max and average supported) |
| Prior Box Layer | ✔️ |
| Proposal Layer | |
| Region Layer | ✔️ |
| Reorg Layer | ✔️ |
| Reshape Layer | ✔️ |
| Resize Layer | ✔️ |
| Scale Layer | ✔️ |
| Shift Layer | ✔️ |
| Shuffle Channel Layer | ✔️ |
| Slice Layer | ✔️ |
| Softmax Layer | ✔️ |
| Split Layer | ✔️ |
| LSTM Layer | |

OCV CPU vs IE CPU vs CUDA

CPU: i7 7700HQ

GPU: NVIDIA GTX 1050 Mobile

CPU BLAS Library: MKL 2019.0.4

CUDA Version: 10.1

cuDNN: 7.6.2

Warmup Runs: 3 (forward pass is performed three times before benchmarks)

Benchmark Runs: 10 (the average of ten forward passes is reported)

Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596
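
For reference, a minimal sketch of this measurement procedure (the linked gist is the full test code; the model files here are placeholders):

```cpp
// Sketch of the benchmark loop: 3 warmup passes, then the mean of
// 10 timed forward passes. Model file names are placeholders.
#include <opencv2/dnn.hpp>
#include <opencv2/core.hpp>
#include <iostream>

int main() {
    cv::Mat image(cv::Size(224, 224), CV_8UC3, cv::Scalar::all(127));
    cv::Mat blob = cv::dnn::blobFromImage(image, 1 / 255.0, cv::Size(224, 224));

    auto net = cv::dnn::readNet("model.weights", "model.cfg");
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    for (int i = 0; i < 3; i++) {  // warmup runs
        net.setInput(blob);
        net.forward();
    }

    cv::TickMeter tm;
    for (int i = 0; i < 10; i++) {  // benchmark runs
        net.setInput(blob);
        tm.start();
        net.forward();
        tm.stop();
    }
    std::cout << "mean forward time: "
              << tm.getTimeMilli() / tm.getCounter() << " ms\n";
    return 0;
}
```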

Batch Size = 1

| Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
| --- | --- | --- | --- |
| GoogLeNet | 7.2447ms | 10.4981ms | 17.9176ms |
| DenseNet121 | 12.6324ms | 19.1823ms | 48.0628ms |
| EAST Text Detection | 18.8281ms | 49.0508ms | 88.9429ms |
| ENet | 11.5014ms | Exception | 62.5854ms |
| FastNeuralStyle StaryNight | 27.498ms | 178.309ms | 160.359ms |
| Inception 5h | 7.8546ms | 22.2789ms | 20.3255ms |
| Inception v2 FasterRCNN | 112.736ms | Exception | 374.26ms |
| MobileNet SSD | 58.4751ms | 9.2896ms | 27.3061ms |
| OpenCV Face Detector | 6.9831ms | 8.3981ms | 17.6683ms |
| OpenPose Pose MPI | 160.561ms | 509.446ms | 838.161ms |
| Resnet 50 | 11.3603ms | 28.1529ms | 50.2752ms |
| SqueezeNet | 2.4084ms | 3.2918ms | 5.476ms |
| VGG16 SSD | 70.4117ms | 249.725ms | 360.207ms |
| Yolo v3 | 57.9822ms | 214.629ms | 296.806ms |
| Yolo v2 | 51.5784ms | 193.453ms | 260.19ms |

Batch Size = 10

| Model | CUDA FP32 | Inference Engine CPU | OpenCV CPU |
| --- | --- | --- | --- |
| GoogLeNet | 35.7556ms | 108.946ms | 225.928ms |
| DenseNet121 | 74.9241ms | 295.105ms | 650.924ms |
| EAST Text Detection | 149.58ms | 536.946ms | 1273.93ms |
| FastNeuralStyle StaryNight | 283.173ms | 1966.5ms | 2175.3ms |
| Inception 5h | 36.6225ms | 180.429ms | 233.276ms |
| MobileNet SSD | 277.753ms | 111.872ms | 316.063ms |
| OpenCV Face Detector | 52.4366ms | 95.7866ms | 202.657ms |
| OpenPose Pose MPI | 628.617ms | 5650.05ms | 10683.5ms |
| Resnet 50 | 74.283ms | 230.817ms | 541.308ms |
| SqueezeNet | 15.8144ms | 35.4915ms | 69.4122ms |
| VGG16 SSD | 594.286ms | 2796.23ms | 4661.51ms |
| Yolo v3 | 488.704ms | 2419.8ms | 4209.74ms |
| Yolo v2 | 491.414ms | 2185.47ms | 3788.34ms |

OpenCV CUDA vs OpenCV CPU

CPU: 2x Intel Xeon E5-2640 v4

GPU: 1x NVIDIA GTX 1080 Ti (11 GB)

CPU BLAS Library: OpenBLAS 0.2.20

CUDA Version: 10.0

cuDNN: 7.6.2

Warmup Runs: 3 (forward pass is performed three times before benchmarks)

Benchmark Runs: 10 (the average of ten forward passes is reported)

Test Code: https://gist.github.com/YashasSamaga/71157cf0c3768c497e5e70fb95435596

Backend Comparison

Batch Size = 1

| Model | CUDA FP32 | OpenCV CPU |
| --- | --- | --- |
| GoogLeNet | 4.8824ms | 14.2981ms |
| DenseNet121 | 6.4555ms | 57.8244ms |
| EAST Text Detection | 5.901ms | 67.4301ms |
| ENet | 4.5979ms | 30.2767ms |
| FastNeuralStyle StaryNight | 5.3193ms | 51.3313ms |
| Inception 5h | 4.9487ms | 16.0048ms |
| Inception v2 FasterRCNN | 82.0298ms | 179.245ms |
| MobileNet SSD | 70.9177ms | 23.9348ms |
| OpenCV Face Detector | 4.9288ms | 15.4205ms |
| OpenPose Pose MPI | 30.5954ms | 246.747ms |
| Resnet 50 | 4.5968ms | 45.1153ms |
| SqueezeNet | 1.0888ms | 3.6492ms |
| VGG16 SSD | 23.5926ms | 194.976ms |
| Yolo v3 | 18.0002ms | 141.861ms |
| Yolo v2 | 12.1279ms | 111.642ms |

Batch Size = 10

| Model | CUDA FP32 | OpenCV CPU |
| --- | --- | --- |
| GoogLeNet | 10.149ms | 75.9591ms |
| DenseNet121 | 20.269ms | 312.426ms |
| EAST Text Detection | 32.1556ms | 402.16ms |
| FastNeuralStyle StaryNight | 49.1025ms | 461.095ms |
| Inception 5h | 9.9721ms | 67.9308ms |
| MobileNet SSD | 96.2898ms | 110.783ms |
| OpenCV Face Detector | 22.7501ms | 77.8742ms |
| OpenPose Pose MPI | 118.858ms | 2321.89ms |
| Resnet 50 | 18.4139ms | 229.599ms |
| SqueezeNet | 4.4893ms | 22.3049ms |
| VGG16 SSD | 194.181ms | 1319.67ms |
| Yolo v3 | 122.603ms | 1044.11ms |
| Yolo v2 | 104.072ms | 819.177ms |

Batch Size = 128

| Model | CUDA FP32 | OpenCV CPU |
| --- | --- | --- |
| GoogLeNet | 90.3755ms | 775.769ms |
| DenseNet121 | 199.516ms | 3536.38ms |
| EAST Text Detection | 376.458ms | 7685.72ms |
| FastNeuralStyle StaryNight | 801.778ms | 6607.15ms |
| Inception 5h | 93.4188ms | 771.575ms |
| MobileNet SSD | 1028.93ms | 1110.37ms |
| OpenCV Face Detector | 276.992ms | 977.997ms |
| OpenPose Pose MPI | 1279.26ms | 32159.3ms |
| Resnet 50 | 200.789ms | 1719.92ms |
| SqueezeNet | 55.6244ms | 255.397ms |
| VGG16 SSD | 2969.05ms | 17201ms |
| Yolo v3 | 1564.78ms | 13699.2ms |
| Yolo v2 | 1362.84ms | 11254.9ms |

Images processed per second (CUDA FP32)

| Model | batch size = 1 | batch size = 10 | batch size = 128 |
| --- | --- | --- | --- |
| GoogLeNet | 204 | 985 | 1416 |
| DenseNet121 | 154 | 493 | 641 |
| EAST Text Detection | 169 | 311 | 340 |
| ENet | 217 | Not Applicable | Not Applicable |
| FastNeuralStyle StaryNight | 188 | 204 | 160 |
| Inception 5h | 202 | 1002 | 1370 |
| Inception v2 FasterRCNN | 12 | Not Applicable | Not Applicable |
| MobileNet SSD | 14 | 104 | 124 |
| OpenCV Face Detector | 202 | 440 | 462 |
| OpenPose Pose MPI | 33 | 84 | 100 |
| Resnet 50 | 217 | 540 | 637 |
| SqueezeNet | 918 | 2228 | 2301 |
| VGG16 SSD | 42 | 52 | 43 |
| Yolo v3 | 55 | 82 | 81 |
| Yolo v2 | 82 | 96 | 93 |

OpenCV CUDA vs TensorFlow

GPU: NVIDIA GTX 1080 Ti (11 GB)

Batch of 1

| Model | OpenCV CUDA | TensorFlow |
| --- | --- | --- |
| ResNet-50 | 4.5968ms | 7.1163ms |
| EAST Text Detection | 5.901ms | 8.6890ms |

Batch of 10

| Model | OpenCV CUDA | TensorFlow |
| --- | --- | --- |
| ResNet-50 | 18.4139ms | 22.3665ms |
| EAST Text Detection | 32.1556ms | 39.4857ms |

Batch of 128

| Model | OpenCV CUDA | TensorFlow |
| --- | --- | --- |
| ResNet-50 | 200.789ms | 216.3923ms |
| EAST Text Detection | 376.458ms | 421.8292ms |
@Saafke commented Aug 25, 2019

Amazing work! 🙌🎉

@Avrohom commented Sep 12, 2019

Hi @YashasSamaga,

Can it be compiled on Windows 10? I have tried to compile your master branch and I am getting errors:

(error screenshots omitted)

@YashasSamaga commented Sep 12, 2019

@Avrohom Please use opencv/opencv#14827 for further discussions.

This is an issue with CMake. You have to update CUDA_ARCH_BIN option. I had pushed a commit to throw a cmake error for unsupported configuration. Updating your clone might help.

@stephenvidler commented Oct 2, 2019

I've been trying to compile on Windows 10 with VS2015 but get the following errors when compiling activations.cu:

error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getGridDim<0>(void)' is not a specialization of a function template

Any suggestions?

@YashasSamaga commented Oct 2, 2019

@stephenvidler Please use opencv/opencv#14827 for further discussions.

At first sight, it appears to be an MSVC bug. The identified function is a function template, which contradicts what the error reports. Somebody watching the PR might know more about this.

@kuronekodaisuki commented Oct 21, 2019

Great work!

@Hastyrush commented Oct 25, 2019

Nice work!

Any idea why MobileNet SSD performs so much worse on the CUDA backend than on the Inference Engine or the CPU? I presume it has something to do with depthwise separable convolution layers not being supported at the moment?

Thanks a lot!

@YashasSamaga commented Oct 25, 2019

@Hastyrush cuDNN performs depthwise convolutions poorly.

More information is available here: opencv/opencv#14827 (comment)

@phil-ludewig commented Nov 27, 2019

Thanks for your work! How would I implement it with opencv 4.1.1? It's not merged with that branch yet.

@YashasSamaga commented Nov 27, 2019

@phil-ludewig I'm sorry, you can't. You need to use the master branch.

@lorenzolightsgdwarf commented Dec 3, 2019

@YashasSamaga thank you for your work. Unfortunately, I cannot reproduce the benchmark results for MobileNet SSD v2 on my machine.
The average inference time is 180ms with a batch size of 1.
I'm using CUDA 10.2, cuDNN 7.6.3, NVIDIA GTX 1050, CPU i7-7700HQ.
Using the TensorFlow C++ library with CUDA, the inference time is only 30ms.
Do you have any suggestions?
This is my code for the benchmark:

```cpp
#include <iostream>
#include <opencv2/dnn.hpp>

using namespace std;

int main() {
    std::vector<cv::Mat> images;
    cv::Mat image(cv::Size(300, 300), CV_8UC3);
    cv::randu(image, cv::Scalar(0, 0, 0), cv::Scalar(255, 255, 255));
    images.push_back(image);
    cv::Mat blob = cv::dnn::blobFromImages(images, 1.0f, cv::Size(300, 300), 0.0f);
    auto net = cv::dnn::readNet("frozen_inference_graph.pb", "ssd_mobilenet_v2_coco_2018_03_29.pbtxt");
    net.setPreferableBackend(cv::dnn::Backend::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::Target::DNN_TARGET_CUDA);
    // Warm up
    for (int i = 0; i < 5; i++) {
        net.setInput(blob);
        net.forward();
    }
    for (int i = 0; i < 10; i++) {
        cv::TickMeter tic;
        tic.start();
        net.setInput(blob);
        net.forward();
        tic.stop();
        std::cout << tic.getTimeMilli() << std::endl;
    }

    return 0;
}
```

My CMake configuration:

```
General configuration for OpenCV 4.1.2-dev =====================================
  Version control:               4.1.2-238-g78c5e41c23

  Extra modules:
    Location (extra):            C:/Users/Lorenzo Lucignano/opencv_contrib/modules
    Version control (extra):     4.1.2-65-g763a4516

  Platform:
    Timestamp:                   2019-12-02T11:02:35Z
    Host:                        Windows 10.0.18362 AMD64
    CMake:                       3.15.3
    CMake generator:             Visual Studio 15 2017
    CMake build tool:            C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/MSBuild/15.0/Bin/MSBuild.exe
    MSVC:                        1916

  CPU/HW features:
    Baseline:                    SSE SSE2 SSE3
      requested:                 SSE3
    Dispatched code generation:  SSE4_1 SSE4_2 FP16 AVX AVX2 AVX512_SKX
      requested:                 SSE4_1 SSE4_2 AVX FP16 AVX2 AVX512_SKX
      SSE4_1 (13 files):         + SSSE3 SSE4_1
      SSE4_2 (1 files):          + SSSE3 SSE4_1 POPCNT SSE4_2
      FP16 (0 files):            + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 AVX
      AVX (4 files):             + SSSE3 SSE4_1 POPCNT SSE4_2 AVX
      AVX2 (26 files):           + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2
      AVX512_SKX (3 files):      + SSSE3 SSE4_1 POPCNT SSE4_2 FP16 FMA3 AVX AVX2 AVX_512F AVX512_COMMON AVX512_SKX

  C/C++:
    Built as dynamic libs?:      YES
    C++ Compiler:                C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x64/cl.exe  (ver 19.16.27026.1)
    C++ flags (Release):         /DWIN32 /D_WINDOWS /W4 /GR  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /MP8   /MD /O2 /Ob2 /DNDEBUG 
    C++ flags (Debug):           /DWIN32 /D_WINDOWS /W4 /GR  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise     /EHa /wd4127 /wd4251 /wd4324 /wd4275 /wd4512 /wd4589 /MP8   /MDd /Zi /Ob0 /Od /RTC1 
    C Compiler:                  C:/Program Files (x86)/Microsoft Visual Studio/2017/Community/VC/Tools/MSVC/14.16.27023/bin/Hostx86/x64/cl.exe
    C flags (Release):           /DWIN32 /D_WINDOWS /W3  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise       /MP8    /MD /O2 /Ob2 /DNDEBUG 
    C flags (Debug):             /DWIN32 /D_WINDOWS /W3  /D _CRT_SECURE_NO_DEPRECATE /D _CRT_NONSTDC_NO_DEPRECATE /D _SCL_SECURE_NO_WARNINGS /Gy /bigobj /Oi  /fp:precise       /MP8  /MDd /Zi /Ob0 /Od /RTC1 
    Linker flags (Release):      /machine:x64  /INCREMENTAL:NO 
    Linker flags (Debug):        /machine:x64  /debug /INCREMENTAL 
    ccache:                      NO
    Precompiled headers:         YES
    Extra dependencies:          cudart_static.lib nppc.lib nppial.lib nppicc.lib nppicom.lib nppidei.lib nppif.lib nppig.lib nppim.lib nppist.lib nppisu.lib nppitc.lib npps.lib cublas.lib cudnn.lib cufft.lib -LIBPATH:C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.2/lib/x64 -LIBPATH:C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.0/lib/x64
    3rdparty dependencies:

  OpenCV modules:
    To be built:                 calib3d core cudev dnn features2d flann highgui imgcodecs imgproc ml objdetect photo stitching ts video videoio
    Disabled:                    aruco bgsegm bioinspired ccalib cudaarithm cudabgsegm cudacodec cudafeatures2d cudafilters cudaimgproc cudalegacy cudaobjdetect cudaoptflow cudastereo cudawarping datasets dnn_objdetect dnn_superres dpm face fuzzy gapi hfs img_hash line_descriptor optflow phase_unwrapping plot python3 quality reg rgbd saliency shape stereo structured_light superres surface_matching text tracking videostab world xfeatures2d ximgproc xobjdetect xphoto
    Disabled by dependency:      -
    Unavailable:                 cnn_3dobj cvv freetype hdf java js matlab ovis python2 sfm viz
    Applications:                perf_tests
    Documentation:               NO
    Non-free algorithms:         NO

  Windows RT support:            NO

  GUI: 
    Win32 UI:                    YES
    VTK support:                 NO

  Media I/O: 
    ZLib:                        build (ver 1.2.11)
    JPEG:                        build-libjpeg-turbo (ver 2.0.2-62)
    WEBP:                        build (ver encoder: 0x020e)
    PNG:                         build (ver 1.6.37)
    TIFF:                        build (ver 42 - 4.0.10)
    JPEG 2000:                   build (ver 1.900.1)
    OpenEXR:                     build (ver 2.3.0)
    HDR:                         YES
    SUNRASTER:                   YES
    PXM:                         YES
    PFM:                         YES

  Video I/O:
    DC1394:                      NO
    FFMPEG:                      YES (prebuilt binaries)
      avcodec:                   YES (58.54.100)
      avformat:                  YES (58.29.100)
      avutil:                    YES (56.31.100)
      swscale:                   YES (5.5.100)
      avresample:                YES (4.0.0)
    GStreamer:                   NO
    DirectShow:                  YES
    Media Foundation:            YES
      DXVA:                      YES

  Parallel framework:            TBB (ver 2019.0 interface 11008)

  Trace:                         YES (with Intel ITT)

  Other third-party libraries:
    Intel IPP:                   2019.0.0 Gold [2019.0.0]
           at:                   C:/Users/Lorenzo Lucignano/opencv/build/3rdparty/ippicv/ippicv_win/icv
    Intel IPP IW:                sources (2019.0.0)
              at:                C:/Users/Lorenzo Lucignano/opencv/build/3rdparty/ippicv/ippicv_win/iw
    Lapack:                      NO
    Eigen:                       NO
    Custom HAL:                  NO
    Protobuf:                    build (3.5.1)

  NVIDIA CUDA:                   YES (ver 10.2, CUFFT CUBLAS)
    NVIDIA GPU arch:             53 60 61 70 75
    NVIDIA PTX archs:

  cuDNN:                         YES (ver 7.6.3)

  OpenCL:                        YES (NVD3D11)
    Include path:                C:/Users/Lorenzo Lucignano/opencv/3rdparty/include/opencl/1.2
    Link libraries:              Dynamic load

  Python (for build):            C:/Python27/python.exe

  Java:                          
    ant:                         NO
    JNI:                         C:/Program Files/Java/jdk1.8.0_211/include C:/Program Files/Java/jdk1.8.0_211/include/win32 C:/Program Files/Java/jdk1.8.0_211/include
    Java wrappers:               NO
    Java tests:                  NO

  Install to:                    C:/Users/Lorenzo Lucignano/opencv/build/install
-----------------------------------------------------------------
```
@YashasSamaga commented Dec 3, 2019

The benchmarks I posted are for MobileNetSSD_deploy.prototxt/MobileNetSSD_deploy.caffemodel which you can find here.

MobileNet is slow with the CUDA backend because of depthwise convolutions. The CUDA backend fully relies on cuDNN for convolutions and cuDNN is very bad at depthwise convolutions.

@JulienMaille commented Dec 11, 2019

Hi, I spent quite some time trying to build a project hoping to run my network on GPU, but noticed only today that "CUDA backend for DNN module requires CC 5.3 or higher". Unfortunately, I only have a GeForce 970.
What is the main reason for this limitation?
Do I need to integrate the NVIDIA SDK if I want my model to run on older GPUs?

@YashasSamaga commented Dec 12, 2019

@Nefast

The backend uses half-precision intrinsics which aren't available on older GPUs. Currently, there is no way to disable the half-precision target even if you won't be using it. I'll see if it can be made a build option.

@JulienMaille commented Dec 12, 2019

Thanks for the feedback. Supporting older GPUs would be a nice addition, even if performance is worse because they run at full precision.

@YashasSamaga commented Dec 21, 2019

@Nefast The bulk of the changes would be required in the files in dnn/src/cuda. The explicit template instantiations for the __half type would have to be disabled via an #ifdef. The overloads in dnn/src/cuda/math.hpp would also have to be dealt with, and some checks would be needed in dnn.cpp to flag the use of DNN_TARGET_CUDA_FP16 when half precision is disabled.

@JulienMaille commented Dec 21, 2019

@YashasSamaga
I've committed this: JulienMaille/opencv@2622a56
It compiles and I was able to run some inference on my GeForce 960!

I'll clean this up in the next few days; any advice from you will be appreciated.

Right now I limited support to CC 5.2+, but I suppose we can go lower.

@Rasoul20sh commented Dec 23, 2019

> I've been trying to compile on Windows 10 with VS2015 but get the following errors when compiling activations.cu:
>
> error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getGridDim<0>(void)' is not a specialization of a function template
>
> Any suggestions?

I have the same error. Did you solve it?

@YashasSamaga commented Dec 23, 2019

@Rasoul20sh I think this issue is specific to older MSVC compilers. Can you try upgrading to a newer compiler? It builds without errors on VS17 and VS19.

I have got a reproducer: https://godbolt.org/z/CvttCb

I have tried messing around with the code in godbolt. I am not able to find a working solution to get around the bug.

@stephenvidler commented Dec 23, 2019

The template declarations were using C++ features not supported by the 2015 compiler, so I rewrote them to be simpler, as they were only used in a couple of places. I will send you the changes later today, when I can access my computer. Regards, Steve

@dsgh2 commented Dec 24, 2019

Hi,
I have the same error. What should I change in the source code? Please let me know.

I guess that in grid_stride_range.hpp

```cpp
template <int> __device__ auto getGridDim()->decltype(dim3::x);
template <> inline __device__ auto getGridDim<0>()->decltype(dim3::x) { return gridDim.x; }
template <> inline __device__ auto getGridDim<1>()->decltype(dim3::x) { return gridDim.y; }
template <> inline __device__ auto getGridDim<2>()->decltype(dim3::x) { return gridDim.z; }
```

should be changed, but I do not know how.
Thanks in advance.

@YashasSamaga commented Dec 24, 2019

```cpp
#include <cstddef>
#include <utility>

struct dim3 {
    unsigned x, y, z;
};

using ret_type = decltype(dim3::x);

template <int>
ret_type getGridDim();

template <> inline ret_type getGridDim<0>() { return 0; }
```

solves the problem.

@YashasSamaga commented Dec 24, 2019

@dsgh2 @Rasoul20sh Can you try building this branch with a fix?

@dsgh2 commented Dec 24, 2019

Thanks, I will try it

@Rasoul20sh commented Dec 24, 2019

> @dsgh2 @Rasoul20sh Can you try building this branch with a fix?

Yes, of course. I was building it using Visual Studio 2019, but with that version I faced this kind of error:

#error: -- unsupported Microsoft Visual Studio version! Only the versions between 2013 and 2017 (inclusive) are supported! opencv_cudaarithm C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\include\crt\host_config.h 143

@dsgh2 commented Dec 24, 2019

> @dsgh2 @Rasoul20sh Can you try building this branch with a fix?

C2912 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getThreadIdx<2>(void)' opencv_dnn d:\__dev_data\opencv420\source\modules\dnn\src\cuda\grid_stride_range.hpp 33

It still shows the above error.

@dsgh2 commented Dec 26, 2019

> template <> inline __device__ auto

This works, thank you!

@YashasSamaga commented Dec 26, 2019

@dsgh2 Sorry, I forgot to reply back. I don't really know why that change works. Can you confirm if you are able to use the CUDA backend?

@Rasoul20sh The only CUDA module that the DNN module requires is cudev (in fact, it's a false dependency because of the current CMake scripts). If you do not need other CUDA modules, like opencv_cudaarithm which is causing the error, remove them from the build.

@dsgh2 commented Dec 27, 2019

I have applied stephenvidler's solution on your branch. I have just finished my build and can confirm that the CUDA backend works, as you asked.

FYI, I saw an error in Debug mode, whereas Release mode built fine. The problem was solved by setting INCREMENTAL BUILDING to NO.

@YashasSamaga commented Dec 31, 2019

@dsgh2 @Rasoul20sh @stephenvidler FYI: The build issues (related to VS15 and older versions of MSVC) have been fixed on master.

@Rasoul20sh commented Jan 22, 2020

@YashasSamaga Thanks, it is fixed. Do you know whether this implementation can support U-Net? I get an error when I try to set the input on a U-Net structure.

@davisking commented Jan 25, 2020

@YashasSamaga, out of curiosity, what version of TF was used for these benchmarks and was XLA turned on? Supposedly XLA makes TF run a lot faster (see https://medium.com/tensorflow/pushing-the-limits-of-gpu-performance-with-xla-53559db8e473) but I'm a little skeptical and have seen more mixed results than that medium post would suggest.

@YashasSamaga commented Jan 25, 2020

@davisking If I remember correctly, it was Tensorflow 1.14. XLA wasn't turned on for the benchmark. I tried to benchmark with XLA enabled (using XLA_GPU as device) but kept getting CUDNN_STATUS_INTERNAL_ERROR.

@davisking commented Jan 27, 2020

Not super surprising. XLA has some funky requirements it seems. Thanks :)

@eyebies commented Feb 12, 2020

@YashasSamaga Does DNN YOLO support custom anchors and classes? I seem to have parsing issues.

@YashasSamaga commented Feb 13, 2020

@eyebies Yes, if the OpenCV CPU backend can run your model, it'll most likely work with the CUDA backend as well. If you are unable to run/import your model into OpenCV DNN, you might need to open an issue requesting the feature.

@isra60 commented Feb 13, 2020

Hi @YashasSamaga, can you take a look at this? opencv/opencv#16420
There seems to be something wrong with the detections Mat.

@YashasSamaga commented Feb 13, 2020

@isra60

csresnext50-panet-spp-original-optimal.cfg requires components such as grouped convolutions which are not yet supported in the darknet importer. Hence, even though the network is imported successfully, the outputs are not correct.

enet-coco.cfg requires PR16436 which I will mostly finish off this weekend.

lp-recognition.cfg uses some options in region block (bias_match and jitter) which don't seem to be handled by the darknet importer. I'm not sure what these are.

@whittenator commented Mar 9, 2020

Good Morning,

I have everything set up and working great! Fantastic work! My system has 2 GPUs, is there a way to specify which one to use?

Thanks

@tomingliu commented Apr 26, 2020

Thanks for your great work! Can you help me with the issue below? I'm trying to build v4.2.0 according to your instructions on CentOS 7, but it fails. Please refer to my command:

```sh
cmake -D CMAKE_BUILD_TYPE=RELEASE \
      -D CMAKE_INSTALL_PREFIX=/usr \
      -D INSTALL_PYTHON_EXAMPLES=ON \
      -D INSTALL_C_EXAMPLES=OFF \
      -D OPENCV_ENABLE_NONFREE=ON \
      -D WITH_CUDA=ON \
      -D WITH_CUDNN=ON \
      -D WITH_CUFFT=ON \
      -D OPENCV_DNN_CUDA=ON \
      -D ENABLE_FAST_MATH=1 \
      -D CUDA_FAST_MATH=1 \
      -D CUDA_ARCH_BIN=7.5 \
      -D CUDA_ARCH_PTX="" \
      -D WITH_CUBLAS=ON \
      -D OPENCV_EXTRA_MODULES_PATH=../../opencv_contrib-4.2.0/modules \
      -D BUILD_opencv_freetype=ON \
      -D HAVE_opencv_python3=ON \
      -D BUILD_opencv_apps=OFF \
      -D BUILD_TESTS=ON \
      -D BUILD_EXAMPLES=OFF ..
```

I can build the code tree successfully, but when I run the DNN tests I get the failures below:

bin/opencv_test_dnn

[ FAILED ] 455 tests, listed below:
[ FAILED ] Test_Caffe.memory_read
[ FAILED ] Test_Caffe.read_gtsrb
[ FAILED ] Test_Caffe.read_googlenet
[ FAILED ] Test_Caffe.multiple_inputs
[ FAILED ] Test_Caffe.shared_weights
[ FAILED ] Reproducibility_FCN.Accuracy
[ FAILED ] Reproducibility_SSD.Accuracy
[ FAILED ] Reproducibility_AlexNet_fp16.Accuracy
[ FAILED ] Reproducibility_GoogLeNet_fp16.Accuracy
[ FAILED ] Test_Darknet.read_tiny_yolo_voc
[ FAILED ] Test_Darknet.read_yolo_voc
[ FAILED ] Test_Darknet.read_yolo_voc_stream
[ FAILED ] Layer_LSTM_Test_Accuracy_with_.CaffeRecurrent
[ FAILED ] Layer_RNN_Test_Accuracy_with_.CaffeRecurrent
[ FAILED ] NMS.Accuracy
[ FAILED ] Test_TensorFlow.two_inputs
[ FAILED ] Test_TensorFlow.Mask_RCNN
[ FAILED ] Torch_Importer.simple_read
[ FAILED ] DNNTestNetwork.MobileNet_SSD_Caffe/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.MobileNet_SSD_Caffe/1, where GetParam() = CUDA/CUDA_FP16
[ FAILED ] DNNTestNetwork.MobileNet_SSD_Caffe_Different_Width_Height/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.MobileNet_SSD_Caffe_Different_Width_Height/1, where GetParam() = CUDA/CUDA_FP16
[ FAILED ] DNNTestNetwork.MobileNet_SSD_v1_TensorFlow/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.MobileNet_SSD_v1_TensorFlow/1, where GetParam() = CUDA/CUDA_FP16
[ FAILED ] DNNTestNetwork.MobileNet_SSD_v1_TensorFlow_Different_Width_Height/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.MobileNet_SSD_v1_TensorFlow_Different_Width_Height/1, where GetParam() = CUDA/CUDA_FP16
[ FAILED ] DNNTestNetwork.MobileNet_SSD_v2_TensorFlow/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.MobileNet_SSD_v2_TensorFlow/1, where GetParam() = CUDA/CUDA_FP16
[ FAILED ] DNNTestNetwork.SSD_VGG16/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.SSD_VGG16/1, where GetParam() = CUDA/CUDA_FP16

...

bin/opencv_perf_dnn

...

[ FAILED ] 3 tests, listed below:
[ FAILED ] DNNTestNetwork.YOLOv3/0, where GetParam() = CUDA/CUDA
[ FAILED ] DNNTestNetwork.YOLOv3/1, where GetParam() = CUDA/CUDA_FP16
[ FAILED ] DNNTestNetwork.YOLOv3/2, where GetParam() = OCV/CPU

I've attached my CMake output log file.

@YashasSamaga commented Apr 26, 2020

@tomingliu You need to clone opencv_extra. Run download_models.py in opencv_extra/testdata/dnn, which will download around 13GB (if I remember correctly). Set the OPENCV_TEST_DATA_PATH environment variable to /location/to/opencv_extra/testdata. Now run the tests.

@tomingliu commented Apr 26, 2020

Thank you for your quick reply! I'll try the opencv_extra process later. Actually, I had done a test with a Python script before the DNN testing:

```python
net = cv.dnn.readNetFromDarknet(modelConfiguration, modelWeights)
net.setPreferableBackend(cv.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv.dnn.DNN_TARGET_CUDA_FP16)

blob = cv.dnn.blobFromImage(frame, 1/255, (inpWidth, inpHeight), [0,0,0], 1, crop=False)

net.setInput(blob)

outs = net.forward(getOutputsNames(net))
```

And I got some log messages like this:

[ INFO:0] global /usr/share/fengtu/opencv_cuda/opencv-4.2.0/modules/videoio/src/videoio_registry.cpp (187) VideoBackendRegistry VIDEOIO: Enabled backends(6, sorted by priority): FFMPEG(1000); GSTREAMER(990); INTEL_MFX(980); V4L2(970); CV_IMAGES(960); CV_MJPEG(950)
[ INFO:0] global /usr/share/fengtu/opencv_cuda/opencv-4.2.0/modules/core/src/ocl.cpp (891) haveOpenCL Initialize OpenCL runtime...
[ INFO:0] global /usr/share/fengtu/opencv_cuda/opencv-4.2.0/modules/dnn/src/dnn.cpp (2204) initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "_input" of type NetInputLayer

Could you please give me more hints about what options/instructions I missed when building the code tree?

@YashasSamaga commented Apr 26, 2020

@tomingliu It looks correct. The last INFO message indicates that you have built correctly. It is a genuine message: the CUDA backend does not support the input layer, so preprocessing is done on the CPU and the result is then moved to the GPU. In most cases it's skipped entirely (as in your case, where you haven't specified any scaling or normalization in setInput).

@tomingliu commented Apr 26, 2020

@YashasSamaga Okay, thanks for the prompt reply!

@sunilsomarajan commented May 2, 2020

@YashasSamaga I just compiled OpenCV 4.3.0 on two platforms, Intel and Jetson Nano. Both platforms have CUDA-capable GPUs and I enabled this in the build (-D WITH_CUDA=ON -D WITH_CUDNN=ON -D OPENCV_DNN_CUDA=ON -D ENABLE_FAST_MATH=1 -D CUDA_FAST_MATH=1 -D CUDA_ARCH_BIN="5.3,6.1,7.2" -D WITH_CUBLAS=ON).

(Update: apologies for not reading the history here. Is it due to this comment of yours: https://gist.github.com/YashasSamaga/a84cf2826ab2dc755005321fe17cd15d#gistcomment-3099131?)

I am using a precompiled MobileNet-SSD v2 TensorFlow model from the Google model zoo and running a simple object detection example.

I am seeing really good performance, but also something strange: I see better performance when the lines below are commented out. I see this both on Intel with a 1080 Ti GPU and on the Jetson Nano.

#cvNet.setPreferableBackend(cv.dnn.DNN_BACKEND_CUDA)
#cvNet.setPreferableTarget(cv.dnn.DNN_TARGET_CUDA)

My CMake settings are here: https://gist.github.com/sunilsomarajan/a6133c13aab3b1e0f6fa95f832fbbe1a
Sample code is here: https://gist.github.com/sunilsomarajan/896f104d1c86e323adadd79550087098

Using Tensorflow version -> 1.15
Output on Jetson NANO with DNN_BACKEND not set

1 0.99846137 person
64 0.4629537 potted plant
64 0.35649464 potted plant
72 0.7356328 tv
73 0.6055175 laptop
84 0.8533889 book
84 0.7150121 book
84 0.5513182 book
Time per inference: 4.033365 ms

Output on Jetson NANO with DNN_BACKEND set (the two lines above uncommented)

[INFO] setting preferable backend and target to CUDA...
1 0.99846137 person
64 0.4629538 potted plant
64 0.35649443 potted plant
72 0.73563325 tv
73 0.6055164 laptop
84 0.85338926 book
84 0.71501255 book
84 0.55131876 book
Time per inference: 18.863149 ms

Could you let me know what could be going on ? Great work BTW and this gives us an alternative on Jetson platforms.

@YashasSamaga commented May 3, 2020

@sunilsomarajan This problem will mostly be fixed in the next release. There is a fix for this issue here: opencv/opencv#16900

@sunilsomarajan commented May 3, 2020

@YashasSamaga thank you. I will merge that change locally. Excellent work.

@sunilsomarajan commented May 4, 2020

@YashasSamaga, I pulled in your change on top of 4.3.0. The difference is significant for MobileNet-SSD v2 on Intel with a GTX 1080 Ti. I still have to test on the Jetson platforms.

@sunilsomarajan commented May 4, 2020

@YashasSamaga, some results for single-image inference (mean of 10 runs) on Intel:

Intel with GTX 1080 Ti: MobileNet-SSD v2

python opencv_mobile_ssd.py (4.3.0)
[INFO] setting preferable backend and target to CUDA...
1 0.99846137 person
64 0.46295404 potted plant
64 0.35649517 potted plant
72 0.735632 tv
73 0.60551673 laptop
84 0.8533891 book
84 0.71501225 book
84 0.55131793 book
Time per inference: 65.505838 ms
FPS: 15.265814842071778

python opencv_mobile_ssd.py (4.3.0+ opencv/opencv#16900)
[INFO] setting preferable backend and target to CUDA...
1 0.99846137 person
64 0.4629534 potted plant
64 0.35649383 potted plant
72 0.7356325 tv
73 0.6055173 laptop
84 0.85338897 book
84 0.7150124 book
84 0.55131817 book
Time per inference: 5.713272 ms
FPS: 175.0310476063297

@JulienMaille commented May 4, 2020

I have a question since I started working with inference on CUDA devices: is there a reason why the cuDNN DLL is so big?
It's a real pain to redistribute a 200+ MB DLL, all the more so when you know that only a fraction of your users have a compatible GPU.
Is there a way to make it lighter? In comparison, the OpenVINO redistributable is much smaller.

@YashasSamaga commented May 4, 2020

@JulienMaille cuDNN is being broken into pieces in cuDNN 8.0. OpenCV cannot really do anything to make it lighter, other than adding replacements for the services cuDNN provides (which would make cuDNN optional).

@JulienMaille commented May 4, 2020

Thanks for your reply. Where can I find information about cuDNN 8.0?

@YashasSamaga commented May 4, 2020

@JulienMaille Looks like they have removed it from their release-notes page. They had posted release notes for the early-access cuDNN 8.0.x.x weeks ago; I can find traces on Google (search for "cuDNN 8.0 early access"). When I went through the documentation, they seemed to have split cuDNN into six libraries: three for inference and three for training. Each category had something like basic ops, CNN, and advanced (mostly RNN stuff).

@JulienMaille commented May 16, 2020

Hi @YashasSamaga that's also what I found. Any news since the recent GTC show?

@ynioba commented May 23, 2020

Hi @YashasSamaga, I have a question: I got a second GPU last week. How do I make the other GPU work? Thanks!

@YashasSamaga commented May 24, 2020

@ynioba

There are many ways to make use of multiple GPUs. Here is one which I think is the safest and the least complex: it makes use of the fact that the CUDA runtime library maintains a separate CUDA context for each CPU thread.

Suppose you have N devices:

1. Create N threads.
2. Assign a CUDA device to each thread by calling cudaSetDevice or cv::cuda::setDevice in that thread. Each thread is now associated with a device.
3. Create any number of cv::dnn::Net objects in any of those threads; each network will use the device associated with its thread for memory and computation.

From opencv/opencv#14827; a minimal sketch follows.
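
A minimal sketch of this scheme (the model file names are placeholders; one thread is spawned per visible device):

```cpp
// One CPU thread per GPU: each thread binds a device, builds its own
// cv::dnn::Net, and runs inference on that device.
#include <opencv2/dnn.hpp>
#include <opencv2/core/cuda.hpp>
#include <thread>
#include <vector>

static void worker(int device_id, cv::Mat blob) {
    cv::cuda::setDevice(device_id);  // associate this thread with one GPU

    auto net = cv::dnn::readNet("model.weights", "model.cfg");  // placeholder files
    net.setPreferableBackend(cv::dnn::DNN_BACKEND_CUDA);
    net.setPreferableTarget(cv::dnn::DNN_TARGET_CUDA);

    net.setInput(blob);
    cv::Mat out = net.forward();  // computed on the device set above
}

int main() {
    cv::Mat image(cv::Size(416, 416), CV_8UC3, cv::Scalar::all(127));
    cv::Mat blob = cv::dnn::blobFromImage(image, 1 / 255.0, cv::Size(416, 416));

    std::vector<std::thread> threads;
    for (int i = 0; i < cv::cuda::getCudaEnabledDeviceCount(); i++)
        threads.emplace_back(worker, i, blob);
    for (auto& t : threads)
        t.join();
    return 0;
}
```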

@ynioba commented May 24, 2020

Thanks for your reply, the code works well. I wish you every happiness. Thanks again!

@goodtogood commented May 26, 2020

@YashasSamaga
Thanks for your great efforts. What's an efficient way to use cv::dnn under multi-threaded conditions, e.g. serving web requests?
Thanks!

@YashasSamaga commented May 26, 2020

@goodtogood I didn't understand your question. Most of the computation is carried out on GPU. You can have multiple instances of cv::dnn::Net for the same GPU. This allows you to extract more from each GPU by reducing GPU idle time (big improvements in some cases like in opencv/opencv#17365 (comment)).

@goodtogood commented May 26, 2020

@YashasSamaga
Sorry for my unclear question. Actually, I want to set up an inference service as a web app. It is well known that model initialization often costs a lot of time. At first, I tried initializing only one dnn::Net object and sharing it among threads, but it crashed; it seems it is not thread-safe. So I made a pool of dnn::Net objects initialized in advance, and a free object is taken from the pool for each inference. Is this way correct? Is there a better way to handle this case? Thanks a lot!

@YashasSamaga commented May 26, 2020

@goodtogood That sounds reasonable. You can also try to do inferences in batches if you don't have tight latency requirements. You can have cv::dnn::Net objects initialized for single image inference, batch of two and maybe even four. The throughput increases dramatically as you increase the batch size.

Here are some stats for YOLOv4 on RTX 2080 Ti. The batched inference gives an almost 2x increase in FPS.

| Input Size | Darknet FP16 FPS | OCV FP32 FPS | OCV FP16 FPS | OCV FP32 batch = 4 FPS | OCV FP16 batch = 4 FPS |
| --- | --- | --- | --- | --- | --- |
| 320 x 320 | 105.8 | 129.2 | 171.2 | 198 | 384 |
| 416 x 416 | 85.6 | 99.9 | 146 | 139.6 | 260.5 |
| 512 x 512 | 71.8 | 90.3 | 125.6 | 112.8 | 190.5 |
| 608 x 608 | 56.7 | 56 | 103.2 | 68.5 | 133 |
@goodtogood commented May 26, 2020

@YashasSamaga
Thank you so much for the details. As for FP16, I've tested it using OpenCV DNN with code copied from your gist link, and I didn't get a significant difference (GPU: RTX 2070S).

YOLOv4 608x608, batch = 1:
OCV FP32: 22 FPS
OCV FP16: 26 FPS

A little weird! Is this normal?

@YashasSamaga commented May 26, 2020

@goodtogood Can you share the exact code you used? 22 or 26 FPS seems too low for an RTX 2070S.

@YashasSamaga commented May 26, 2020

@goodtogood Set nms_threshold=0 in all [yolo] blocks in yolov4.cfg. NMS is carried out on CPU and is very inefficient when done during the inference. It's best to disable NMS and perform it after the inference finishes. You will gain significant additional FPS. You can find example code here: https://gist.github.com/YashasSamaga/e2b19a6807a13046e399f4bc3cca3a49
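
For illustration, a rough sketch of that post-inference NMS step (the linked gist is the authoritative example; here the boxes and scores are assumed to have been decoded from the network outputs already, and the helper name and thresholds are arbitrary):

```cpp
// Class-agnostic NMS performed after net.forward(), assuming
// nms_threshold=0 was set in every [yolo] block of the cfg.
#include <opencv2/dnn.hpp>
#include <vector>

std::vector<int> keep_after_nms(const std::vector<cv::Rect>& boxes,
                                const std::vector<float>& scores,
                                float score_threshold = 0.5f,
                                float nms_threshold = 0.4f) {
    std::vector<int> indices;
    // runs on the CPU, but outside the forward pass, so the GPU is
    // already free to work on the next frame
    cv::dnn::NMSBoxes(boxes, scores, score_threshold, nms_threshold, indices);
    return indices;
}
```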

@goodtogood commented May 26, 2020

@YashasSamaga

Driver: 441.22, CUDA: 10.2, cuDNN: 7.6.5
OpenCV commit 713577
Windows 8.1 64-bit, VS2019, OCV 4.3

It's almost copied from your code; I just modified the number of benchmark runs and the backend.
Thanks!

YOLO v4
[CUDA FP32]
init >> 1329.51ms
inference >> min = 45.596ms, max = 49.184ms, mean = 46.7278ms, stddev = 0.57918ms
[CUDA FP16]
init >> 865.449ms
inference >> min = 37.418ms, max = 43.093ms, mean = 39.4826ms, stddev = 1.24976ms

```cpp
#include <iostream>
#include <algorithm>
#include <vector>
#include <chrono>
#include <numeric>

#include <opencv2/dnn.hpp>
#include <opencv2/highgui.hpp>
#include "benchmark.hpp"

#define USE_RANDOM_IMAGES

constexpr auto default_batch_size = 1;

struct mask_type {
    int backend;
    int target;
};

struct config_type {
    std::string name;
    int backend;
    int target;
};

std::vector<config_type> backends = {
    //{"OCV CPU", cv::dnn::DNN_BACKEND_OPENCV, cv::dnn::DNN_TARGET_CPU},
    //{"OCV OpenCL", cv::dnn::DNN_BACKEND_OPENCV, cv::dnn::DNN_TARGET_OPENCL},
    //{"OCV OpenCL FP16", cv::dnn::DNN_BACKEND_OPENCV, cv::dnn::DNN_TARGET_OPENCL_FP16},
    //{"IE CPU", cv::dnn::DNN_BACKEND_INFERENCE_ENGINE, cv::dnn::DNN_TARGET_CPU},

    {"CUDA FP32", cv::dnn::DNN_BACKEND_CUDA, cv::dnn::DNN_TARGET_CUDA},
    {"CUDA FP16", cv::dnn::DNN_BACKEND_CUDA, cv::dnn::DNN_TARGET_CUDA_FP16}
};

std::vector<cv::Mat> image_samples;

template <class T>
auto to_milliseconds(const T& duration) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(duration);
}

template <class T>
auto to_microseconds(const T& duration) {
    return std::chrono::duration_cast<std::chrono::microseconds>(duration);
}

struct perf_result_t
{
    using duration = std::chrono::microseconds;

    duration init_time;
    std::vector<duration> runtimes;
};

template <std::size_t BENCHMARK_RUNS, std::size_t WARMUP_RUNS>
auto run_network(
    const std::string& model, const std::string& config,
    const cv::Mat& blob,
    const std::vector<std::string>& output_names_,
    int backend, int target)
{
    auto net = cv::dnn::readNet(model, config);
    net.setPreferableBackend(backend);
    net.setPreferableTarget(target);

    auto output_names = output_names_;
    if (output_names.empty())
        output_names = net.getUnconnectedOutLayersNames();

    std::vector<cv::Mat> output_mats;
    auto init_time = benchmark([&] {
        net.setInput(blob);
        net.forward(output_mats, output_names);
        });

    for (int i = 0; i < WARMUP_RUNS; i++)
    {
        net.setInput(blob);
        net.forward(output_mats, output_names);
    }

    perf_result_t result;
    result.init_time = init_time;
    result.runtimes.reserve(BENCHMARK_RUNS);

    for (int i = 0; i < BENCHMARK_RUNS; i++)
    {
        net.setInput(blob);
        auto inference_time = benchmark([&] {
            net.forward(output_mats, output_names);
            });

        result.runtimes.push_back(inference_time);
    }

    return result;
}

void bench_network(
    const std::string& model, const std::string& config,
    cv::Size input_size,
    const std::vector<std::string>& output_names = {},
    int count = default_batch_size,
    std::vector<mask_type> mask = {})
{
#ifndef USE_RANDOM_IMAGES
    assert(count <= image_samples.size());
#endif

    std::vector<cv::Mat> images;
    for (int i = 0; i < count; i++)
    {
#ifdef USE_RANDOM_IMAGES
        cv::Mat image(input_size, CV_32FC3);
        cv::randu(image, cv::Scalar(0, 0, 0), cv::Scalar(255, 255, 255));
        images.push_back(image);
#else
        images.push_back(image_samples[i]);
#endif
    }

    cv::Mat blob = cv::dnn::blobFromImages(images, 1.0f, input_size, 0.0f);

    for (auto c : backends) {
        auto backend = c.backend;
        auto target = c.target;

        bool skip = [backend, target, mask] {
            for (auto m : mask) {
                if (m.backend == backend && m.target == target)
                    return true;
                if (m.backend == backend && m.target == -1)
                    return true;
                if (m.backend == -1 && m.target == target)
                    return true;
            }

            return false;
        } ();

        if (skip)
            continue;

        try {
            constexpr int WARMUP_RUNS = 3;
            constexpr int BENCHMARK_RUNS = 400;
            auto result = run_network<BENCHMARK_RUNS, WARMUP_RUNS>(model, config, blob, output_names, backend, target);

            float init_time = to_microseconds(result.init_time).count() / 1000.0;

            std::vector<float> runtimes;
            for (auto r : result.runtimes)
                runtimes.push_back(to_microseconds(r).count() / 1000.0);

            auto sum = std::accumulate(std::begin(runtimes), std::end(runtimes), 0.0f);
            auto squared_sum = std::inner_product(std::begin(runtimes), std::end(runtimes), std::begin(runtimes), 0.0f);

            auto min = *std::min_element(std::begin(runtimes), std::end(runtimes));
            auto max = *std::max_element(std::begin(runtimes), std::end(runtimes));
            auto mean = sum / runtimes.size();
            auto stddev = std::sqrt(squared_sum / runtimes.size() - mean * mean);

            std::cout << '[' << c.name << "]" << '\n'
                << "\tinit >> " << init_time << "ms" << '\n'
                << "\tinference >> " << "min = " << min << "ms, max = " << max << "ms, mean = " << mean << "ms, stddev = " << stddev << "ms" << std::endl;
        }
        catch (const std::exception& ex) {
            std::cout << ex.what() << std::endl;
            return;
        }
    }

    std::cout << std::endl;
}


void bench_yolo_v4()
{
    std::cout << "YOLO v4\n";
    bench_network("./yolov4.cfg", "./yolov4.weights", cv::Size(608, 608));
    std::cout << std::endl;
}


int main()
{    
    bench_yolo_v4();
    return 0;
}
```
@goodtogood commented May 26, 2020

> @goodtogood Set nms_threshold=0 in all [yolo] blocks in yolov4.cfg. NMS is carried out on CPU and is very inefficient when done during the inference. It's best to disable NMS and perform it after the inference finishes. You will gain significant additional FPS. You can find example code here: https://gist.github.com/YashasSamaga/e2b19a6807a13046e399f4bc3cca3a49

YOLO v4
[CUDA FP32]
        init >> 1245.76ms
        inference >> min = 29.934ms, max = 31.181ms, mean = 30.3622ms, stddev = 0.207436ms
[CUDA FP16]
        init >> 876.087ms
        inference >> min = 22.916ms, max = 28.212ms, mean = 24.5076ms, stddev = 1.09143ms

After setting nms_threshold=0 in the three [yolo] layers, the performance improved a lot.
Thank you so much for your patient explanation.

@JulienMaille commented Jun 11, 2020

@YashasSamaga I missed the release of the cuDNN 8.0.0 RC a week ago.
It seems like there are still huge inference DLLs; I'm not sure which ones we need. Have you tried compiling OpenCV with this release?

(screenshot omitted)

@matt-sharp commented Apr 9, 2021

> @goodtogood That sounds reasonable. You can also try to do inferences in batches if you don't have tight latency requirements. You can have cv::dnn::Net objects initialized for single image inference, batch of two and maybe even four. The throughput increases dramatically as you increase the batch size.

How do we control the batch size? Is this what we feed into blobFromImages?

@YashasSamaga commented Apr 9, 2021

> How do we control the batch size? Is this what we feed into blobFromImages?

@matt-sharp Yes. Note that changing input shapes will cause reinitialization (which is time-consuming). Fix a batch size and use it throughout (and in case you have just three images but initialized for a batch size of four, pad a dummy zero image and make a batch of four to avoid reinitialization).

You can also initialize multiple networks for different batch sizes if your GPU memory permits. You can have one net object for single image inference, another for a batch of four, and another for eight. You can use all networks simultaneously (if you want) and use a smaller batch size when you do not have enough jobs to populate the bigger batches.
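
A small sketch of the padding idea (the helper name, scale factor, and batch size are arbitrary here):

```cpp
// Pad a partial batch with dummy zero images so the blob shape stays
// fixed and the network is never reinitialized; outputs for the dummy
// images are simply ignored by the caller.
#include <opencv2/dnn.hpp>
#include <vector>

cv::Mat make_fixed_batch_blob(std::vector<cv::Mat> images,
                              int batch_size, cv::Size input_size) {
    while (static_cast<int>(images.size()) < batch_size)
        images.push_back(cv::Mat::zeros(input_size, CV_8UC3));
    return cv::dnn::blobFromImages(images, 1 / 255.0, input_size);
}
```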

@matt-sharp commented Apr 9, 2021

@YashasSamaga are there any guidelines for the optimal batch size? Which target, DNN_TARGET_CUDA_FP16 or DNN_TARGET_CUDA (FP32)?
I'm using 1x Tesla V100, CUDA 11.2, cuDNN 7.6.5, YOLOv4, image size 608 x 608.

Is there any benefit to initializing multiple networks for the same model and running them in parallel?
Also, is it possible to run batch inference with the high-level Detection Model API, since we don't feed a blob into it?

@YashasSamaga commented Apr 9, 2021

> are there any guidelines for the optimal batch size?

@matt-sharp The performance varies across devices. The latency increases with batch size along with throughput. The batch size you use is largely dependent on your latency requirements. It's a tradeoff between latency and throughput. I'd recommend trying out different batch sizes and choose a batch size that provides substantial improvement compared to the next smaller batch size. The throughput generally always increases with batch size but with diminishing returns.

> Which target, DNN_TARGET_CUDA_FP16 or DNN_TARGET_CUDA (FP32)?

FP16 works great for YOLOv4 with practically no loss in detection performance. Your GPU supports the FP16 target, and I recommend it.

> Is there any benefit to initializing multiple networks for the same model and running them in parallel?

Yes, the GPU is used during inference but stays idle during pre/postprocessing on CPU. If you use multiple threads with one network initialized per thread, the GPU inference workload from one thread will keep the GPU busy while another thread is busy doing the pre/postprocessing. You can also pipeline your entire process into stages: preprocessing, DNN inference, postprocess. With a pipeline, you can keep all stages of the pipeline busy by processing different frames in each stage. For example, you will be preprocessing the next frame while the GPU is computing the forward pass for the current frame and simultaneously you would be postprocessing the previous frame. This increases throughput.

If you're initializing networks with large batch sizes, say 32, then most of the computation goes waste if you had to perform inference on just one image. But reinitializing the network to work with a single image would cost much more time. The idea of initializing multiple networks with different batch sizes is to let you improve latency when you cannot fill an entire batch. If you had just 3 images, you can use the network initialized to process four images and later if you have to process 8 images, you can use a network initialized to work on batches of eight.

> Also, is it possible to run batch inference with the high-level Detection Model API, since we don't feed a blob into it?

Batching is not supported in the high level model API. You can track the feature request here: opencv/opencv#17838

@matt-sharp commented Apr 14, 2021

@YashasSamaga thanks for your reply.

Please can I confirm that the Detection Model API returns bounding-box coordinates in the form absolute (top, left, width, height), whereas net.forward returns absolute (centre x, centre y, width, height)?
