Last active January 19, 2022 16:51
[UNOFFICIAL] Summary of the CUDA backend in OpenCV DNN


This gist is unofficial. I created it for personal use but have kept it public in case it is of use to others. This document is not updated regularly and may not reflect the current status of the CUDA backend.

Internal Dependencies

The minimum set of dependencies required to use the CUDA backend in OpenCV DNN is:


You might also require the following to read/write/display images and videos:


You will require the following to run the tests:


You also have to set BUILD_TESTS and BUILD_PERF_TESTS.

External Dependencies

The CUDA backend requires CUDA Toolkit (min: 9.2) and cuDNN (min: 7.5) to be installed on the system. CMake will automatically detect CUDA Toolkit and cuDNN when the following options are set:


The CUDA backend is enabled by setting the following option:
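After a build configured with CUDA and cuDNN, one quick sanity check is to read OpenCV's own build report from Python. The sketch below is only an illustration: the helper names `parse_build_info`, `cuda_dnn_ready` and `check_installed_opencv` are hypothetical, and it assumes the report contains lines of the form `NVIDIA CUDA: YES (...)` and `cuDNN: YES (...)`, which may differ between OpenCV versions.

```python
def parse_build_info(report):
    """Map the 'key: value' lines of an OpenCV build report to a dict."""
    flags = {}
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            flags[key.strip()] = value.strip()
    return flags


def cuda_dnn_ready(report):
    """True if the report says both CUDA and cuDNN were enabled.

    Assumes the labels 'NVIDIA CUDA' and 'cuDNN' (an assumption about the
    report layout, not a documented contract).
    """
    flags = parse_build_info(report)
    return (flags.get("NVIDIA CUDA", "").startswith("YES")
            and flags.get("cuDNN", "").startswith("YES"))


def check_installed_opencv():
    """Run the check against the OpenCV build that Python imports."""
    import cv2  # imported lazily so the parser can be used without OpenCV
    return cuda_dnn_ready(cv2.getBuildInformation())
```

Calling `check_installed_opencv()` returns `False` when either component is missing from the build; `cv2.cuda.getCudaEnabledDeviceCount()` can then confirm that a device is actually visible at run time.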


Running tests

  1. Clone opencv_extra repository
  2. cd opencv_extra/testdata/dnn
  3. python3
  4. cd path/to/opencv/repository
  5. cd build
  6. export OPENCV_TEST_DATA_PATH=/path/to/opencv_extra/testdata
  7. Run bin/opencv_test_dnn
  8. Refer to this guide to use perf tests to compare performance between versions
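Steps 4 to 7 above can also be driven from a small script. This is a minimal sketch, assuming the test binary is at `bin/opencv_test_dnn` inside the build directory as in step 7; the function names are hypothetical, and the optional filter is passed through GoogleTest's standard `--gtest_filter` flag.

```python
import os
import subprocess


def dnn_test_invocation(build_dir, testdata_dir, gtest_filter=None):
    """Return (command, environment) for running the DNN tests."""
    build_dir = os.path.abspath(build_dir)
    command = [os.path.join(build_dir, "bin", "opencv_test_dnn")]
    if gtest_filter:
        # Standard GoogleTest flag; the pattern itself is up to the caller.
        command.append("--gtest_filter=" + gtest_filter)
    # Copy the current environment and add the test data location (step 6).
    env = dict(os.environ, OPENCV_TEST_DATA_PATH=testdata_dir)
    return command, env


def run_dnn_tests(build_dir, testdata_dir, gtest_filter=None):
    """Run the tests from the build directory and return the exit code."""
    command, env = dnn_test_invocation(build_dir, testdata_dir, gtest_filter)
    return subprocess.run(command, env=env, cwd=os.path.abspath(build_dir)).returncode
```

For example, `run_dnn_tests("path/to/opencv/repository/build", "/path/to/opencv_extra/testdata")` runs the full suite; a non-zero return value means at least one test failed or the binary could not complete.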


The CUDA backend can be selected by choosing one of the following backend/target options:

Backend Target

A CC 5.3+ device is required to use DNN_TARGET_CUDA_FP16. Note that not all CUDA devices offer high FP16 throughput, so DNN_TARGET_CUDA_FP16 may perform worse than DNN_TARGET_CUDA. You can check whether your device supports high FP16 throughput in the CUDA Programming Guide.
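The compute capability rule above can be expressed as a small helper. This is a sketch under stated assumptions: `pick_cuda_target` and `use_cuda_backend` are hypothetical names, the compute capability is supplied by the caller, and the `dnn` argument is expected to be the `cv2.dnn` module. Meeting the CC 5.3 requirement does not guarantee that FP16 is faster, so it is still worth benchmarking both targets.

```python
def pick_cuda_target(cc_major, cc_minor, prefer_fp16=True):
    """Return the name of the cv2.dnn target constant to use."""
    # Tuple comparison: (5, 3) <= (cc_major, cc_minor) means CC 5.3 or newer.
    if prefer_fp16 and (cc_major, cc_minor) >= (5, 3):
        return "DNN_TARGET_CUDA_FP16"
    return "DNN_TARGET_CUDA"


def use_cuda_backend(net, dnn, cc_major, cc_minor, prefer_fp16=True):
    """Point a cv2.dnn.Net at the CUDA backend; `dnn` is the cv2.dnn module."""
    target_name = pick_cuda_target(cc_major, cc_minor, prefer_fp16)
    net.setPreferableBackend(dnn.DNN_BACKEND_CUDA)
    net.setPreferableTarget(getattr(dnn, target_name))
    return target_name
```

Typical use would be `use_cuda_backend(net, cv2.dnn, 7, 5)` for a CC 7.5 device, or `prefer_fp16=False` to force the FP32 target.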


Support Matrix

The CUDA backend uses OpenCV's CPU backend as a fallback for unsupported layers and partially supported layers with unsupported configurations.

Layer Status Note
Slice ✔️
Split ✔️
Concat ✔️
Reshape ✔️
Flatten ✔️
Resize, Interp (nearest neighbor, bilinear) ✔️
CropAndResize ✔️
Convolution 1D ✔️ (OpenCV 4.5.2)
Convolution 2D ✔️
Convolution 3D ✔️
Deconvolution 2D broken
Deconvolution 3D broken
MaxPooling 1D ✔️ (OpenCV 4.5.2)
MaxPooling 2D ✔️
MaxPooling 3D ✔️
AveragePooling 1D ✔️ (OpenCV 4.5.2)
AveragePooling 2D ✔️
AveragePooling 3D ✔️
MaxPoolingWithIndices 2D ✔️
MaxPoolingWithIndices 3D ✔️
MaxUnpool 2D ✔️
MaxUnpool 3D ✔️
ROI Pooling ✔️
PSROI Pooling
LRN ✔️
InnerProduct (constant weights) ✔️
MatMul (runtime blobs) ✔️ (OpenCV 4.5.3)
Softmax ✔️
LogSoftmax ✔️
MVN ✔️ (OpenCV 4.5.0)
ReLU (with configurable negative slope) ✔️
ReLU6 (with configurable ceil and floor) ✔️
Channelwise Parametric ReLU ✔️
Sigmoid ✔️
TanH ✔️
Swish ✔️
Mish ✔️
ELU ✔️
Abs ✔️
Power (configurable exp, scale and shift) ✔️
Batch Normalization ✔️
Const ✔️
Crop ✔️
Eltwise (sum, product, div, max) ✔️
Weighted Eltwise (sum) ✔️
Shortcut (sum) ✔️ (OpenCV 4.3.0)
Permute ✔️
ShuffleChannel ✔️
PriorBox ✔️
Reorg ✔️
Region ✔️ scale_xy parameter added in OpenCV 4.4.0
DetectionOutput ✔️ (OpenCV 4.5.0)
Normalization (L1, L2) ✔️
Shift ✔️
Padding (constant padding, reflection101 padding) ✔️
Scale ✔️
LSTM Layer
RNN Layer

Please help me. Why is OpenCV DNN slower on the GPU than on the CPU when I use YOLOv4 to detect objects in an image?
OpenCV 4.5.1, CUDA 11.2, cuDNN 8.1.0, GPU: 1660 Ti
Sorry, my English is bad.

Can you share the code you used?

npvu1510 commented Sep 4, 2021

It's here.
The GPU is many times slower than the CPU. I have built and installed OpenCV successfully and there are no errors.

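import cv2
import time

# Placeholder paths (not from the original post); substitute your own files.
CONFIG_FILE = "yolov4.cfg"
WEIGHTS_FILE = "yolov4.weights"
IMAGE_FILE = "image.jpg"

net = cv2.dnn.readNet(CONFIG_FILE, WEIGHTS_FILE)

# Select the CUDA backend (assumed; the posted snippet does not show this step).
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)

# forward() expects layer names, not the indices from getUnconnectedOutLayers().
output_layer_names = net.getUnconnectedOutLayersNames()

image = cv2.imread(IMAGE_FILE)
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (608, 608), swapRB=True, crop=False)
net.setInput(blob)  # the network needs an input before forward()

# This times the first forward pass only.
start = time.time()
layerOutputs = net.forward(output_layer_names)
end = time.time()
print("[FORWARD] took {:.6f} seconds".format(end - start))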

@PhanVu1510 OpenCV DNN performs lazy initialization in the first forward pass. The first forward pass includes the time to allocate memory, create handles, etc. Initializing the CUDA backend happens to be really slow compared to initializing the CPU backends. As a result, timing only the first pass makes the CUDA backend look slower than the CPU backend.

Ignore the first forward call and measure time from the second forward pass onwards.

Example code:
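A minimal sketch of such a measurement (not the original snippet from this comment): it assumes the network has already been configured and given an input, and it takes the forward call as a plain callable. The helper name `time_forward` and its `warmup` and `runs` parameters are hypothetical.

```python
import time


def time_forward(forward, warmup=1, runs=10):
    """Average seconds per call to `forward`, excluding warm-up calls."""
    for _ in range(warmup):
        forward()  # the first call pays for lazy initialization
    start = time.perf_counter()
    for _ in range(runs):
        forward()
    return (time.perf_counter() - start) / runs
```

With a `cv2.dnn.Net` this could be used as `avg = time_forward(lambda: net.forward(output_layer_names))` after `net.setInput(blob)`, so the reported figure reflects steady-state inference rather than initialization.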

npvu1510 commented Sep 4, 2021

Thank you!!!
It achieves 30 frames per second at 416x416. Is there any other way to increase the FPS on my GPU?
I want to use 800x800, but that only gives 9-10 FPS.

YashasSamaga commented Sep 4, 2021

@PhanVu1510 You can try pipelining to gain more FPS. You can also trade latency for throughput. Batched inference will give you higher throughput with higher latency. You can also use multiple cv::dnn::Net objects to do inference in parallel. This will help minimize GPU idle time. Again, this gives higher throughput at the cost of higher latency. If your application is not latency-critical, you should try using multiple Net objects and batched inference. You might gain anywhere from a few dozen percent to double the FPS.
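As an illustration of the batched inference idea, the sketch below groups frames into fixed-size batches and runs one forward pass per batch. It is an assumption-laden outline rather than a tuned implementation: `chunked` and `run_batched` are hypothetical helpers, and `make_blob` is a caller-supplied function that turns a list of images into a single input blob (for example by wrapping `cv2.dnn.blobFromImages`).

```python
from itertools import islice


def chunked(frames, batch_size):
    """Yield lists of up to `batch_size` consecutive frames."""
    iterator = iter(frames)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def run_batched(net, frames, batch_size, output_names, make_blob):
    """Run one forward pass per batch and collect the raw outputs."""
    results = []
    for chunk in chunked(frames, batch_size):
        net.setInput(make_blob(chunk))  # one blob holding the whole batch
        results.append(net.forward(output_names))
    return results
```

A possible `make_blob` is `lambda imgs: cv2.dnn.blobFromImages(imgs, 1 / 255.0, (416, 416), swapRB=True, crop=False)`. Larger batches delay the first result, which is the latency cost mentioned above, so the batch size is best chosen by measurement.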

Has anyone benchmarked 3080 GPUs? Last time I tried, the first convolution took 30+ seconds!

I want to use YOLOv4-P5 Darknet at 896x896 (mentioned here), but I don't know how to configure it for 1 class. Can you help me?
Thank you.
