@atinfinity
Last active December 5, 2019 10:21
Reproduction code for slow cudaMallocPitch in GpuMat
#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <iostream>

int main(int argc, const char * argv[])
{
    cudaFree(0); // dummy call (forces CUDA context initialization up front)

    const size_t width    = 256;
    const size_t height   = 256;
    const size_t elemSize = 12; // bytes per CV_32FC3 element (3 channels x 4-byte float)

    for (int i = 0; i < 5; i++)
    {
#if 1
        // Path A: allocate through GpuMat.
        // cudaMallocPitch is slow (only on the first call).
        cv::cuda::GpuMat d_img(cv::Size(width, height), CV_32FC3);
#else
        // Path B: call cudaMallocPitch directly with the same arguments.
        size_t step = 0;
        unsigned char *data = NULL;
        cudaMallocPitch(&data, &step, elemSize * width, height);
        cudaFree(data);
#endif
    }
    return 0;
}

atinfinity commented Feb 18, 2018

Test environment

Item                   Value
CPU                    Intel Core i7-6700HQ 2.60GHz
Memory                 64GB
GPU                    NVIDIA GeForce GTX 1060 / 6GB
OS                     Windows 10 Pro 64bit
Visual Studio version  Visual Studio 2015
CUDA                   CUDA 9.1

nvprof logs

GpuMat

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   83.17%  137.37ms         6  22.895ms  152.10us  136.31ms  cudaFree
                   15.86%  26.189ms         5  5.2377ms  277.73us  24.990ms  cudaMallocPitch
                    0.75%  1.2389ms       100  12.389us     395ns  395.06us  cuDeviceGetAttribute
                    0.21%  351.61us         2  175.80us  156.45us  195.16us  cuDeviceGetName
                    0.01%  16.592us         2  8.2960us  7.5060us  9.0860us  cuDeviceTotalMem
                    0.00%  3.5560us         3  1.1850us     395ns  2.7660us  cuDeviceGet
                    0.00%  2.3720us         4     593ns     395ns  1.1850us  cuDeviceGetCount
                    0.00%     791ns         1     791ns     791ns     791ns  cuDevicePrimaryCtxRelease
                    0.00%     790ns         1     790ns     790ns     790ns  cuDriverGetVersion
                    0.00%     790ns         1     790ns     790ns     790ns  cuInit

Calling cudaMallocPitch directly

The arguments passed to cudaMallocPitch match the ones GpuMat uses internally.

            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
      API calls:   75.07%  146.77ms         6  24.461ms  188.05us  145.56ms  cudaFree
                   22.90%  44.765ms         1  44.765ms  44.765ms  44.765ms  cuDevicePrimaryCtxRelease
                    0.96%  1.8852ms         5  377.05us  318.42us  401.38us  cudaMallocPitch
                    0.84%  1.6391ms        55  29.802us     395ns  856.49us  cuDeviceGetAttribute
                    0.21%  415.21us         1  415.21us  415.21us  415.21us  cuDeviceGetName
                    0.02%  30.420us         1  30.420us  30.420us  30.420us  cuDeviceTotalMem
                    0.00%  4.7410us         2  2.3700us     395ns  4.3460us  cuDeviceGet
                    0.00%  3.1610us         3  1.0530us     395ns  1.9760us  cuDeviceGetCount

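Why on earth is the cudaMallocPitch call made during GpuMat construction so slow? In the logs above, its slowest call takes 24.990 ms through GpuMat but only 401.38 us when called directly, while the fastest calls are comparable (277.73 us vs. 318.42 us).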

@keineahnung2345

this answer could be helpful.

@atinfinity

@keineahnung2345 Thank you for your information!
