OpenCV CUDA Integration

Contents

Introduction
C++
Python
Practical Notes

Introduction

If OpenCV is compiled with CUDA support, we can access a subset of algorithms that are implemented for NVIDIA GPUs and enjoy GPU acceleration for the DNN module. However, CUDA might not be the most suitable solution to our problems. Whenever we want to use CUDA implementations of certain functions, we should first check whether it makes sense. This means:

evaluating if using CUDA makes sense in a deployment scenario at all (hardware/cloud instance availability/costs)
benchmarking against CPUs (batch sizes, latencies, input sources/pipelines can be important)
benchmarking against integrated GPUs or GPUs from other manufacturers using OpenCL via cv::UMat if suitable kernels are available
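
The benchmarking steps above can be sketched with a small timing harness. This is a minimal sketch, not a definitive methodology: `benchmark` and `cpu_candidate` are hypothetical names, and the workload is a placeholder that would be replaced by the CPU, OpenCL or CUDA code path under test.

```python
import statistics
import time


def benchmark(fn, *args, warmup=3, runs=20):
    """Return (median, stdev) of the wall-clock time of fn(*args) in seconds."""
    for _ in range(warmup):  # warm-up runs are discarded (caches, lazy initialization)
        fn(*args)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), statistics.stdev(timings)


def cpu_candidate(data):
    # Placeholder workload; swap in the code path that should be measured.
    return sum(x * x for x in data)


median_s, stdev_s = benchmark(cpu_candidate, list(range(10_000)))
print(f"median: {median_s * 1e6:.1f} us, stdev: {stdev_s * 1e6:.1f} us")
```

When a GPU code path is timed this way, the measured function has to include the host <-> device transfers and has to wait for asynchronous work to finish before returning, otherwise only the launch overhead is measured.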

NB!: If a deployment scenario includes (small(er)) deep neural networks, other considerations such as Vulkan support etc. are to be considered as well.

For hints on how to use NVIDIA GPUs (CUDA or NVENC) with OpenCV, please have a look at Frame I/O.

C++

A tight integration of OpenCV’s CUDA capabilities with other libraries seems to be a lot more common when using the C++ API. However, documentation is hardly available. When it comes to neural networks, the main purpose of using OpenCV’s CUDA functionality seems to be capturing frames directly to the GPU (or moving them from host to device in a different thread) and some very basic pre-processing, even though many other computer vision algorithms are available (but not for every data type). Moreover, some parts of OpenCV’s CUDA capabilities, especially interesting parts that are not exposed via its API, are used for the CUDA version of the DNN (Deep Neural Network) module.

OpenCV CUDA Introduction

All CUDA functionalities are part of the contrib repo (extra modules). This means that not all of them can be considered stable and well maintained. These CUDA modules are:

// from https://docs.opencv.org/4.x/index.html
cudaarithm: Operations on Matrices
cudabgsegm: Background Segmentation
cudacodec: Video Encoding/Decoding
cudafeatures2d: Feature Detection and Description
cudafilters: Image Filtering
cudaimgproc: Image Processing
cudalegacy: Legacy support
cudaobjdetect: Object Detection
cudaoptflow: Optical Flow
cudastereo: Stereo Correspondence
cudawarping: Image Warping
cudev: Device layer

All OpenCV CUDA modules require/work on cv::cuda::GpuMat. Therefore, this is the most important class, which is roughly equivalent to cv::Mat. A cv::Mat (cv::UMat works as well) can be passed to the cv::cuda::GpuMat constructor or uploaded to a pre-allocated cv::cuda::GpuMat using the upload function:

cv::cuda::GpuMat gFrame(frame_cv_mat);

// or
cv::cuda::GpuMat gFrame_pre_alloc(cv::Size(1280,720), CV_8UC3);
gFrame_pre_alloc.upload(frame_cv_mat);

Moving the image back from device to host requires usage of the download function:

cv::Mat frame_host(cv::Size(1280,720), CV_8UC3);
gFrame.download(frame_host);

Many CUDA implementations of OpenCV functions are identical or similar to their CPU/UMat equivalents; therefore, no additional introduction is needed. NB!: Not all functions are available as CUDA functions.

Whenever a cv::cuda::GpuMat (this applies to cv::Mat as well) is handed over to a different library (via a pointer), it is vital to ensure that the memory is contiguous. This can be done easily using:

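// opencv: clone() may allocate pitched (row-padded) memory again,
// so copy into a buffer created with cv::cuda::createContinuous instead
if (!gFrame.isContinuous()) {
   cv::cuda::GpuMat gFrame_cont;
   cv::cuda::createContinuous(gFrame.rows, gFrame.cols, gFrame.type(), gFrame_cont);
   gFrame.copyTo(gFrame_cont);
   gFrame = gFrame_cont;
}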

// libtorch
tensor_in = tensor_in.contiguous();
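
What "contiguous" means here can be illustrated on the host with NumPy. This is only an analogy under stated assumptions: the row padding (`pitch`) is chosen arbitrarily and stands in for the pitched rows that a device allocation may have.

```python
import numpy as np

rows, cols, pitch = 4, 6, 8  # pitch: allocated elements per row (assumed padding of 2)
backing = np.arange(rows * pitch, dtype=np.uint8).reshape(rows, pitch)

# A view on the first `cols` elements of every row shares memory with
# `backing`, so each row is followed by unused padding bytes.
padded_view = backing[:, :cols]
print(padded_view.flags["C_CONTIGUOUS"])  # False
print(padded_view.strides)                # (8, 1)

# Copying into a tightly packed buffer removes the padding.
packed = np.ascontiguousarray(padded_view)
print(packed.flags["C_CONTIGUOUS"])       # True
print(packed.strides)                     # (6, 1)
```

A library that only receives the raw pointer and assumes packed rows would read the padding bytes of `padded_view` as pixel data, which is why the check matters before handing memory over.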

OpenCV GpuMat and Libtorch

Using libtorch to deploy neural networks is common, though it seems to be getting replaced by tensorrt nowadays. Anyhow, there are some applications where the flexibility of libtorch is more important than the speed-ups of tensorrt, and sometimes they are even combined. Depending on which pre-processing algorithms are required, it makes sense to use existing CUDA implementations (if available) of OpenCV and simply pass the CUDA pointer to libtorch and “convert it into a tensor” without an additional host <-> device transfer:

// providing a fp32 tensor
cv::cuda::GpuMat gFrame_in(cv::Size(1280,720), CV_8UC3);
gFrame_in.upload(frame);

cv::cuda::GpuMat gFrame_fp32(cv::Size(1280,720), CV_32FC3);
gFrame_in.convertTo(gFrame_fp32, CV_32FC3);

// prepare for libtorch tensor conversion
int64_t step = gFrame_fp32.step/(sizeof(float));

std::vector<int64_t> tensor_shape = {
    1, // batch size (N)
    static_cast<int64_t>(gFrame_fp32.channels()), // C
    static_cast<int64_t>(gFrame_fp32.rows), // H
    static_cast<int64_t>(gFrame_fp32.cols), // W
};

std::vector<int64_t> strides = {
    1,
    1,
    step,
    static_cast<int64_t>(gFrame_fp32.channels())
};

// no-op deleter (torch::Deleter is essentially std::function<void(void*)>):
// the GpuMat keeps ownership of the memory
auto deleter = [](void*) {};

torch::Tensor torch_tensor = torch::from_blob(
    gFrame_fp32.data,
    tensor_shape,
    strides,
    deleter,
    torch::kCUDA);

auto tensor_shape_out = torch_tensor.sizes();
std::cout << tensor_shape_out << std::endl;
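
The stride arithmetic used for the from_blob call can be checked on the host with NumPy. This is a sketch under assumptions: the sizes and the row pitch are made up, and a NumPy buffer stands in for the device allocation. Note that NumPy strides are given in bytes, whereas libtorch strides are given in elements.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

H, W, C = 4, 5, 3
pitch = 16  # floats per row, >= W * C (assumed row padding)
buf = np.zeros(H * pitch, dtype=np.float32)  # stands in for the device allocation
item = buf.itemsize

# Interleaved HWC image with padded rows, like a pitched CV_32FC3 GpuMat.
hwc = as_strided(buf, shape=(H, W, C), strides=(pitch * item, C * item, item))
hwc[...] = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)

# Strides in elements for an NCHW view of the same memory:
# C -> 1 (channels are interleaved), H -> pitch, W -> number of channels.
# The batch stride is irrelevant because the batch size is 1.
strides_elems = (H * pitch, 1, pitch, C)
nchw = as_strided(buf, shape=(1, C, H, W),
                  strides=tuple(s * item for s in strides_elems))

expected = hwc.transpose(2, 0, 1)[np.newaxis]
print(np.array_equal(nchw, expected))
```

The resulting view is not contiguous, so a consumer that needs packed NCHW memory still has to make a copy (for example via contiguous() on the libtorch side).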

An alternative would be pre-allocating an empty tensor of correct dimension and use cudaMemcpy.

Converting a torch::Tensor to a cv::cuda::GpuMat is a less common conversion, as it highly depends on the output of a neural network and what should be done with it. Applying a binary mask and displaying/streaming the result would be a simple thing to do. However, even something as simple as drawing the resulting bounding boxes and putting some text on an image is not available in cv::cuda. It would be required to write custom CUDA kernels for that, which might actually be less efficient than copying the image back to the host, drawing the bounding boxes on it and copying it back to the GPU for encoding (or displaying).

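// permute only changes the view, so force a packed NHWC layout before wrapping the pointer
torch_tensor = torch_tensor.permute({0, 2, 3, 1}).contiguous();
torch_tensor = torch_tensor.squeeze(0);
cv::cuda::GpuMat gFrame_from_tensor(cv::Size(1280,720), CV_32FC3, torch_tensor.data_ptr());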

Have a look at this page to learn more about moving data from/to PyTorch/libtorch.

OpenCV GpuMat and TensorFlow

Using TensorFlow with any data that was allocated outside the framework is a pain in the ass. You may have to consider using CPPFlow, which hooks into TensorFlow’s C API and does not require TensorFlow’s weird build system. Other approaches that used to work with TensorFlow 1.xx do not seem to work with TensorFlow 2.xx anymore.

OpenCV GpuMat and tensorrt

Copying a cv::cuda::GpuMat into a data structure that is suitable for tensorrt (or most models converted to tensorrt) requires changing the memory layout up front (CHW instead of HWC). The most common way seems to be using cv::cuda::split and copying each plane (one per color channel) to a cv::cuda::GpuMat in a vector, with each cv::cuda::GpuMat addressing pre-allocated CUDA memory. Don’t forget that CV_32FC3 turns into CV_32FC1 during this operation! This would look like this:

std::vector<cv::cuda::GpuMat> color_channel_split;
for (size_t i = 0; i < gFrame.channels(); ++i)
{
    color_channel_split.emplace_back(cv::cuda::GpuMat(cv::Size(1280,720), CV_32FC1, (float *) buffers[0] + i * 1280 * 720));
}

cv::cuda::split(gFrame, color_channel_split);

buffers[0] is the first element (inputs) of a std::vector<void*> that is initialized while unpacking/loading the tensorrt engine. For a fully worked example, please have a look at this source code from learnopencv.
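
The layout change performed by the split can be reproduced on the host with NumPy. This is a sketch under assumptions: the sizes are arbitrary and `input_buffer` is a hypothetical stand-in for the pre-allocated device memory behind buffers[0].

```python
import numpy as np

H, W, C = 2, 3, 3
hwc = np.arange(H * W * C, dtype=np.float32).reshape(H, W, C)  # interleaved image

# One flat, pre-allocated buffer that will hold the planar (CHW) data.
input_buffer = np.empty(C * H * W, dtype=np.float32)

# One single-channel view per color plane, each addressing its own
# H * W slice of the flat buffer (the role of the GpuMat vector).
planes = [input_buffer[i * H * W:(i + 1) * H * W].reshape(H, W) for i in range(C)]

# "split": write every channel into its plane.
for i, plane in enumerate(planes):
    plane[...] = hwc[:, :, i]

chw = input_buffer.reshape(C, H, W)
print(np.array_equal(chw, hwc.transpose(2, 0, 1)))
```

Because every plane is a view into the same flat buffer, no extra copy is needed after the split; the buffer already holds the data in CHW order.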

The most comfortable way of copying data from tensorrt back to a cv::cuda::GpuMat is to take a detour via libtorch. In this case torch::from_blob would be applied in the usual way, and moving the result back to a cv::cuda::GpuMat is not a big deal after that.

Python

Custom CUDA kernels and the OpenCV Python API are hardly documented at all. However, since this pull request was integrated into OpenCV, it is now possible to access images (matrices) stored on a CUDA device (type: cv2.cuda_GpuMat). This is rather practical when we want to utilize more complex computer vision algorithms provided by OpenCV (contrib) but also need to run custom kernels afterwards.
Using cv2.cuda_GpuMat for custom CUDA kernels is not well documented and yes, it is kind of a pain in the ass to do. Further, writing custom OpenCV extensions/modules is even more time consuming unless we are super familiar with the OpenCV source code, especially the CUDA part. In general, there is a lack of documentation of the OpenCV CUDA part, at least from a practical perspective. Most algorithms are wrapped in similar classes as their CPU counterparts, but sometimes there are very small differences that are super time consuming to figure out.

NB!: If we want to train a neural network, it might be more valuable to do all the pre-processing on the CPU instead of on the GPU for better load balancing and higher overall performance. However, for inference it might be beneficial to go full CUDA, especially when it is possible to use GPUDirect RDMA.

OpenCV CUDA Intro

To get a rudimentary idea of what is offered with CUDA support, we can use inspect:
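
The call that produces such a listing is not shown on this page. A minimal sketch of how it could be generated follows; `list_members` is a hypothetical helper, and the cv2 part only runs if an OpenCV build is installed (the CUDA entries only appear in a CUDA-enabled build).

```python
import inspect
import pprint


def list_members(module, prefix=""):
    """Return (name, object) pairs of a module whose names start with prefix."""
    return [(name, obj) for name, obj in inspect.getmembers(module)
            if name.startswith(prefix)]


try:
    import cv2
except ImportError:  # OpenCV is optional for this sketch
    cv2 = None

if cv2 is not None:
    pprint.pprint(list_members(cv2, "cuda"))   # cv2.cuda_* and cv2.cudacodec*
    if hasattr(cv2, "cuda"):
        pprint.pprint(list_members(cv2.cuda))  # members of cv2.cuda
```

The listings below are the output recorded for the OpenCV build used on this page: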

 ('cuda_BackgroundSubtractorMOG', cv2.cuda_BackgroundSubtractorMOG),
 ('cuda_BackgroundSubtractorMOG2', cv2.cuda_BackgroundSubtractorMOG2),
 ('cuda_BroxOpticalFlow', cv2.cuda_BroxOpticalFlow),
 ('cuda_BufferPool', cv2.cuda_BufferPool),
 ('cuda_CLAHE', cv2.cuda_CLAHE),
 ('cuda_CannyEdgeDetector', cv2.cuda_CannyEdgeDetector),
 ('cuda_CascadeClassifier', cv2.cuda_CascadeClassifier),
 ('cuda_Convolution', cv2.cuda_Convolution),
 ('cuda_CornernessCriteria', cv2.cuda_CornernessCriteria),
 ('cuda_CornersDetector', cv2.cuda_CornersDetector),
 ('cuda_DFT', cv2.cuda_DFT),
 ('cuda_DenseOpticalFlow', cv2.cuda_DenseOpticalFlow),
 ('cuda_DensePyrLKOpticalFlow', cv2.cuda_DensePyrLKOpticalFlow),
 ('cuda_DescriptorMatcher', cv2.cuda_DescriptorMatcher),
 ('cuda_DeviceInfo', cv2.cuda_DeviceInfo),
 ('cuda_DisparityBilateralFilter', cv2.cuda_DisparityBilateralFilter),
 ('cuda_Event', cv2.cuda_Event),
 ('cuda_FarnebackOpticalFlow', cv2.cuda_FarnebackOpticalFlow),
 ('cuda_FastFeatureDetector', cv2.cuda_FastFeatureDetector),
 ('cuda_Feature2DAsync', cv2.cuda_Feature2DAsync),
 ('cuda_Filter', cv2.cuda_Filter),
 ('cuda_GpuData', cv2.cuda_GpuData),
 ('cuda_GpuMat', cv2.cuda_GpuMat),
 ('cuda_GpuMatND', cv2.cuda_GpuMatND),
 ('cuda_GpuMat_Allocator', cv2.cuda_GpuMat_Allocator),
 ('cuda_HOG', cv2.cuda_HOG),
 ('cuda_HostMem', cv2.cuda_HostMem),
 ('cuda_HoughCirclesDetector', cv2.cuda_HoughCirclesDetector),
 ('cuda_HoughLinesDetector', cv2.cuda_HoughLinesDetector),
 ('cuda_HoughSegmentDetector', cv2.cuda_HoughSegmentDetector),
 ('cuda_LookUpTable', cv2.cuda_LookUpTable),
 ('cuda_NvidiaHWOpticalFlow', cv2.cuda_NvidiaHWOpticalFlow),
 ('cuda_NvidiaOpticalFlow_1_0', cv2.cuda_NvidiaOpticalFlow_1_0),
 ('cuda_NvidiaOpticalFlow_2_0', cv2.cuda_NvidiaOpticalFlow_2_0),
 ('cuda_ORB', cv2.cuda_ORB),
 ('cuda_OpticalFlowDual_TVL1', cv2.cuda_OpticalFlowDual_TVL1),
 ('cuda_SURF_CUDA', cv2.cuda_SURF_CUDA),
 ('cuda_SparseOpticalFlow', cv2.cuda_SparseOpticalFlow),
 ('cuda_SparsePyrLKOpticalFlow', cv2.cuda_SparsePyrLKOpticalFlow),
 ('cuda_StereoBM', cv2.cuda_StereoBM),
 ('cuda_StereoBeliefPropagation', cv2.cuda_StereoBeliefPropagation),
 ('cuda_StereoConstantSpaceBP', cv2.cuda_StereoConstantSpaceBP),
 ('cuda_StereoSGM', cv2.cuda_StereoSGM),
 ('cuda_Stream', cv2.cuda_Stream),
 ('cuda_TargetArchs', cv2.cuda_TargetArchs),
 ('cuda_TemplateMatching', cv2.cuda_TemplateMatching),
 ('cudacodec', <module 'cv2.cudacodec'>),
 ('cudacodec_EncoderCallBack', cv2.cudacodec_EncoderCallBack),
 ('cudacodec_EncoderParams', cv2.cudacodec_EncoderParams),
 ('cudacodec_RawVideoSource', cv2.cudacodec_RawVideoSource),
 ('cudacodec_VideoReader', cv2.cudacodec_VideoReader),
 ('cudacodec_VideoWriter', cv2.cudacodec_VideoWriter)

# members of cv2.cuda

('ALPHA_ATOP', 3),
 ('ALPHA_ATOP_PREMUL', 9),
 ('ALPHA_IN', 1),
 ('ALPHA_IN_PREMUL', 7),
 ('ALPHA_OUT', 2),
 ('ALPHA_OUT_PREMUL', 8),
 ('ALPHA_OVER', 0),
 ('ALPHA_OVER_PREMUL', 6),
 ('ALPHA_PLUS', 5),
 ('ALPHA_PLUS_PREMUL', 11),
 ('ALPHA_PREMUL', 12),
 ('ALPHA_XOR', 4),
 ('ALPHA_XOR_PREMUL', 10),
 ('BroxOpticalFlow_create', <function BroxOpticalFlow_create>),
 ('COLOR_BAYER_BG2BGR_MHT', 256),
 ('COLOR_BAYER_BG2GRAY_MHT', 260),
 ('COLOR_BAYER_BG2RGB_MHT', 258),
 ('COLOR_BAYER_GB2BGR_MHT', 257),
 ('COLOR_BAYER_GB2GRAY_MHT', 261),
 ('COLOR_BAYER_GB2RGB_MHT', 259),
 ('COLOR_BAYER_GR2BGR_MHT', 259),
 ('COLOR_BAYER_GR2GRAY_MHT', 263),
 ('COLOR_BAYER_GR2RGB_MHT', 257),
 ('COLOR_BAYER_RG2BGR_MHT', 258),
 ('COLOR_BAYER_RG2GRAY_MHT', 262),
 ('COLOR_BAYER_RG2RGB_MHT', 256),
 ('COLOR_BayerBG2BGR_MHT', 256),
 ('COLOR_BayerBG2GRAY_MHT', 260),
 ('COLOR_BayerBG2RGB_MHT', 258),
 ('COLOR_BayerGB2BGR_MHT', 257),
 ('COLOR_BayerGB2GRAY_MHT', 261),
 ('COLOR_BayerGB2RGB_MHT', 259),
 ('COLOR_BayerGR2BGR_MHT', 259),
 ('COLOR_BayerGR2GRAY_MHT', 263),
 ('COLOR_BayerGR2RGB_MHT', 257),
 ('COLOR_BayerRG2BGR_MHT', 258),
 ('COLOR_BayerRG2GRAY_MHT', 262),
 ('COLOR_BayerRG2RGB_MHT', 256),
 ('CascadeClassifier_create', <function CascadeClassifier_create>),
 ('DEVICE_INFO_COMPUTE_MODE_DEFAULT', 0),
 ('DEVICE_INFO_COMPUTE_MODE_EXCLUSIVE', 1),
 ('DEVICE_INFO_COMPUTE_MODE_EXCLUSIVE_PROCESS', 3),
 ('DEVICE_INFO_COMPUTE_MODE_PROHIBITED', 2),
 ('DYNAMIC_PARALLELISM', 35),
 ('DensePyrLKOpticalFlow_create', <function DensePyrLKOpticalFlow_create>),
 ('DescriptorMatcher_createBFMatcher',
  <function DescriptorMatcher_createBFMatcher>),
 ('DeviceInfo_ComputeModeDefault', 0),
 ('DeviceInfo_ComputeModeExclusive', 1),
 ('DeviceInfo_ComputeModeExclusiveProcess', 3),
 ('DeviceInfo_ComputeModeProhibited', 2),
 ('EVENT_BLOCKING_SYNC', 1),
 ('EVENT_DEFAULT', 0),
 ('EVENT_DISABLE_TIMING', 2),
 ('EVENT_INTERPROCESS', 4),
 ('Event_BLOCKING_SYNC', 1),
 ('Event_DEFAULT', 0),
 ('Event_DISABLE_TIMING', 2),
 ('Event_INTERPROCESS', 4),
 ('Event_elapsedTime', <function Event_elapsedTime>),
 ('FEATURE_SET_COMPUTE_10', 10),
 ('FEATURE_SET_COMPUTE_11', 11),
 ('FEATURE_SET_COMPUTE_12', 12),
 ('FEATURE_SET_COMPUTE_13', 13),
 ('FEATURE_SET_COMPUTE_20', 20),
 ('FEATURE_SET_COMPUTE_21', 21),
 ('FEATURE_SET_COMPUTE_30', 30),
 ('FEATURE_SET_COMPUTE_32', 32),
 ('FEATURE_SET_COMPUTE_35', 35),
 ('FEATURE_SET_COMPUTE_50', 50),
 ('FarnebackOpticalFlow_create', <function FarnebackOpticalFlow_create>),
 ('FastFeatureDetector_create', <function FastFeatureDetector_create>),
 ('GLOBAL_ATOMICS', 11),
 ('GpuMat_defaultAllocator', <function GpuMat_defaultAllocator>),
 ('GpuMat_setDefaultAllocator', <function GpuMat_setDefaultAllocator>),
 ('HOG_create', <function HOG_create>),
 ('HOST_MEM_PAGE_LOCKED', 1),
 ('HOST_MEM_SHARED', 2),
 ('HOST_MEM_WRITE_COMBINED', 4),
 ('HostMem_PAGE_LOCKED', 1),
 ('HostMem_SHARED', 2),
 ('HostMem_WRITE_COMBINED', 4),
 ('NATIVE_DOUBLE', 13),
 ('NVIDIA_OPTICAL_FLOW_1_0_NV_OF_PERF_LEVEL_FAST', 20),
 ('NVIDIA_OPTICAL_FLOW_1_0_NV_OF_PERF_LEVEL_MAX', 21),
 ('NVIDIA_OPTICAL_FLOW_1_0_NV_OF_PERF_LEVEL_MEDIUM', 10),
 ('NVIDIA_OPTICAL_FLOW_1_0_NV_OF_PERF_LEVEL_SLOW', 5),
 ('NVIDIA_OPTICAL_FLOW_1_0_NV_OF_PERF_LEVEL_UNDEFINED', 0),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_1', 1),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_2', 2),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_4', 4),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_8', 8),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_MAX', 9),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_UNDEFINED', 0),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_1', 1),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_2', 2),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_4', 4),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_MAX', 5),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_UNDEFINED', 0),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_PERF_LEVEL_FAST', 20),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_PERF_LEVEL_MAX', 21),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_PERF_LEVEL_MEDIUM', 10),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_PERF_LEVEL_SLOW', 5),
 ('NVIDIA_OPTICAL_FLOW_2_0_NV_OF_PERF_LEVEL_UNDEFINED', 0),
 ('NvidiaOpticalFlow_1_0_NV_OF_PERF_LEVEL_FAST', 20),
 ('NvidiaOpticalFlow_1_0_NV_OF_PERF_LEVEL_MAX', 21),
 ('NvidiaOpticalFlow_1_0_NV_OF_PERF_LEVEL_MEDIUM', 10),
 ('NvidiaOpticalFlow_1_0_NV_OF_PERF_LEVEL_SLOW', 5),
 ('NvidiaOpticalFlow_1_0_NV_OF_PERF_LEVEL_UNDEFINED', 0),
 ('NvidiaOpticalFlow_1_0_create', <function NvidiaOpticalFlow_1_0_create>),
 ('NvidiaOpticalFlow_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_1', 1),
 ('NvidiaOpticalFlow_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_2', 2),
 ('NvidiaOpticalFlow_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_4', 4),
 ('NvidiaOpticalFlow_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_8', 8),
 ('NvidiaOpticalFlow_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_MAX', 9),
 ('NvidiaOpticalFlow_2_0_NV_OF_HINT_VECTOR_GRID_SIZE_UNDEFINED', 0),
 ('NvidiaOpticalFlow_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_1', 1),
 ('NvidiaOpticalFlow_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_2', 2),
 ('NvidiaOpticalFlow_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_4', 4),
 ('NvidiaOpticalFlow_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_MAX', 5),
 ('NvidiaOpticalFlow_2_0_NV_OF_OUTPUT_VECTOR_GRID_SIZE_UNDEFINED', 0),
 ('NvidiaOpticalFlow_2_0_NV_OF_PERF_LEVEL_FAST', 20),
 ('NvidiaOpticalFlow_2_0_NV_OF_PERF_LEVEL_MAX', 21),
 ('NvidiaOpticalFlow_2_0_NV_OF_PERF_LEVEL_MEDIUM', 10),
 ('NvidiaOpticalFlow_2_0_NV_OF_PERF_LEVEL_SLOW', 5),
 ('NvidiaOpticalFlow_2_0_NV_OF_PERF_LEVEL_UNDEFINED', 0),
 ('NvidiaOpticalFlow_2_0_create', <function NvidiaOpticalFlow_2_0_create>),
 ('ORB_create', <function ORB_create>),
 ('OpticalFlowDual_TVL1_create', <function OpticalFlowDual_TVL1_create>),
 ('SHARED_ATOMICS', 12),
 ('SURF_CUDA_ANGLE_ROW', 5),
 ('SURF_CUDA_HESSIAN_ROW', 6),
 ('SURF_CUDA_LAPLACIAN_ROW', 2),
 ('SURF_CUDA_OCTAVE_ROW', 3),
 ('SURF_CUDA_ROWS_COUNT', 7),
 ('SURF_CUDA_SIZE_ROW', 4),
 ('SURF_CUDA_X_ROW', 0),
 ('SURF_CUDA_Y_ROW', 1),
 ('SURF_CUDA_create', <function SURF_CUDA_create>),
 ('SparsePyrLKOpticalFlow_create', <function SparsePyrLKOpticalFlow_create>),
 ('StereoBeliefPropagation_estimateRecommendedParams',
  <function StereoBeliefPropagation_estimateRecommendedParams>),
 ('StereoConstantSpaceBP_estimateRecommendedParams',
  <function StereoConstantSpaceBP_estimateRecommendedParams>),
 ('Stream_Null', <function Stream_Null>),
 ('TargetArchs_has', <function TargetArchs_has>),
 ('TargetArchs_hasBin', <function TargetArchs_hasBin>),
 ('TargetArchs_hasEqualOrGreater', <function TargetArchs_hasEqualOrGreater>),
 ('TargetArchs_hasEqualOrGreaterBin',
  <function TargetArchs_hasEqualOrGreaterBin>),
 ('TargetArchs_hasEqualOrGreaterPtx',
  <function TargetArchs_hasEqualOrGreaterPtx>),
 ('TargetArchs_hasEqualOrLessPtx', <function TargetArchs_hasEqualOrLessPtx>),
 ('TargetArchs_hasPtx', <function TargetArchs_hasPtx>),
 ('WARP_SHUFFLE_FUNCTIONS', 30),
 ('__doc__', None),
 ('__loader__', None),
 ('__name__', 'cv2.cuda'),
 ('__package__', None),
 ('__spec__', None),
 ('abs', <function abs>),
 ('absSum', <function absSum>),
 ('absdiff', <function absdiff>),
 ('add', <function add>),
 ('addWeighted', <function addWeighted>),
 ('alphaComp', <function alphaComp>),
 ('bilateralFilter', <function bilateralFilter>),
 ('bitwise_and', <function bitwise_and>),
 ('bitwise_not', <function bitwise_not>),
 ('bitwise_or', <function bitwise_or>),
 ('bitwise_xor', <function bitwise_xor>),
 ('blendLinear', <function blendLinear>),
 ('buildWarpAffineMaps', <function buildWarpAffineMaps>),
 ('buildWarpPerspectiveMaps', <function buildWarpPerspectiveMaps>),
 ('calcAbsSum', <function calcAbsSum>),
 ('calcHist', <function calcHist>),
 ('calcNorm', <function calcNorm>),
 ('calcNormDiff', <function calcNormDiff>),
 ('calcSqrSum', <function calcSqrSum>),
 ('calcSum', <function calcSum>),
 ('cartToPolar', <function cartToPolar>),
 ('compare', <function compare>),
 ('copyMakeBorder', <function copyMakeBorder>),
 ('countNonZero', <function countNonZero>),
 ('createBackgroundSubtractorMOG', <function createBackgroundSubtractorMOG>),
 ('createBackgroundSubtractorMOG2', <function createBackgroundSubtractorMOG2>),
 ('createBoxFilter', <function createBoxFilter>),
 ('createBoxMaxFilter', <function createBoxMaxFilter>),
 ('createBoxMinFilter', <function createBoxMinFilter>),
 ('createCLAHE', <function createCLAHE>),
 ('createCannyEdgeDetector', <function createCannyEdgeDetector>),
 ('createColumnSumFilter', <function createColumnSumFilter>),
 ('createContinuous', <function createContinuous>),
 ('createConvolution', <function createConvolution>),
 ('createDFT', <function createDFT>),
 ('createDerivFilter', <function createDerivFilter>),
 ('createDisparityBilateralFilter', <function createDisparityBilateralFilter>),
 ('createGaussianFilter', <function createGaussianFilter>),
 ('createGeneralizedHoughBallard', <function createGeneralizedHoughBallard>),
 ('createGeneralizedHoughGuil', <function createGeneralizedHoughGuil>),
 ('createGoodFeaturesToTrackDetector',
  <function createGoodFeaturesToTrackDetector>),
 ('createHarrisCorner', <function createHarrisCorner>),
 ('createHoughCirclesDetector', <function createHoughCirclesDetector>),
 ('createHoughLinesDetector', <function createHoughLinesDetector>),
 ('createHoughSegmentDetector', <function createHoughSegmentDetector>),
 ('createLaplacianFilter', <function createLaplacianFilter>),
 ('createLinearFilter', <function createLinearFilter>),
 ('createLookUpTable', <function createLookUpTable>),
 ('createMedianFilter', <function createMedianFilter>),
 ('createMinEigenValCorner', <function createMinEigenValCorner>),
 ('createMorphologyFilter', <function createMorphologyFilter>),
 ('createRowSumFilter', <function createRowSumFilter>),
 ('createScharrFilter', <function createScharrFilter>),
 ('createSeparableLinearFilter', <function createSeparableLinearFilter>),
 ('createSobelFilter', <function createSobelFilter>),
 ('createStereoBM', <function createStereoBM>),
 ('createStereoBeliefPropagation', <function createStereoBeliefPropagation>),
 ('createStereoConstantSpaceBP', <function createStereoConstantSpaceBP>),
 ('createStereoSGM', <function createStereoSGM>),
 ('createTemplateMatching', <function createTemplateMatching>),
 ('cvtColor', <function cvtColor>),
 ('demosaicing', <function demosaicing>),
 ('dft', <function dft>),
 ('divide', <function divide>),
 ('drawColorDisp', <function drawColorDisp>),
 ('ensureSizeIsEnough', <function ensureSizeIsEnough>),
 ('equalizeHist', <function equalizeHist>),
 ('evenLevels', <function evenLevels>),
 ('exp', <function exp>),
 ('findMinMax', <function findMinMax>),
 ('findMinMaxLoc', <function findMinMaxLoc>),
 ('flip', <function flip>),
 ('gammaCorrection', <function gammaCorrection>),
 ('gemm', <function gemm>),
 ('getCudaEnabledDeviceCount', <function getCudaEnabledDeviceCount>),
 ('getDevice', <function getDevice>),
 ('histEven', <function histEven>),
 ('histRange', <function histRange>),
 ('inRange', <function inRange>),
 ('integral', <function integral>),
 ('log', <function log>),
 ('lshift', <function lshift>),
 ('magnitude', <function magnitude>),
 ('magnitudeSqr', <function magnitudeSqr>),
 ('max', <function max>),
 ('meanShiftFiltering', <function meanShiftFiltering>),
 ('meanShiftProc', <function meanShiftProc>),
 ('meanShiftSegmentation', <function meanShiftSegmentation>),
 ('meanStdDev', <function meanStdDev>),
 ('merge', <function merge>),
 ('min', <function min>),
 ('minMax', <function minMax>),
 ('minMaxLoc', <function minMaxLoc>),
 ('mulAndScaleSpectrums', <function mulAndScaleSpectrums>),
 ('mulSpectrums', <function mulSpectrums>),
 ('multiply', <function multiply>),
 ('norm', <function norm>),
 ('normalize', <function normalize>),
 ('phase', <function phase>),
 ('polarToCart', <function polarToCart>),
 ('pow', <function pow>),
 ('printCudaDeviceInfo', <function printCudaDeviceInfo>),
 ('printShortCudaDeviceInfo', <function printShortCudaDeviceInfo>),
 ('pyrDown', <function pyrDown>),
 ('pyrUp', <function pyrUp>),
 ('rectStdDev', <function rectStdDev>),
 ('reduce', <function reduce>),
 ('registerPageLocked', <function registerPageLocked>),
 ('remap', <function remap>),
 ('reprojectImageTo3D', <function reprojectImageTo3D>),
 ('resetDevice', <function resetDevice>),
 ('resize', <function resize>),
 ('rotate', <function rotate>),
 ('rshift', <function rshift>),
 ('setBufferPoolConfig', <function setBufferPoolConfig>),
 ('setBufferPoolUsage', <function setBufferPoolUsage>),
 ('setDevice', <function setDevice>),
 ('split', <function split>),
 ('sqr', <function sqr>),
 ('sqrIntegral', <function sqrIntegral>),
 ('sqrSum', <function sqrSum>),
 ('sqrt', <function sqrt>),
 ('subtract', <function subtract>),
 ('sum', <function sum>),
 ('threshold', <function threshold>),
 ('transpose', <function transpose>),
 ('unregisterPageLocked', <function unregisterPageLocked>),
 ('warpAffine', <function warpAffine>),
 ('warpPerspective', <function warpPerspective>)

Since the purpose of this page is to introduce interoperability with other Python libraries that support the CUDA Array Interface, we’re not going to look at any of the OpenCV CUDA modules in detail. Our main focus should be on cv2.cuda_GpuMat, which should not be mistaken for cv2.UMat. In general, there exists cv2.cuda_GpuMat_Allocator as well. However, this one would be used to allocate an array on a CUDA device and upload data from host later.

But let’s start simple and upload an image from host to device memory:

import numpy as np
import cv2


img = np.random.randint(0, # min value (left closed interval)
                        256, # upper bound, exclusive (right open interval)
                        (720,1280,3), # dimension (HWC for OpenCV)
                        dtype=np.uint8 # 8 bit image
                        )
# automatic allocation on GPU (VRAM)
img_cv2_cu = cv2.cuda_GpuMat(img)

# do something with img_cv2_cu
# ...

# return the image to host
img_host = img_cv2_cu.download() 

This process is not free. It takes a certain amount of time to allocate memory and upload the image as well as to move it back to the host. Benchmarks are shown in the “Copy Data from Host to Device and Back” section.

NB!: Using the C++ API, we can display a cv::cuda::GpuMat directly with cv::imshow iff OpenCV is compiled with OpenGL support. I haven’t tested this with the Python API. It works by first initializing a named window using the cv::WINDOW_OPENGL WindowFlag.

CUDA Array Interface

The CUDA Array Interface is a standardized interface similar to NumPy’s array interface. We’re not going to dive into this and simply use the (incomplete) example provided in the pull request.

# from https://github.com/opencv/opencv/pull/16513#issue-371438498

class CudaArrayInterface:
    def __init__(self, gpu_mat):
        w, h = gpu_mat.size()
        type_map = {
            cv2.CV_8U: "u1", cv2.CV_8S: "i1",
            cv2.CV_16U: "u2", cv2.CV_16S: "i2",
            cv2.CV_32S: "i4",
            cv2.CV_32F: "f4", cv2.CV_64F: "f8",
        }
        self.__cuda_array_interface__ = {
            "version": 2,
            "shape": (h, w),
            "data": (gpu_mat.cudaPtr(), False),
            "typestr": type_map[gpu_mat.type()],
            "strides": (gpu_mat.step, gpu_mat.elemSize()),
        }

It is important to note that OpenCV may use a different memory layout than expected by some of the libraries a cv2.cuda_GpuMat is accessed from. It works best with 2D slices ;)
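
The example class only covers single-channel matrices. A possible extension to multi-channel images is sketched below as a plain function that builds the interface dictionary. `build_cuda_array_interface` is a hypothetical helper, the depth codes are the OpenCV values of cv2.CV_8U to cv2.CV_64F, the byte-order prefix assumes a little-endian host, and the pointer and step in the example call are made-up values.

```python
# OpenCV depth codes (cv2.CV_8U ... cv2.CV_64F) mapped to typestr values.
DEPTH_TO_TYPESTR = {
    0: "|u1", 1: "|i1",
    2: "<u2", 3: "<i2",
    4: "<i4",
    5: "<f4", 6: "<f8",
}


def build_cuda_array_interface(ptr, rows, cols, channels, depth, step):
    """Describe an interleaved (HWC) device image; step is the row pitch in bytes."""
    typestr = DEPTH_TO_TYPESTR[depth]
    itemsize = int(typestr[2:])  # bytes per channel value
    if channels == 1:
        shape = (rows, cols)
        strides = (step, itemsize)
    else:
        shape = (rows, cols, channels)
        strides = (step, channels * itemsize, itemsize)
    return {
        "version": 2,
        "shape": shape,
        "typestr": typestr,
        "data": (ptr, False),  # (device pointer, read-only flag)
        "strides": strides,
    }


# Hypothetical 720p, 3-channel, 8-bit image with a row pitch of 4096 bytes.
iface = build_cuda_array_interface(ptr=0x7F0000000000, rows=720, cols=1280,
                                   channels=3, depth=0, step=4096)
print(iface)
```

For a real cv2.cuda_GpuMat the arguments would presumably come from cudaPtr(), size() (which returns width and height in that order), channels(), depth() and step; this has not been verified against every OpenCV version.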

NB!: This seems to be a one-way street. If we have to move data back and e.g. we can’t manipulate the original array for any reason, we have to allocate a suitable cv2.cuda_GpuMat first (e.g. with img_cu_2 = img_cu.clone()) and change the data using e.g. CuPy. I haven’t managed to utilize cv2.cuda_GpuMat_Allocator for this using OpenCV 4.6.0. This approach would look a bit like this:

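    import cupy as cp
    import cv2
    import numpy as np

    img = cp.random.random((3,3)).astype(cp.float32)
    img_out = np.zeros((3,3)).astype(np.float32)
    img_out_gpumat = cv2.cuda_GpuMat(img_out)  # allocate the target on the device

    print(img_out_gpumat.download())  # all zeros

    img_out_cp = cp.asarray(CudaArrayInterface(img_out_gpumat))  # view on the GpuMat memory
    img_out_cp[:,:] = img[:,:]  # device-to-device copy into the GpuMat

    print(img_out_gpumat.download())  # now holds the values of img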

With C++ it is possible to access existing CUDA arrays with OpenCV or, to be more exact, to use an OpenCV class wrapper with existing CUDA arrays using the CUDA pointer, as shown in the libtorch example above. If OpenCV’s Python API were 100% compliant with its C++ API, the following example would work, but it does not.

import cupy as cp
import cv2
import numpy as np

img = np.zeros((5,5),dtype=np.uint8)
img_cp = cp.asarray(img)
img_cp[0:3,:] = 5
img_cv2_cu = cv2.cuda_GpuMat(img_cp.__cuda_array_interface__['shape'],
                             cv2.CV_8U, # or cv2.CV_8UC1
                             img_cp.__cuda_array_interface__['data'][0])

print('CuPy Array')
print(img_cp)
print('CuPy CUDA Pointer:', img_cp.__cuda_array_interface__['data'][0])
print()
print('cv2.GpuMat')
print(img_cv2_cu.download())
print(f'cv2.cuda_GpuMat CUDA Pointer: {img_cv2_cu.cudaPtr()}')

CuPy Array
[[5 5 5 5 5]
 [5 5 5 5 5]
 [5 5 5 5 5]
 [0 0 0 0 0]
 [0 0 0 0 0]]
CuPy CUDA Pointer: 139633457364992

cv2.GpuMat
[[0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]
 [0 0 0 0 0]]
cv2.cuda_GpuMat CUDA Pointer: 139633457365504

Update: Once this pull request is merged, it won’t be a one-way street anymore and it will be possible to create a new GpuMat that references an existing CUDA pointer.

NB!: When analyzing a computer vision pipeline properly, it might not be necessary to copy processed data back to a cv2.cuda_GpuMat.

Integration with CuPy

NB!: RAPIDS, the CUDA equivalent of (perhaps) SciPy, pandas and scikit-learn, seems to use CuPy as their basis and so does Chainer. Therefore, it is very practical that we can access cv2.cuda_GpuMat via the CudaArrayInterface.

import cupy as cp
import cv2
import numpy as np

img_np = np.random.randint(0,256,(6,6), dtype=np.uint8)
img_cv2_cu = cv2.cuda_GpuMat(img_np)
img_cp = cp.asarray(CudaArrayInterface(img_cv2_cu))
img_cp[0:3] = 2

print('Original NumPy Array')
print(img_np)
print()
print('CuPy Array')
print(img_cp)
print('CuPy CUDA Pointer:', img_cp.__cuda_array_interface__['data'][0])
print()
print('cv2.GpuMat')
print(img_cv2_cu.download())
print(f'cv2.cuda_GpuMat CUDA Pointer: {img_cv2_cu.cudaPtr()}')

Original NumPy Array
[[123  84  54 129 126 218]
 [216 128 207 152 153 217]
 [ 81  36 164 172 220 211]
 [176 167 139  38 236  59]
 [ 30  70 186 251  90  93]
 [ 41 138   7  30  43 196]]

CuPy Array
[[  2   2   2   2   2   2]
 [  2   2   2   2   2   2]
 [  2   2   2   2   2   2]
 [176 167 139  38 236  59]
 [ 30  70 186 251  90  93]
 [ 41 138   7  30  43 196]]
CuPy CUDA Pointer: 139633457371648

cv2.GpuMat
[[  2   2   2   2   2   2]
 [  2   2   2   2   2   2]
 [  2   2   2   2   2   2]
 [176 167 139  38 236  59]
 [ 30  70 186 251  90  93]
 [ 41 138   7  30  43 196]]
cv2.cuda_GpuMat CUDA Pointer: 139633457371648

A little benchmark using %timeit shows that this operation takes 9.56 μs (+/- 33.2 ns) for an image of size (3000,3000). If we used cp.asarray on a NumPy array of the same size, it would require 1.18 ms (+/- 8.33 μs). Therefore, we can clearly see the advantage of making a cv2.cuda_GpuMat accessible via the CudaArrayInterface. And of course, we would have to account for the download (device to host) time as well if we were to process the image further.

Integration with Numba

Since I stopped using Numba some time ago, I’m not going to cover this in detail here. Please have a look at the Numba: writing CUDA kernels docs for further information. As the CUDA Array Interface specification states, I would assume that numba.cuda.from_cuda_array_interface does the job of making it available to a custom Numba CUDA kernel.

Integration with PyCUDA

Interoperability with PyCUDA is important for two reasons:

a) running custom kernels

b) using tensorrt without unnecessary device <-> host copies

If PyCUDA does not recognize an object, it will try to get its pointer from a gpudata attribute. This would look like this:

# from: https://gist.github.com/gmarkall/09e4cbfe6fda4f7a35be07c8dd926f36

import collections
from pycuda.compiler import SourceModule

pycuda_wrapper = collections.namedtuple('pycuda_wrapper', ('gpudata',))

pycuda_img = pycuda_wrapper(CudaArrayInterface(img_cv2_cu).__cuda_array_interface__['data'][0])

mod = SourceModule("""SOME CUDA KERNEL""")
func = mod.get_function("SOME_CUDA_KERNEL_NAME")
# applying the kernel on the GpuMat via the cuda pointer
func(pycuda_img, ......)

Integration with deep learning frameworks

Some say you could simply use DLPack for interoperability. Since I absolutely disagree with having more dependencies than necessary, I prefer to write a few more lines of code and not have to worry about dependency issues after upgrades or the software quality of another third-party library.

Accessing GpuMat as PyTorch Tensor

PyTorch supports accessing (non-sparse/dense) arrays using the CudaArrayInterface. The procedure is straightforward:

import torch
img_pytorch_cu = torch.tensor(CudaArrayInterface(img_cv2_cu), device="cuda")

# or
img_pytorch_cu = torch.as_tensor(CudaArrayInterface(img_cv2_cu), device="cuda")

NB!: using torch.as_tensor is significantly faster (17.7 +/- 0.08 μs) than using torch.tensor (90 +/- 0.2 μs).

Accessing GpuMat as TensorFlow Tensor

Unfortunately, TensorFlow (Python API, for C++ see above) supports this only for XLA. There is an ongoing discussion about implementing this feature, but it does not seem to be leading anywhere.

Practical Notes

Some practical notes that are independent of the chosen programming language ({C++, Python}).

Blocking vs. Async (Non-Blocking) Calls

Some algorithms can optionally run in async mode, others require it. For instance, some of the background subtraction algorithms require a cv::cuda::Stream to be available.

bgs_mog2 = cv2.cuda.createBackgroundSubtractorMOG2()
cuda_stream_0 = cv2.cuda_Stream()

mask_mog2 = bgs_mog2.apply(frame, -1.0, cuda_stream_0)
cuda_stream_0.waitForCompletion()

To ensure that this process is finished (if needed), we have to wait for completion.

Copy Data from Host to Device and Back

Let’s have a look at some benchmark results (average of 1000 runs, CV_8UC3 (8 bit unsigned integer, 3 color channels)):

[Figure: cv::cuda::GpuMat Upload/Download Benchmarks]

We can observe some clear C++ performance advantages over the Python API.

NB!: logarithmic scales with base2 and base10 are used in the graph above.

Converting Data Types

OpenCV provides some form of rudimentary support for using CUDA. However, not every function is compatible with every data type (pixel type). So, what is a data type conversion with respect to OpenCV? A simple example would be the conversion of an 8 bit (unsigned) integer array (cv2.CV_8UC1 - integer code 0) to 32 bit floating point (cv2.CV_32FC1 - integer code 5).

1.) We need an image. We could load a grayscale image or generate an image using a random number generator:

img = cv2.imread('filename.ext', cv2.IMREAD_GRAYSCALE)
# or
img = np.random.randint(0, 256, (720,1280), dtype=np.uint8)

2.) We have to move the array from host to device:

img_cu = cv2.cuda_GpuMat(img)

3.) The cv2.cuda_GpuMat class (data type) has a function named .convertTo() that can be used for pixel type conversion. It takes two inputs: the target pixel type (cv2.CV_...) and a CUDA-allocated array (type cv2.cuda_GpuMat) which has to have the right dimensions. If we simply want to change the pixel type and not create a copy in a different variable, we can do it as follows:

img_cu = img_cu.convertTo(cv2.CV_32FC1, img_cu)

NB!: this function changes the pixel type but not the number of channels. Using

img_cu = img_cu.convertTo(cv2.CV_32FC3, img_cu)

will convert the array to cv2.CV_32FC1. It is not a color space conversion tool.
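
Why the channel count is ignored becomes clearer when looking at how OpenCV encodes pixel types as integers. The sketch below re-implements that encoding in pure Python under the assumption of the OpenCV 4.x scheme (three bits for the depth, channel count stored above them). The helper names are hypothetical and `convert_to_result_type` only models the behaviour described above; it is not an OpenCV function.

```python
CV_CN_SHIFT = 3                         # number of bits reserved for the depth (OpenCV 4.x)
CV_DEPTH_MASK = (1 << CV_CN_SHIFT) - 1  # 0b111

CV_8U, CV_32F = 0, 5                    # depth codes


def make_type(depth, channels):
    """Combine depth and channel count into a type code such as CV_32FC3."""
    return (depth & CV_DEPTH_MASK) + ((channels - 1) << CV_CN_SHIFT)


def depth_of(type_code):
    return type_code & CV_DEPTH_MASK


def channels_of(type_code):
    return (type_code >> CV_CN_SHIFT) + 1


def convert_to_result_type(src_type, rtype):
    # Only the depth of the requested type is used; the channel count of the
    # source is kept.
    return make_type(depth_of(rtype), channels_of(src_type))


CV_8UC1 = make_type(CV_8U, 1)
CV_32FC1 = make_type(CV_32F, 1)
CV_32FC3 = make_type(CV_32F, 3)

print(CV_8UC1, CV_32FC1, CV_32FC3)                # 0 5 21
print(convert_to_result_type(CV_8UC1, CV_32FC3))  # 5, i.e. CV_32FC1
```

Changing the number of channels is a separate operation, for example a color space conversion with cv2.cuda.cvtColor or a merge of several single-channel matrices with cv2.cuda.merge.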