What is CUDA? Parallel Programming for GPUs

CUDA is a parallel computing platform and programming model developed by NVIDIA for general computing on its own GPUs (graphics processing units). CUDA lets developers speed up compute-intensive applications by using the GPU for the parallelizable part of the computation.

Although there are other proposed GPU APIs, such as OpenCL, and there are competing GPUs from other companies, such as AMD, the combination of CUDA and NVIDIA GPUs dominates several application areas, including deep learning, and is the basis for some of the world’s fastest computers.

Graphics cards are arguably as old as the PC, at least if you consider the 1981 IBM Monochrome Display Adapter to be a graphics card. By 1988, you could get a 16-bit 2D VGA Wonder card from ATI (the company was eventually acquired by AMD). By 1996, you could buy a 3D graphics accelerator from 3dfx to run the first-person shooter Quake at full speed.

Also in 1996, NVIDIA began trying to compete in the 3D accelerator market with weak products, but learned in the process and in 1999 introduced the successful GeForce 256, the first graphics card called a GPU. At the time, the main reason for having a GPU was for gaming. It wasn’t until later that people used GPUs for math, science, and engineering.

The origins of CUDA

In 2003, a team of researchers led by Ian Buck unveiled Brook, the first widely adopted programming model for extending C with data-parallel constructs. Buck later joined NVIDIA and led the launch of CUDA in 2006, the first commercial GPU-based general-purpose computing solution.

OpenCL vs. CUDA

CUDA competitor OpenCL was launched in 2009 in an effort to provide a standard for heterogeneous computing that is not limited to Intel/AMD CPUs with NVIDIA GPUs. Although OpenCL sounds attractive because of its generality, it doesn’t perform as well as CUDA on NVIDIA GPUs, and many deep learning frameworks either don’t support OpenCL or support it only as an afterthought once their CUDA support has been released.

CUDA performance boost

CUDA has improved and broadened its scope over the years, more or less in step with improved NVIDIA GPUs. By using multiple P100 server GPUs, you can achieve up to 50x performance improvements over CPUs. The V100 (not shown in this figure) is another 3x faster for some workloads (so up to 150x CPUs), and the A100 (also not shown) is another 2x faster (up to 300x CPUs). The previous generation of server GPUs, the K80, offered 5x to 12x performance improvements over CPUs.

Note that not everyone reports the same speed boosts, and that the software for training models on CPUs has also improved, for example through the Intel Math Kernel Library. In addition, CPUs themselves have improved, mostly by offering more cores.


The speed boost from GPUs has come at the perfect time for high-performance computing. The single-threaded performance of CPUs, which Moore’s Law suggested would double every 18 months, now increases by only about 10% per year as chip makers hit physical limits, including size limits on chip mask resolution and chip yield during manufacturing, and heat limits on clock frequencies at runtime.


CUDA application domains


CUDA and NVIDIA GPUs have been adopted in many fields that need high floating-point computing performance, as summarized in the image above. A more comprehensive list includes:

  1. Computational finance
  2. Climate, Weather and Ocean Modelling
  3. Data Science and Analytics
  4. Deep learning and machine learning
  5. Defense and Intelligence
  6. Manufacturing/AEC (Architecture, Engineering and Construction): CAD and CAE (including Computational Fluid Dynamics, Computational Structural Mechanics, Design and Visualization and Electronic Design Automation)
  7. Media and Entertainment (including animation, modeling and rendering; color correction and grain management; compositing; finishing and effects; editing; encoding and digital distribution; on-air graphics; photography, review and stereo tools; and weather graphics)
  8. Medical imaging
  9. Oil and gas
  10. Research: Higher Education and Supercomputing (including Computational Chemistry and Biology, Numerical Analysis, Physics and Scientific Visualization)
  11. Safety and security
  12. Tools and management

CUDA in deep learning

Deep learning has an outsized need for computing speed. For example, to train the models for Google Translate in 2016, the Google Brain and Google Translate teams did hundreds of one-week TensorFlow runs using GPUs; they had bought 2,000 server-grade GPUs from NVIDIA for the purpose. Without GPUs, those training runs would have taken months rather than a week to converge. For production deployment of those TensorFlow translation models, Google used a new custom processing chip, the TPU (tensor processing unit).

In addition to TensorFlow, many other deep learning frameworks rely on CUDA for their GPU support, including Caffe2, Chainer, Databricks, Keras, MATLAB, MXNet, PyTorch, Theano, and Torch. In most cases, they use cuDNN, a library for deep neural network computations. This library is so important to training in deep learning frameworks that all frameworks using a given version of cuDNN have essentially the same performance numbers for equivalent use cases. As CUDA and cuDNN improve from version to version, all deep learning frameworks that update to the new version see performance gains. Where performance tends to differ from framework to framework is in how well they scale to multiple GPUs and multiple nodes.

CUDA Toolkit

The CUDA Toolkit includes libraries, debugging and optimization tools, a compiler, documentation, and a runtime library for deploying your applications. It has components that support deep learning, linear algebra, signal processing and parallel algorithms.

In general, the CUDA libraries support all NVIDIA GPU families, but perform best on the latest generation, such as the V100, which can be 3x faster than the P100 for deep learning workloads; the A100 can add a further 2x speedup. Using one or more libraries is the easiest way to take advantage of GPUs, as long as the algorithms you need have been implemented in the appropriate library.


CUDA Deep Learning Libraries

In the field of deep learning, there are three main GPU-accelerated libraries: cuDNN, which I mentioned earlier as the GPU component for most open source deep learning frameworks; TensorRT, which is NVIDIA’s high-performance deep learning inference optimizer and runtime; and DeepStream, a video inference library. TensorRT helps you optimize neural network models, calibrate for lower precision with high accuracy, and deploy trained models in hyperscale data centers, embedded systems, or automotive product platforms.


CUDA linear algebra and math libraries

Linear algebra is the basis of tensor computing and therefore of deep learning. BLAS (Basic Linear Algebra Subprograms), a collection of matrix algorithms implemented in Fortran in 1989, has been used by scientists and engineers ever since. cuBLAS is a GPU-accelerated version of BLAS and the most efficient way to do matrix arithmetic with the GPU. cuBLAS assumes that matrices are dense; cuSPARSE handles sparse matrices.
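
As a rough illustration of what calling cuBLAS looks like, here is a minimal sketch that multiplies two 2x2 matrices with cublasSgemm. The helper name gemm2x2 and the sample values are assumptions made for this example; cuBLAS expects column-major storage, which is why the arrays look transposed. It should build with something like nvcc gemm.cu -lcublas.

```cuda
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

// Hypothetical helper: computes C = A * B for 2x2 column-major matrices on the GPU.
// Returns false if any CUDA or cuBLAS call fails.
bool gemm2x2(const float *A, const float *B, float *C)
{
    const int n = 2;
    const size_t bytes = n * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;

    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return false;

    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess;
    if (ok) {
        // Copy the inputs from host memory to device memory.
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        // C = alpha * A * B + beta * C, with no transposition of A or B.
        const float alpha = 1.0f, beta = 0.0f;
        ok = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                         &alpha, dA, n, dB, n, &beta, dC, n) == CUBLAS_STATUS_SUCCESS;

        // This blocking copy also waits for the multiplication to finish.
        ok = ok && cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    cudaFree(dA);  // cudaFree(nullptr) is a harmless no-op
    cudaFree(dB);
    cudaFree(dC);
    cublasDestroy(handle);
    return ok;
}

int main()
{
    // Column-major storage: A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
    const float A[4] = {1.0f, 3.0f, 2.0f, 4.0f};
    const float B[4] = {5.0f, 7.0f, 6.0f, 8.0f};
    float C[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    if (!gemm2x2(A, B, C)) {
        printf("cuBLAS example failed (is a CUDA device available?)\n");
        return 1;
    }
    // Print C row by row from its column-major layout.
    printf("%g %g\n%g %g\n", C[0], C[2], C[1], C[3]);
    return 0;
}
```

The explicit cudaMalloc and cudaMemcpy calls are one way to move data; for matrices this small the transfer cost outweighs any gain, so the example is only meant to show the call sequence.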


CUDA libraries for signal processing

The Fast Fourier Transform (FFT) is one of the main algorithms used for signal processing; it converts a signal (such as an audio waveform) into a spectrum of frequencies. cuFFT is a GPU-accelerated FFT.
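
To make this concrete, the following sketch runs a forward complex-to-complex transform with cuFFT on a short constant signal. The helper name fft_forward and the eight-sample input are assumptions for illustration only. It should build with something like nvcc fft.cu -lcufft.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical helper: forward FFT of n complex samples, computed on the GPU.
// Returns false if any CUDA or cuFFT call fails.
bool fft_forward(const cufftComplex *in, cufftComplex *out, int n)
{
    const size_t bytes = n * sizeof(cufftComplex);
    cufftComplex *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d, in, bytes, cudaMemcpyHostToDevice) == cudaSuccess;

    cufftHandle plan;
    if (ok && cufftPlan1d(&plan, n, CUFFT_C2C, 1) == CUFFT_SUCCESS) {
        // In-place transform: input and output use the same device buffer.
        ok = cufftExecC2C(plan, d, d, CUFFT_FORWARD) == CUFFT_SUCCESS;
        cufftDestroy(plan);
    } else {
        ok = false;
    }

    // This blocking copy also waits for the transform to finish.
    ok = ok && cudaMemcpy(out, d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);
    return ok;
}

int main()
{
    const int n = 8;
    cufftComplex signal[n], spectrum[n];
    for (int i = 0; i < n; i++) {
        signal[i].x = 1.0f;  // real part: a constant signal
        signal[i].y = 0.0f;  // imaginary part
    }

    if (!fft_forward(signal, spectrum, n)) {
        printf("cuFFT example failed (is a CUDA device available?)\n");
        return 1;
    }
    // A constant signal has all of its energy in frequency bin 0.
    for (int i = 0; i < n; i++)
        printf("bin %d: %g + %gi\n", i, spectrum[i].x, spectrum[i].y);
    return 0;
}
```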

Codecs using standards such as H.264 encode/compress and decode/decompress video for transmission and display. The NVIDIA Video Codec SDK accelerates this process with the GPU.


CUDA parallel algorithm libraries

The three libraries for parallel algorithms have different purposes. NCCL (NVIDIA Collective Communications Library) is for scaling applications across multiple GPUs and nodes; nvGRAPH is for parallel graph analysis; and Thrust is a C++ template library for CUDA based on the C++ Standard Template Library. Thrust provides a rich collection of data-parallel primitives such as scan, sort, and reduce.
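
A short sketch of the Thrust primitives named above (sort, reduce, and scan) may help show how closely the library follows STL conventions. The helper name sort_and_sum and the sample data are assumptions for this example; it should build with a plain nvcc thrust_demo.cu.

```cuda
#include <cstdio>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/scan.h>
#include <thrust/sort.h>

// Hypothetical helper: sorts a copy of the input on the GPU, writes the sorted
// values to sorted_out (host memory), and returns the sum of the elements.
int sort_and_sum(const int *data, int n, int *sorted_out)
{
    thrust::device_vector<int> d(data, data + n);   // copies host data to the GPU
    thrust::sort(d.begin(), d.end());               // parallel sort on the device
    thrust::copy(d.begin(), d.end(), sorted_out);   // device to host
    return thrust::reduce(d.begin(), d.end(), 0);   // parallel sum
}

int main()
{
    const int values[5] = {5, 3, 8, 1, 4};
    int sorted[5];
    int total = sort_and_sum(values, 5, sorted);
    printf("sum = %d, smallest = %d, largest = %d\n", total, sorted[0], sorted[4]);

    // Inclusive scan: each output element is the running total so far.
    thrust::device_vector<int> d(sorted, sorted + 5), prefix(5);
    thrust::inclusive_scan(d.begin(), d.end(), prefix.begin());
    thrust::host_vector<int> h = prefix;            // copy back to the host
    for (int i = 0; i < 5; i++)
        printf("prefix[%d] = %d\n", i, h[i]);
    return 0;
}
```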


CUDA vs CPU performance

In some cases, you can use drop-in CUDA functions instead of the equivalent CPU functions. For example, the GEMM matrix-multiplication routines from BLAS can be replaced by GPU versions simply by linking to the NVBLAS library.


Fundamentals of CUDA Programming

If you can’t find CUDA library routines to speed up your programs, you’ll need to try your hand at low-level CUDA programming. It’s much easier now than it was when I first tried it in the late 2000s. Among other reasons, there is an easier syntax and better development tools are available.

My only gripe is that macOS support for running CUDA has disappeared, after a long descent into obsolescence. The most you can do on macOS is control debugging and profiling sessions running on Linux or Windows.

To understand CUDA programming, take a look at this simple C/C++ routine to add two arrays:

void add(int n, float *x, float *y)
{
       for (int i = 0; i < n; i++)
             y[i] = x[i] + y[i];
}

You can turn it into a kernel that will run on the GPU by adding the __global__ keyword to the declaration, and call the kernel by using the triple angle bracket syntax:

add<<<1, 1>>>(N, x, y);

You also have to change your malloc/new and free/delete calls to cudaMallocManaged and cudaFree so that you allocate memory the GPU can access. Finally, you need to wait for the GPU calculation to finish before using the results on the CPU, which you can accomplish with cudaDeviceSynchronize.
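
Putting those pieces together, a complete program might look like the sketch below. The helper run_add, the array size, and the fill values 1.0f and 2.0f are assumptions chosen for illustration; the CUDA calls are the ones named in the text. It should build with something like nvcc add.cu -o add.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the GPU. Launched as <<<1, 1>>>, a single thread does all the work.
__global__ void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

// Hypothetical helper: adds two n-element arrays on the GPU and returns the
// largest deviation from the expected value 3.0f, or -1.0f if allocation fails.
float run_add(int n)
{
    float *x = nullptr, *y = nullptr;

    // Unified memory: accessible from both the CPU and the GPU.
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return -1.0f;
    if (cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        cudaFree(x);
        return -1.0f;
    }

    // Initialize the arrays on the host.
    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    add<<<1, 1>>>(n, x, y);    // one block containing one thread
    cudaDeviceSynchronize();   // wait for the GPU before reading y on the CPU

    float max_error = 0.0f;
    for (int i = 0; i < n; i++)
        max_error = fmaxf(max_error, fabsf(y[i] - 3.0f));

    cudaFree(x);
    cudaFree(y);
    return max_error;
}

int main()
{
    printf("Max error: %f\n", run_add(1 << 20));
    return 0;
}
```

With a single thread this version is no faster than the CPU loop; it only shows the minimum set of changes needed to move the computation onto the GPU.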

The triple angle brackets above launch one thread block containing one thread. Current NVIDIA GPUs can handle many blocks and threads. For example, a Tesla P100 GPU based on the Pascal GPU architecture has 56 streaming multiprocessors (SMs), each of which can support up to 2048 active threads.

The kernel code will need to know its block and thread index to find its offset into the passed arrays. A parallelized kernel often uses a grid-stride loop, such as the following:
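
Those two figures multiply out to 56 x 2048 = 114,688 threads that can be resident on a P100 at once. The sketch below does that arithmetic and then, as an assumption about how you might check your own hardware, queries the same two properties for device 0 through cudaGetDeviceProperties. The helper name max_resident_threads is made up for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Upper bound on simultaneously resident threads: SM count times threads per SM.
int max_resident_threads(int sm_count, int threads_per_sm)
{
    return sm_count * threads_per_sm;
}

int main()
{
    // Figures quoted in the text for the Tesla P100.
    printf("Tesla P100: %d resident threads\n", max_resident_threads(56, 2048));

    // The same calculation for whatever GPU is installed as device 0.
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        printf("%s: %d SMs x %d threads per SM = %d resident threads\n",
               prop.name, prop.multiProcessorCount, prop.maxThreadsPerMultiProcessor,
               max_resident_threads(prop.multiProcessorCount,
                                    prop.maxThreadsPerMultiProcessor));
    } else {
        printf("No CUDA device found\n");
    }
    return 0;
}
```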

__global__ void add(int n, float *x, float *y)
{
   int index = blockIdx.x * blockDim.x + threadIdx.x;
   int stride = blockDim.x * gridDim.x;
   for (int i = index; i < n; i += stride)
     y[i] = x[i] + y[i];
}
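
To launch a kernel like this across many threads, you choose a block size and compute how many blocks are needed to cover the array. The sketch below assumes a block size of 256 threads and uses a hypothetical helper, blocks_for, to round the block count up; both are illustrative choices rather than requirements.

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_runtime.h>

// Grid-stride kernel: each thread starts at its global index and steps by the
// total number of threads in the grid, so any grid size covers all n elements.
__global__ void add(int n, float *x, float *y)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}

// Hypothetical helper: number of blocks needed to cover n elements, rounded up.
int blocks_for(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

int main()
{
    const int n = 1 << 20;
    const int block_size = 256;  // assumed value; a multiple of 32 is typical
    float *x = nullptr, *y = nullptr;

    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        printf("Allocation failed (is a CUDA device available?)\n");
        cudaFree(x);
        return 1;
    }
    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    add<<<blocks_for(n, block_size), block_size>>>(n, x, y);
    cudaDeviceSynchronize();  // wait for the GPU before checking results

    float max_error = 0.0f;
    for (int i = 0; i < n; i++)
        max_error = fmaxf(max_error, fabsf(y[i] - 3.0f));
    printf("Blocks: %d, max error: %f\n", blocks_for(n, block_size), max_error);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Because of the stride, the kernel still produces correct results if you launch fewer blocks than blocks_for returns; each thread simply processes more elements.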

If you look at the samples in the CUDA Toolkit, you’ll see that there is more to consider than the basics I’ve covered above. For example, some CUDA function calls need to be wrapped in checkCudaErrors() calls. Also, in many cases the fastest code will use libraries such as cuBLAS along with allocations of host and device memory and copying of matrices back and forth.

In summary, you can accelerate your applications with GPUs at many levels. You can write CUDA code, you can call CUDA libraries, and you can use applications that already support CUDA.
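
The idea behind that kind of wrapper is simple: every CUDA runtime call returns a cudaError_t, and the wrapper reports anything other than cudaSuccess. The CUDA_CHECK macro below is a hypothetical stand-in written for this example, not the checkCudaErrors helper from the samples, and the scale kernel is likewise invented for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical macro: aborts with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                    __FILE__, __LINE__, cudaGetErrorString(err_));     \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Illustrative kernel: multiplies every element of x by a.
__global__ void scale(int n, float a, float *x)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        x[i] = a * x[i];
}

int main()
{
    const int n = 1 << 16;
    float *x = nullptr;

    CUDA_CHECK(cudaMallocManaged(&x, n * sizeof(float)));
    for (int i = 0; i < n; i++)
        x[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(n, 2.0f, x);
    CUDA_CHECK(cudaGetLastError());       // catches errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel ran

    printf("x[0] = %f\n", x[0]);
    CUDA_CHECK(cudaFree(x));
    return 0;
}
```

Kernel launches do not return an error code directly, which is why the sketch checks cudaGetLastError right after the launch and again checks the result of the synchronization.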

Copyright © 2022 IDG Communications, Inc.
