S - [nn libs] Without optimized libraries, developers would have to reinvent primitives such as matrix multiplication from scratch

T - C++ libs

Many optimized GPU libraries

  • cuDNN for neural network primitives
  • cuBLAS for linear algebra
  • NCCL for multi-GPU communication

S - [nn libs] Barrier to entry for Python developers

  • most CUDA Toolkit libraries are C++-based

T - Python-based libraries

  • built upon the C++ toolkit
  • prefixed with “cu”
    • cuTile
      • breaks large matrices on the GPU into smaller, more manageable sub-matrices called “tiles”
      • takes full advantage of the GPU’s parallelism without needing to manage low-level details manually
    • cuPyNumeric
      • a drop-in replacement for the popular NumPy Python library
      • transparently offloads work to the GPU
      • significant performance gains for compute-intensive tasks such as large-scale numerical computations, matrix operations, and data analysis
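The tiling described above can be sketched in hand-written CUDA. This is not cuTile's actual API, just an illustration of what shared-memory tiling looks like when managed manually; the tile width of 16 is an arbitrary illustrative choice, and the kernel assumes N is a multiple of TILE.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile width

// Each thread block cooperatively loads TILE x TILE sub-matrices
// ("tiles") of A and B into fast shared memory, accumulates a
// partial dot product, then moves on to the next pair of tiles.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
  __shared__ float tileA[TILE][TILE];
  __shared__ float tileB[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  for (int t = 0; t < N / TILE; ++t) {
    // Stage one tile of each input matrix in shared memory
    tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
    tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
    __syncthreads();  // wait until the whole tile is loaded

    for (int k = 0; k < TILE; ++k)
      acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
    __syncthreads();  // wait before overwriting the tiles
  }
  C[row * N + col] = acc;
}
```

Libraries like cuTile exist precisely so that this staging, synchronization, and index bookkeeping does not have to be written by hand.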

R

Drop-in alternative

  • cuPyNumeric lowers the barrier for Python developers to harness GPU power without having to learn a completely new interface, making it a powerful drop-in alternative to NumPy for high-performance computing.
  • Together, these libraries lower the barrier to entry for Python developers building applications for NVIDIA GPUs with CUDA.
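The drop-in idea can be seen in a short sketch. The fallback import is my own addition so the same code also runs on CPU-only machines with plain NumPy; with cuPyNumeric installed, the identical array code is offloaded to the GPU.

```python
try:
    import cupynumeric as np  # GPU-accelerated drop-in (if installed)
except ImportError:
    import numpy as np        # identical code runs unchanged on the CPU

# Ordinary NumPy-style code; cuPyNumeric transparently
# offloads this work to the GPU.
a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
c = a @ b                     # large matrix multiply
print(c.shape)
```

Nothing in the array code mentions the GPU; only the import line changes.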

S - [runtime libs] Applications need to call the driver API to manage hardware

  • the libraries above are highly optimized for specific tasks
  • what is still needed is a wrapper around the low-level CUDA Driver API:
    • Memory management: e.g., cudaMalloc(), cudaMemcpy()
    • Device management: e.g., cudaGetDevice(), cudaSetDevice()
    • Kernel launches: e.g., kernel<<<grid, block>>>()
    • Error handling: e.g., cudaGetLastError()
    • Stream and event management: e.g., cudaStreamCreate()

T - CUDA runtime (cudart)

A

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void helloCUDA() {
  printf("Hello World from GPU!\n");
}

int main() {
  helloCUDA<<<1, 1>>>(); 
  cudaDeviceSynchronize(); 
  return 0;
}
  • #include <cuda_runtime.h>: includes the CUDA Runtime API header file.
  • __global__ void helloCUDA(): defines a kernel function that will run on the GPU.
  • helloCUDA<<<1, 1>>>();: launches the kernel on the GPU; <<<1, 1>>> specifies one block and one thread per block.
  • cudaDeviceSynchronize();: makes the CPU wait for the GPU to finish execution.

nvcc hello_world.cu -o hello_world

./hello_world
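The runtime calls listed earlier (memory management, kernel launch, error handling) look like this in practice. A minimal sketch, assuming a CUDA-capable GPU; the kernel name and sizes are illustrative:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void addOne(float *d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] += 1.0f;
}

int main() {
  const int N = 256;
  float h[N];
  for (int i = 0; i < N; ++i) h[i] = (float)i;

  float *d = NULL;
  cudaMalloc(&d, N * sizeof(float));                // memory management
  cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

  addOne<<<(N + 127) / 128, 128>>>(d, N);           // kernel launch
  cudaError_t err = cudaGetLastError();             // error handling
  if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

  cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d);
  printf("h[0] = %f\n", h[0]);  // should be 1.0 after the kernel
  return 0;
}
```

Each of these cudart calls wraps the corresponding lower-level driver functionality, which is exactly the convenience layer the S section describes.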

S - [compiler] Code compatibility across hardware versions

T - add an intermediate layer: PTX

PTX (Parallel Thread Execution), a.k.a. portable assembly for GPUs

  • an intermediate language generated by NVIDIA’s CUDA compiler (nvcc) to represent GPU kernel code
  • serves as a portable, assembly-like language that the driver further compiles into binary code for a specific GPU architecture (e.g., SM80 for Ampere)

A

nvcc -ptx hello_world.cu

R

  • Architecture-independent: PTX is not tied to a specific GPU.
  • Because the driver can JIT-compile embedded PTX, compiled programs stay forward-compatible with newer NVIDIA hardware, while nvcc can also target older architectures from the same source.
    • This is a big selling point of the NVIDIA programming model, and it’s something that Jensen Huang, NVIDIA’s CEO, reiterates with every new hardware release.
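This is why nvcc can embed both architecture-specific binary code (SASS) and PTX in one "fat binary". A sketch of the flags, with Ampere (compute capability 8.0) as an illustrative target:

```shell
# Embed SASS for Ampere (sm_80) plus PTX (compute_80) so the driver
# can JIT-compile the PTX for GPUs newer than Ampere.
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 \
     hello_world.cu -o hello_world
```

On an Ampere GPU the prebuilt SASS runs directly; on a later architecture the driver falls back to JIT-compiling the embedded PTX.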

Summary

CUDA is a platform and programming model

  • manage CUDA-enabled GPUs.
  • provide C/C++ APIs for programming and managing GPUs.

CUDA Toolkit includes

  • Many optimized GPU libraries
    • cuDNN for neural network primitives
    • cuBLAS for linear algebra
    • NCCL for multi-GPU communication
  • CUDA runtime (cudart) - a runtime library
    • A set of libraries and APIs that provide a high-level interface to the CUDA architecture. The CUDA runtime communicates directly with the NVIDIA driver to launch work and allocate memory on the GPU.
  • CUDA compiler (nvcc) - used to compile CUDA C++ kernels