S - [nn libs] Without libraries, developers would have to reinvent primitives such as matrix multiply from scratch
T - C++ libs
Many optimized neural-network libraries
- cuDNN for neural network primitives
- cuBLAS for linear algebra
- NCCL for multi-GPU communication
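A
A minimal sketch of what calling one of these libraries looks like: a small single-precision matrix multiply via cuBLAS (the file name sgemm.cu and the sample matrices are illustrative; error checks omitted for brevity):

#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 2;
    // cuBLAS assumes column-major storage: A = [[1, 3], [2, 4]], B = [[5, 7], [6, 8]]
    float hA[4] = {1, 2, 3, 4};
    float hB[4] = {5, 6, 7, 8};
    float hC[4] = {0};

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    printf("C = [%.0f %.0f; %.0f %.0f]\n", hC[0], hC[2], hC[1], hC[3]);  // [23 31; 34 46]

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

nvcc sgemm.cu -o sgemm -lcublas
./sgemm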
S - [nn libs] Barrier to entry for Python developers
- most of the CUDA toolkit libraries are C++ based
T - Python-based libraries
- built upon the C++ toolkit
- prefixed with “Cu”
- CuTile
- breaks large matrices on GPUs into smaller, more manageable sub-matrices called “tiles”
- takes full advantage of the GPU’s parallelism without requiring developers to manage low-level details manually (see the tiled-kernel sketch after this list)
- CuPyNumeric
- a drop-in replacement for the popular NumPy Python library
- offloads work to the GPU
- yields significant performance gains for compute-intensive tasks such as large-scale numerical computations, matrix operations, and data analysis
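A
CuTile’s actual API is not reproduced here; as an illustration of the tiling idea it automates, below is a hand-written CUDA C++ kernel that stages TILE x TILE sub-matrices of A and B through fast on-chip shared memory (a sketch; assumes n is a multiple of TILE, and the file name is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define TILE 16

// Each thread block computes one TILE x TILE tile of C = A * B,
// staging the matching tiles of A and B through shared memory.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; ++t) {
        // cooperatively load one tile of A and one tile of B
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();               // wait until the tile is fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();               // wait before overwriting the tile
    }
    C[row * n + col] = acc;
}

int main() {
    const int n = 64;                  // assumed to be a multiple of TILE
    size_t bytes = n * n * sizeof(float);
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n * n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("C[0] = %.1f (expected %.1f)\n", hC[0], 2.0f * n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}

nvcc tiled_matmul.cu -o tiled_matmul
./tiled_matmul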
R
Drop-in alternative
- CuPyNumeric lowers the barrier for Python developers to harness GPU power without having to learn a completely new interface (in many cases, just importing cupynumeric in place of numpy), making it a powerful drop-in alternative to NumPy for high-performance computing.
- More broadly, these Python wrappers lower the barrier to entry for Python developers to build applications for NVIDIA GPUs using CUDA.
S - [runtime libs] Call the driver API to manage hardware
- the libraries are highly optimized for their specific tasks
- they are built on top of the low-level CUDA Driver API, which the CUDA Runtime wraps to provide (each category is exercised in the second example at the end of this section):
- Memory management: e.g., cudaMalloc(), cudaMemcpy().
- Device management: e.g., cudaGetDevice(), cudaSetDevice().
- Kernel launch: e.g., kernel<<<grid, block>>>().
- Error handling: e.g., cudaGetLastError().
- Stream and event management: e.g., cudaStreamCreate().
T - runtime
A
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void helloCUDA() {
    printf("Hello World from GPU!\n");
}

int main() {
    helloCUDA<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
- #include <cuda_runtime.h>: Includes the CUDA Runtime API header file.
- __global__ void helloCUDA(): Defines a kernel function that will run on the GPU.
- helloCUDA<<<1, 1>>>();: Launches the kernel on the GPU. <<<1, 1>>> specifies one block and one thread per block.
- cudaDeviceSynchronize();: Makes the CPU wait for the GPU to finish execution.
nvcc hello_world.cu -o hello_world
./hello_world
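A second sketch exercising the remaining runtime-API categories listed above: device management, memory management, stream management, and error handling (the kernel and file names are illustrative):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void doubleElements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    // Device management
    int device = 0;
    cudaGetDevice(&device);
    cudaSetDevice(device);

    // Memory management
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Stream management: launch asynchronously on a dedicated stream
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    doubleElements<<<(n + 255) / 256, 256, 0, stream>>>(d, n);

    // Error handling: check that the launch succeeded
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    printf("h[1] = %.1f\n", h[1]);     // 2.0

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}

nvcc double_elements.cu -o double_elements
./double_elements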
S - [compiler] Code compatibility across hardware versions
T - add an intermediate layer: PTX
PTX (Parallel Thread Execution), a.k.a. assembly code for GPUs
- an intermediate language generated by NVIDIA’s CUDA compiler (nvcc) to represent GPU kernel code
- serves as a portable, assembly-like language that the driver can further compile into binary code for a specific GPU architecture (e.g., SM80 for Ampere)
A
nvcc -ptx hello_world.cu
R
- Architecture-Independent: PTX is not tied to a specific GPU
- binaries can be made backward-compatible with older NVIDIA GPUs by compiling machine code for them directly, and forward-compatible with newer hardware by embedding PTX that the driver JIT-compiles at load time
- This is a big selling point of the NVIDIA programming model, and it’s something that Jensen Huang, NVIDIA’s CEO, reiterates with every new hardware release.
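For example, nvcc can embed both precompiled machine code and PTX in a single fat binary; the flags below assume an Ampere (SM80) target, and on a newer GPU the driver JIT-compiles the embedded PTX:

nvcc -gencode arch=compute_80,code=sm_80 -gencode arch=compute_80,code=compute_80 hello_world.cu -o hello_world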
Summary
CUDA is a platform and programming model that
- manages CUDA-enabled GPUs.
- provides C/C++ APIs for programming and managing GPUs.
CUDA Toolkit includes
- Many optimized neural-network libraries
- cuDNN for neural network primitives
- cuBLAS for linear algebra
- NCCL for multi-GPU communication
- CUDA runtime (cudart) - a runtime library
- A set of libraries and APIs that provide a high-level interface to the CUDA architecture. The CUDA runtime communicates directly with the NVIDIA driver to launch work and allocate memory on the GPU.
- CUDA compiler (nvcc) - used to compile CUDA C++ kernels into PTX and architecture-specific binary code