S - [nn libs] Without optimized libraries, developers would have to reinvent primitives such as matrix multiplication from scratch

T - C++ libs

Many optimized GPU libraries

  • cuDNN for neural network primitives
  • cuBLAS for linear algebra
  • NCCL for multi-GPU communication

S - [nn libs] Barrier to entry for Python developers

  • most CUDA Toolkit libraries are C++-based

T - Python-based libraries

  • built upon the C++ toolkit
  • prefixed with “cu”
    • cuTile
      • breaks large matrices on the GPU into smaller, more manageable sub-matrices called “tiles”
      • takes full advantage of the GPU’s parallelism without needing to manage low-level details manually
    • cuPyNumeric
      • a drop-in replacement for the popular NumPy Python library
      • transparently offloads work to the GPU
      • significant performance gains for compute-intensive tasks such as large-scale numerical computations, matrix operations, and data analysis
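The tiling described above can be sketched in hand-written CUDA. This is not cuTile's actual API, just an illustration of what shared-memory tiling looks like when managed manually; the tile width of 16 is an arbitrary illustrative choice, and the kernel assumes N is a multiple of TILE.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile width

// Each thread block cooperatively loads TILE x TILE sub-matrices
// ("tiles") of A and B into fast shared memory, accumulates a
// partial dot product, then moves on to the next pair of tiles.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
  __shared__ float tileA[TILE][TILE];
  __shared__ float tileB[TILE][TILE];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  for (int t = 0; t < N / TILE; ++t) {
    // Stage one tile of each input matrix in shared memory
    tileA[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
    tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
    __syncthreads();  // wait until the whole tile is loaded

    for (int k = 0; k < TILE; ++k)
      acc += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
    __syncthreads();  // wait before overwriting the tiles
  }
  C[row * N + col] = acc;
}
```

Libraries like cuTile exist precisely so that this staging, synchronization, and index bookkeeping does not have to be written by hand.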

R

Drop-in alternative

  • cuPyNumeric lowers the barrier for Python developers to harness GPU power without having to learn a completely new interface, making it a powerful drop-in alternative to NumPy for high-performance computing.
  • Together, these libraries lower the barrier to entry for Python developers building applications for NVIDIA GPUs with CUDA.
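The drop-in idea can be seen in a short sketch. The fallback import is my own addition so the same code also runs on CPU-only machines with plain NumPy; with cuPyNumeric installed, the identical array code is offloaded to the GPU.

```python
try:
    import cupynumeric as np  # GPU-accelerated drop-in (if installed)
except ImportError:
    import numpy as np        # identical code runs unchanged on the CPU

# Ordinary NumPy-style code; cuPyNumeric transparently
# offloads this work to the GPU.
a = np.random.rand(512, 512)
b = np.random.rand(512, 512)
c = a @ b                     # large matrix multiply
print(c.shape)
```

Nothing in the array code mentions the GPU; only the import line changes.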

S - [runtime libs] Applications need to call the driver API to manage hardware

  • the libraries above are highly optimized for specific tasks
  • what is still needed is a wrapper around the low-level CUDA Driver API:
    • Memory management: e.g., cudaMalloc(), cudaMemcpy()
    • Device management: e.g., cudaGetDevice(), cudaSetDevice()
    • Kernel launches: e.g., kernel<<<grid, block>>>()
    • Error handling: e.g., cudaGetLastError()
    • Stream and event management: e.g., cudaStreamCreate()

T - CUDA runtime (cudart)

A

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void helloCUDA() {
  printf("Hello World from GPU!\n");
}

int main() {
  helloCUDA<<<1, 1>>>(); 
  cudaDeviceSynchronize(); 
  return 0;
}
  • #include <cuda_runtime.h>: includes the CUDA Runtime API header file.
  • __global__ void helloCUDA(): defines a kernel function that will run on the GPU.
  • helloCUDA<<<1, 1>>>();: launches the kernel on the GPU; <<<1, 1>>> specifies one block and one thread per block.
  • cudaDeviceSynchronize();: makes the CPU wait for the GPU to finish execution.

nvcc hello_world.cu -o hello_world

./hello_world
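The runtime calls listed earlier (memory management, kernel launch, error handling) look like this in practice. A minimal sketch, assuming a CUDA-capable GPU; the kernel name and sizes are illustrative:

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void addOne(float *d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] += 1.0f;
}

int main() {
  const int N = 256;
  float h[N];
  for (int i = 0; i < N; ++i) h[i] = (float)i;

  float *d = NULL;
  cudaMalloc(&d, N * sizeof(float));                // memory management
  cudaMemcpy(d, h, N * sizeof(float), cudaMemcpyHostToDevice);

  addOne<<<(N + 127) / 128, 128>>>(d, N);           // kernel launch
  cudaError_t err = cudaGetLastError();             // error handling
  if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

  cudaMemcpy(h, d, N * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d);
  printf("h[0] = %f\n", h[0]);  // should be 1.0 after the kernel
  return 0;
}
```

Each of these cudart calls wraps the corresponding lower-level driver functionality, which is exactly the convenience layer the S section describes.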

S - [compiler] Code compatibility across hardware versions

T - add an intermediate layer: PTX

PTX (Parallel Thread Execution), a.k.a. portable assembly for GPUs

  • an intermediate language generated by NVIDIA’s CUDA compiler (nvcc) to represent GPU kernel code
  • serves as a portable, assembly-like language that the driver further compiles into binary code for a specific GPU architecture (e.g., SM80 for Ampere)

A

nvcc -ptx hello_world.cu

R

  • Architecture-independent: PTX is not tied to a specific GPU.
  • Because the driver can JIT-compile embedded PTX, compiled programs stay forward-compatible with newer NVIDIA hardware, while nvcc can also target older architectures from the same source.
    • This is a big selling point of the NVIDIA programming model, and it’s something that Jensen Huang, NVIDIA’s CEO, reiterates with every new hardware release.
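This is why nvcc can embed both architecture-specific binary code (SASS) and PTX in one "fat binary". A sketch of the flags, with Ampere (compute capability 8.0) as an illustrative target:

```shell
# Embed SASS for Ampere (sm_80) plus PTX (compute_80) so the driver
# can JIT-compile the PTX for GPUs newer than Ampere.
nvcc -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_80,code=compute_80 \
     hello_world.cu -o hello_world
```

On an Ampere GPU the prebuilt SASS runs directly; on a later architecture the driver falls back to JIT-compiling the embedded PTX.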

Summary

CUDA is a platform and programming model

  • manage CUDA-enabled GPUs.
  • provide C/C++ APIs for programming and managing GPUs.

CUDA Toolkit includes

  • Many optimized GPU libraries
    • cuDNN for neural network primitives
    • cuBLAS for linear algebra
    • NCCL for multi-GPU communication
  • CUDA runtime (cudart) - a runtime library
    • A set of libraries and APIs that provide a high-level interface to the CUDA architecture. The CUDA runtime communicates directly with the NVIDIA driver to launch work and allocate memory on the GPU.
  • CUDA compiler (nvcc) - used to compile CUDA C++ kernels