1. S - CUDA Intro

S - [nn libs] Reinventing primitives, such as matrix multiply, from scratch
T - C++ libs: many optimized neural-network libraries
    cuDNN for neural-network primitives
    cuBLAS for linear algebra
    NCCL for multi-GPU communication
S - [nn libs] Barrier to entry for Python developers: most of the CUDA toolkit libraries are C++ based
T - Python-based libraries built upon the C++ toolkit, prefixed with “Cu”
    CuTile: breaks large matrices on GPUs into smaller, more manageable sub-matrices called “tiles”, taking full advantage of the GPU’s parallelism without needing to manage low-level details manually
    CuPyNumeric: a drop-in replacement for the popular numpy Python library that offloads work to the GPU, giving significant performance gains for compute-intensive tasks such as large-scale numerical computations, matrix operations, and data analysis
R ...
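The tiling idea mentioned in the excerpt can be sketched on the CPU in plain C++. This is only an illustration of splitting a large matrix multiply into sub-matrix "tiles"; it is not the CuTile API, and the function name `tiled_matmul` and the `tile` parameter are hypothetical names chosen for this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Multiply two n x n row-major matrices tile by tile.
// `tile` is a hypothetical tile edge length; it does not need to divide n.
std::vector<float> tiled_matmul(const std::vector<float>& a,
                                const std::vector<float>& b,
                                std::size_t n, std::size_t tile) {
    std::vector<float> c(n * n, 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        for (std::size_t j0 = 0; j0 < n; j0 += tile) {
            for (std::size_t k0 = 0; k0 < n; k0 += tile) {
                // Clamp the tile bounds at the matrix edge.
                const std::size_t i1 = std::min(i0 + tile, n);
                const std::size_t j1 = std::min(j0 + tile, n);
                const std::size_t k1 = std::min(k0 + tile, n);
                // Accumulate the product of one tile of A and one tile of B
                // into the matching tile of C.
                for (std::size_t i = i0; i < i1; ++i) {
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float aik = a[i * n + k];
                        for (std::size_t j = j0; j < j1; ++j) {
                            c[i * n + j] += aik * b[k * n + j];
                        }
                    }
                }
            }
        }
    }
    return c;
}
```

On a GPU, each output tile would typically be handled by its own group of threads in parallel; the loops here run them one after another only to keep the sketch small.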

July 3, 2025 · 3 min · Hongyao Tang

2. T - CUDA Thread & Memory Hierarchy

Thread HW POV (hardware execution of threads): focus on the core rather than the GPU
A GPU
COMP
|_2 division
|_4 GPC (Graphics Processing Cluster)
|_8 TPC (Texture/Processing Cluster)
|_2 SM (Streaming Multiprocessor)
MEM
|_other components
|_HBM3e + memory controller
|_L2 Cache
|_NVLink + High-speed hub
|_PCIe
A SM
INST
|_L1 instruction cache
|_4 *
    |_L0 instruction cache
COMP
|_4 *
    |_Cores - 32 FP32 cores + 16 FP64 cores + 16 INT32 cores + 1 mixed-precision tensor core
    |_Units - 8 LD/ST units + 4 Special Function Units (SFU) + 1 Dispatch unit
    |_16K * 32-bit register file
    |_Warp scheduler
MEM
|_256KB L1 data cache/shared memory
|_Tensor Memory Accelerator ...

July 3, 2025 · 3 min · Hongyao Tang

3. A - CUDA Programming

S - Run on GPU
T - Kernel & Kernel Launch
    Host: the CPU hardware itself
    Device: the GPU hardware itself
    Kernel: a function that runs on the GPU
    Kernel launch: running a kernel from the host
    Kernel execution configuration
A
CPU
    #include <stdio.h>
    void c_hello() { printf("Hello World!\n"); }
    int main() { c_hello(); return 0; }
hello.cu
    #include <stdio.h>
    __global__ void cuda_hello() { printf("Hello World from GPU!\n"); }
    int main() {
        cuda_hello<<<1,1>>>();
        cudaDeviceSynchronize();  // wait for the kernel so its printf output is flushed
        return 0;
    }
__global__ specifier: marks a function as a kernel
cuda_hello(); host-code call, called a kernel launch
<<<N,N>>> syntax: provides the kernel execution configuration
Compile & run ...
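To make the execution configuration more concrete, here is a minimal host-only C++ sketch that imitates what a launch of the form `kernel<<<num_blocks, threads_per_block>>>()` asks for: one kernel invocation per (block, thread) pair. The helpers `emulate_launch` and `record_global_ids` are hypothetical names invented for this illustration and are not part of CUDA; a real GPU runs these invocations in parallel and in no guaranteed order, while this sketch runs them sequentially.

```cpp
#include <vector>

// Hypothetical CPU stand-in for kernel<<<num_blocks, threads_per_block>>>():
// calls `kernel` once for every (block index, thread index) pair.
template <typename Kernel>
void emulate_launch(unsigned num_blocks, unsigned threads_per_block, Kernel kernel) {
    for (unsigned block = 0; block < num_blocks; ++block) {
        for (unsigned thread = 0; thread < threads_per_block; ++thread) {
            kernel(block, thread);
        }
    }
}

// Records the flat index each emulated thread would compute, using the
// common 1D pattern: block index * threads per block + thread index.
std::vector<unsigned> record_global_ids(unsigned num_blocks, unsigned threads_per_block) {
    std::vector<unsigned> ids;
    emulate_launch(num_blocks, threads_per_block,
                   [&](unsigned block, unsigned thread) {
                       ids.push_back(block * threads_per_block + thread);
                   });
    return ids;
}
```

With a configuration of 1 block and 1 thread, as in `cuda_hello<<<1,1>>>()`, the body runs exactly once; larger values multiply the number of invocations.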

July 3, 2025 · 10 min · Hongyao Tang