CUDA | Hongyao Tang

1. S - CUDA Intro

S - [nn libs] Reinventing primitives (matrix multiply, for example) from scratch
T - C++ libs: many optimized neural-network libraries
    - cuDNN for neural network primitives
    - cuBLAS for linear algebra
    - NCCL for multi-GPU communication

S - [nn libs] Barrier to entry for Python developers: most of the CUDA toolkit libraries are C++ based
T - Python-based libraries built upon the C++ toolkit, prefixed with "cu"
    - cuTile
        - breaks large matrices on GPUs into smaller, more manageable sub-matrices called "tiles"
        - takes full advantage of the GPU's parallelism without needing to manage low-level details manually
    - cuPyNumeric
        - a drop-in replacement for the popular NumPy Python library
        - offloads work to the GPU
        - significant performance gains for compute-intensive tasks such as large-scale numerical computations, matrix operations, and data analysis
R ...

2. T - CUDA Thread & Memory Hierarchy

3. A - CUDA Programming

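S - Run on GPU
T - Kernel & Kernel Launch
    - Host: the CPU hardware itself
    - Device: the GPU hardware itself
    - Kernel: a function that runs on the GPU
    - Kernel launch: running a kernel from the host
    - Kernel execution configuration
A
    CPU

        #include <stdio.h>

        void c_hello() {
            printf("Hello World!\n");
        }

        int main() {
            c_hello();
            return 0;
        }

    hello.cu

        #include <stdio.h>

        __global__ void cuda_hello() {
            printf("Hello World from GPU!\n");
        }

        int main() {
            cuda_hello<<<1,1>>>();
            // Kernel launches are asynchronous: wait for the kernel so its
            // printf output is flushed before the program exits.
            cudaDeviceSynchronize();
            return 0;
        }

    - __global__ specifier: marks a function as a kernel
    - cuda_hello<<<1,1>>>(); a call from host code, called a kernel launch
    - <<<N,N>>> syntax: provides the kernel execution configuration (number of blocks, threads per block)
    Compile & run ...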
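To make the execution configuration more concrete, here is a minimal sketch that launches a kernel with more than one block and more than one thread per block. The kernel name record_ids and the helper launch_record_ids are hypothetical, chosen only for illustration; the sketch assumes a CUDA-capable GPU and uses a one-dimensional configuration. Each thread computes its global index from the built-in variables blockIdx, blockDim, and threadIdx and writes it to an array, so the host can check that exactly blocks x threads threads ran.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread records its own global index.
__global__ void record_ids(int *out) {
    // Global index = block offset + position inside the block.
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid] = gid;
}

// Hypothetical helper: launches blocks * threads threads and copies the
// result into host_out, which must hold blocks * threads ints.
cudaError_t launch_record_ids(int blocks, int threads, int *host_out) {
    int n = blocks * threads;
    int *dev_out = nullptr;
    cudaError_t err = cudaMalloc(&dev_out, n * sizeof(int));
    if (err != cudaSuccess) return err;

    // Execution configuration: <<<number of blocks, threads per block>>>.
    record_ids<<<blocks, threads>>>(dev_out);
    err = cudaGetLastError();  // reports an invalid launch configuration
    if (err == cudaSuccess) {
        // This blocking copy also waits for the kernel to finish.
        err = cudaMemcpy(host_out, dev_out, n * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dev_out);
    return err;
}

int main() {
    const int blocks = 2, threads = 4;
    int ids[blocks * threads];

    cudaError_t err = launch_record_ids(blocks, threads, ids);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < blocks * threads; ++i) printf("%d ", ids[i]);
    printf("\n");
    return 0;
}
```

With 2 blocks of 4 threads, the array should come back holding 0 through 7, one entry per thread. Writing results to memory instead of calling printf inside the kernel keeps the check deterministic, since the order in which blocks print is not guaranteed.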