1. HW Optimization

ML Job Workflow: when a researcher submits a training job, the scheduler reserves nodes; the OS provides the GPU devices and memory allocations through the NVIDIA driver; the container provides the correct software environment, including the optimized, hardware-aware CUDA libraries; and user code (e.g. PyTorch, TensorFlow, JAX) calls these CUDA libraries, which ultimately communicate with the driver and hardware, as sketched below. Quick refresher: From Superchip to SuperPOD, GB200 & GB200 NVL72 specs ...
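A minimal sketch of how user code sits on top of that stack, assuming a CUDA-enabled PyTorch build (the queried values are illustrative and depend on the node):

```python
import torch

# User code (PyTorch) -> CUDA libraries -> NVIDIA driver -> hardware.
if torch.cuda.is_available():
    print(torch.version.cuda)             # CUDA toolkit version PyTorch was built against
    print(torch.cuda.get_device_name(0))  # GPU model reported by the driver
    print(torch.cuda.device_count())      # devices the scheduler/OS exposed to this job

    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x                  # matmul dispatched to the CUDA libraries (cuBLAS)
    torch.cuda.synchronize()   # wait for the GPU kernel to finish
```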

July 2, 2025 · 7 min · Hongyao Tang

2. OS Optimization

Orchestrator / Job scheduler: SLURM is deployed for training clusters, while Kubernetes is typically favored for inference clusters. The open-source Slinky project and CoreWeave’s SUNK product are examples of integrated solutions that simplify cluster management across training and inference workloads. S - HW-topology-unaware scheduling of pod placements. The requirement is to allocate resources to containers in a manner that is aware of the hardware topology, including the NUMA node and network bandwidth configuration. Kubernetes is not topology-aware: it treats each GPU as a resource but doesn’t know whether GPU0 and GPU1 are on the same NUMA node or share the same NVLink interconnect. T - Use node labels to explicitly mark GPU or node topology, as in the sketch below. ...
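A minimal sketch of the node-label approach, assuming the official kubernetes Python client; the label keys (topology.example.com/...) and node/pod names are hypothetical, not a Kubernetes standard:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Label a node with its (externally discovered) GPU topology so pods that
# need fast GPU-to-GPU interconnect can be steered onto matching nodes.
v1.patch_node(
    "gpu-node-01",  # hypothetical node name
    {"metadata": {"labels": {
        "topology.example.com/nvlink-domain": "0",  # hypothetical label key
        "topology.example.com/numa": "0",           # hypothetical label key
    }}},
)

# A pod that asks to land only on nodes in NVLink domain 0.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trainer"},
    "spec": {
        "nodeSelector": {"topology.example.com/nvlink-domain": "0"},
        "containers": [{
            "name": "trainer",
            "image": "pytorch/pytorch:latest",
            "resources": {"limits": {"nvidia.com/gpu": "2"}},
        }],
    },
}
v1.create_namespaced_pod(namespace="default", body=pod)
```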

July 2, 2025 · 17 min · Hongyao Tang

3.1 CUDA Basic Optimization

Thread / Distributed training. S - Training on a single GPU. T - Just move the tensors to the same GPU device and PyTorch handles the rest. Move the tensor and the model to the same GPU device; PyTorch handles the rest. Computation happens on the GPU, so the output logits naturally live on the GPU as well. Back to the CPU: if you pass a GPU tensor to a non-PyTorch library such as NumPy, Pandas, or Matplotlib, you must call .cpu() explicitly, otherwise it raises an error. Automatic copying only happens for read-only operations such as print() or str(): you can print() a GPU tensor directly without manually moving it back to host memory with .cpu(), because PyTorch copies the data from GPU to CPU behind the scenes for printing. A - Basic:

```python
import torch

tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])
# Two tensors can be added. By default, the computation runs on the CPU:
print(tensor_1 + tensor_2)
# tensor([5., 7., 9.])

# If a PyTorch tensor lives on a device, its operations execute on that same device.
# Move the tensors to the GPU with .to("cuda"), .to("cuda:0"), .to("cuda:1"), ...
tensor_1 = tensor_1.to("cuda")
tensor_2 = tensor_2.to("cuda")
print(tensor_1 + tensor_2)  # the addition runs on the GPU
# tensor([5., 7., 9.], device='cuda:0')

# All tensors must live on the same device; if one is on the CPU and the
# other on the GPU, the computation fails:
tensor_1 = tensor_1.to("cpu")
print(tensor_1 + tensor_2)  # raises RuntimeError: tensors are on different devices
```

tensor and model ...
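A short sketch of the tensor-and-model case the excerpt leads into; the tiny Linear model and shapes are made up for illustration:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(3, 2).to(device)          # move the model's parameters to the device
inputs = torch.tensor([[1., 2., 3.]]).to(device)  # move the input to the same device

logits = model(inputs)  # runs on the GPU; the logits live on the GPU too
print(logits)           # OK: print() copies to the CPU behind the scenes

# Handing the result to NumPy requires an explicit round trip to host memory:
array = logits.detach().cpu().numpy()
```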

July 2, 2025 · 16 min · Hongyao Tang

3.2 CUDA Advanced Optimization

July 2, 2025 · 0 min · Hongyao Tang