1. HW Optimization

ML Job Workflow. When a researcher submits a training job: the scheduler reserves nodes; the OS provides the GPU devices and memory allocations via the NVIDIA driver; the container provides the correct software environment, including the optimized, hardware-aware CUDA libraries; and user code (e.g. PyTorch, TensorFlow, JAX) uses these CUDA libraries, which ultimately communicate with the driver and hardware. Quick refresher: From Superchip to SuperPOD; GB200 & GB200 NVL72 Specs ...
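A minimal sketch of how user code can probe each layer of that stack, assuming PyTorch as the framework:

    import torch

    # Framework layer: the CUDA toolkit / cuDNN versions PyTorch was built against
    print(torch.version.cuda)                 # e.g. '12.1'
    print(torch.backends.cudnn.version())     # an optimized, hardware-aware CUDA library

    # OS/driver layer: is a GPU device exposed by the NVIDIA driver?
    if torch.cuda.is_available():
        # Hardware layer: device properties reported through the driver
        print(torch.cuda.get_device_name(0))        # device name from the driver
        print(torch.cuda.get_device_capability(0))  # compute capability, e.g. (9, 0)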

July 2, 2025 · 7 min · Hongyao Tang

2. OS Optimization

Orchestrator / job scheduler. SLURM is deployed for training clusters; Kubernetes is typically favored for inference clusters. The open-source Slinky project and CoreWeave’s SUNK product are examples of integrated solutions that simplify cluster management across training and inference workloads. S - HW-topology-unaware scheduling of pod placements. The requirement is to allocate resources to containers in a manner that is aware of the hardware topology, including the NUMA node and network bandwidth configuration. Kubernetes is not topology aware: it treats each GPU as a resource but doesn’t know whether GPU0 and GPU1 are on the same NUMA node or use the same NVLink interconnect. T - Use node labels to explicitly mark GPU or node topology, as sketched below. ...
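A minimal sketch of that labeling step, assuming the official kubernetes Python client; the label keys (example.com/...) and node name are illustrative, not any Kubernetes standard:

    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Hypothetical topology labels, discovered out-of-band (e.g. via nvidia-smi topo):
    body = {"metadata": {"labels": {
        "example.com/numa-node": "0",       # this node's GPUs sit on NUMA node 0
        "example.com/nvlink-domain": "a",   # these GPUs share one NVLink interconnect
    }}}
    v1.patch_node("gpu-node-01", body)

Pods that need co-located GPUs can then pin themselves to matching nodes with a nodeSelector on these labels.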

July 2, 2025 · 17 min · Hongyao Tang

3.1 CUDA Basic Optimization

Thread / Distributed training
S - Training on a single GPU
T - Just move the tensors to the same GPU device and PyTorch handles the rest. Move the tensor and the model to the same GPU device; PyTorch handles the rest. The computation happens on the GPU, so the output logits naturally end up on the GPU as well. Back to the CPU: if you want to pass a GPU tensor to a non-PyTorch library such as NumPy, Pandas, or Matplotlib, you must call .cpu() explicitly, otherwise it raises an error. The automatic copy only happens for read-only operations such as print() or str(): you can print() a GPU tensor directly, without first moving it back to host memory with .cpu(); PyTorch automatically copies the data from the GPU to the CPU in the background so it can be printed.
A - Basic

    import torch

    tensor_1 = torch.tensor([1., 2., 3.])
    tensor_2 = torch.tensor([4., 5., 6.])
    # Two tensors can be added; by default the computation runs on the CPU:
    print(tensor_1 + tensor_2)
    # tensor([5., 7., 9.])

    # If a PyTorch tensor lives on a device, its operations run on that same device.
    # Move the tensors to the GPU: .to("cuda"), .to("cuda:0"), .to("cuda:1")
    tensor_1 = tensor_1.to("cuda")
    tensor_2 = tensor_2.to("cuda")
    print(tensor_1 + tensor_2)  # the addition runs on the GPU
    # tensor([5., 7., 9.], device='cuda:0')

    # All tensors must be on the same device; if one is on the CPU and the other
    # on the GPU, the computation fails with a RuntimeError:
    tensor_1 = tensor_1.to("cpu")
    print(tensor_1 + tensor_2)

tensor and model ...
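Picking up where the excerpt cuts off, a minimal sketch of the tensor-and-model case; the toy Linear model is illustrative:

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(3, 2).to(device)   # move the model to the GPU...
    x = torch.randn(4, 3, device=device)       # ...and create the input on the same device

    logits = model(x)   # computed on the GPU; the logits stay on the GPU
    print(logits)       # read-only: PyTorch copies to the CPU behind the scenes

    arr = logits.detach().cpu().numpy()  # explicit .cpu() before handing off to NumPy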

July 2, 2025 · 16 min · Hongyao Tang

3.2 CUDA Advanced Optimization

July 2, 2025 · 0 min · Hongyao Tang

4. LLM Principles Optimization

Organized by way of thinking, this page shows issues and solutions in each phase of DL. Data quantity, quality, and public availability impact results. New datasets. S - Need more of this kind of data. DeepMind LLMs: Gopher was one of DeepMind's early large language models; Chinchilla, named after another small rodent, keeps the same "animal family" naming style as Gopher. Model parameters vs. training data volume: before Chinchilla, the mainstream view was that more model parameters always meant better performance, even with the training data volume held constant. The core claim of the Chinchilla scaling laws is that under a fixed compute budget, the optimal approach is to grow model size and training data volume together, as sketched below. T - FineWeb Dataset. Other datasets are comparatively small: the English CommonCrawl section of Matrix (1.3T tokens), English CC-100 (70B tokens), Colossal-OSCAR (850B tokens), RedPajama ...
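A rough worked example of that rule, assuming the commonly cited approximations C ≈ 6·N·D training FLOPs and a compute-optimal ratio of about 20 tokens per parameter:

    # With D = 20*N, C = 6*N*D = 120*N^2, so N = sqrt(C / 120).
    # The constants are commonly cited approximations, not exact values.
    def compute_optimal(compute_budget_flops: float) -> tuple[float, float]:
        n_params = (compute_budget_flops / 120) ** 0.5
        n_tokens = 20 * n_params
        return n_params, n_tokens

    # Chinchilla's reported budget of ~5.76e23 FLOPs recovers its actual shape:
    n, d = compute_optimal(5.76e23)
    print(f"{n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")  # ~69B params, ~1.4T tokens

Note how both N and D scale as the square root of the compute budget: doubling compute should grow the model and the dataset together, not the model alone.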

July 4, 2025 · 24 min · Hongyao Tang