Loading lesson page...
AI From Scratch/Lesson 19/~60 minutes
DualPipe Parallelism
DeepSeek-V3 was trained on 2,048 H800 GPUs with MoE experts scattered across nodes. Cross-node expert all-to-all communication cost 1 GPU-hour of comm for every 1 GPU-hour of compute. GPUs were idle half the time. DualPipe (DeepSeek, Dec 2...
LearnPython (stdlibschedule simulator)No prerequisites