AI From Scratch / Lesson 19 / ~60 minutes

DualPipe Parallelism

DeepSeek-V3 was trained on 2,048 H800 GPUs with MoE experts scattered across nodes. Cross-node expert all-to-all communication cost 1 GPU-hour for every 1 GPU-hour of compute, so with the two run back to back, GPUs were idle half the time. DualPipe (DeepSeek, Dec 2...
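
The "idle half the time" figure follows directly from the 1:1 ratio of communication to compute. The sketch below is a toy model of that arithmetic only, not DualPipe's actual schedule. The function name `utilization` and the `overlap_fraction` parameter are hypothetical, introduced here for illustration, and the model assumes communication can be hidden behind compute whenever the two overlap.

```python
def utilization(compute_hours, comm_hours, overlap_fraction=0.0):
    """Fraction of wall-clock time a GPU spends computing.

    overlap_fraction (hypothetical knob) is the share of communication
    hidden behind compute: 0.0 = fully serialized, 1.0 = fully hidden.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    # Only communication that is NOT overlapped adds to wall-clock time.
    exposed_comm = comm_hours * (1.0 - overlap_fraction)
    return compute_hours / (compute_hours + exposed_comm)


# 1 GPU-hour of communication per 1 GPU-hour of compute, as in the text.
serialized = utilization(1.0, 1.0)
fully_hidden = utilization(1.0, 1.0, overlap_fraction=1.0)
print(f"serialized: {serialized:.0%}, fully overlapped: {fully_hidden:.0%}")
# prints "serialized: 50%, fully overlapped: 100%"
```

Under these assumptions, every bit of communication that a schedule manages to hide behind compute moves utilization from 50% toward 100%, which is the gap an overlapping pipeline schedule aims to close.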

Learn / Python (stdlib schedule simulator) / No prerequisites