Phase 19: Capstone Projects
AI From Scratch/Lesson 80/~90 min

Sharded Checkpoint and Atomic Resume

A 70B-parameter training job is interrupted by a node failure every few hours. The checkpoint format decides whether each failure costs you 30 minutes of progress or 30 hours. A sharded checkpoint writes every rank's shard in parallel and records ownership in a manifest...
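To make the idea concrete, here is a minimal sketch of a sharded save with a manifest, using only the Python standard library. It is not the lesson's implementation: the names `save_checkpoint`, `load_checkpoint`, `write_shard`, the `shard_XXXXX.bin` and `manifest.json` file layout, and the use of `pickle` and threads to stand in for per-rank tensor writes are all assumptions made for illustration. The key point it tries to show is ordering: each shard is written to a temporary file and renamed into place, and the manifest is written last, so a reader that finds a manifest can expect every shard it lists to be complete.

```python
from __future__ import annotations

import hashlib
import json
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def _atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the same directory, fsync, then rename into place."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def write_shard(ckpt_dir: Path, rank: int, state: dict) -> dict:
    """Write one rank's shard and return its manifest entry (hypothetical layout)."""
    data = pickle.dumps(state)  # assumption: a real job would serialize tensors
    name = f"shard_{rank:05d}.bin"
    _atomic_write(ckpt_dir / name, data)
    return {
        "rank": rank,
        "file": name,
        "sha256": hashlib.sha256(data).hexdigest(),
        "keys": sorted(state),  # which state entries this rank owns
    }


def save_checkpoint(ckpt_dir: Path, step: int, states: dict[int, dict]) -> None:
    """Write all shards in parallel, then publish the manifest last."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    # Threads stand in for ranks here; in a real job each rank writes its own shard.
    with ThreadPoolExecutor() as pool:
        entries = list(
            pool.map(
                lambda kv: write_shard(ckpt_dir, kv[0], kv[1]),
                sorted(states.items()),
            )
        )
    manifest = {"step": step, "shards": entries}
    _atomic_write(ckpt_dir / "manifest.json", json.dumps(manifest).encode())


def load_checkpoint(ckpt_dir: Path) -> tuple[int, dict[int, dict]] | None:
    """Return (step, per-rank state), or None if no manifest was ever published."""
    manifest_path = ckpt_dir / "manifest.json"
    if not manifest_path.exists():
        return None  # save never finished; treat the directory as absent
    manifest = json.loads(manifest_path.read_bytes())
    states: dict[int, dict] = {}
    for entry in manifest["shards"]:
        data = (ckpt_dir / entry["file"]).read_bytes()
        if hashlib.sha256(data).hexdigest() != entry["sha256"]:
            raise ValueError(f"shard {entry['file']} failed its checksum")
        states[entry["rank"]] = pickle.loads(data)
    return manifest["step"], states
```

A resume path under these assumptions would call `load_checkpoint` on the newest checkpoint directory and fall back to an older one when it returns `None` or raises `ValueError`. Writing the manifest last is the design choice that matters: a crash partway through a save leaves shards without a manifest, which the loader ignores instead of loading a half-written checkpoint.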

Build/Python/No prerequisites