Loading lesson page...
AI From Scratch/Lesson 80/~90 min
Sharded Checkpoint and Atomic Resume
A 70B-parameter training job is paused by a node failure every few hours. The checkpoint format decides whether you lose 30 minutes or 30 hours. A sharded checkpoint writes every rank's shard in parallel and records ownership in a manifest...
BuildPythonNo prerequisites