AI From Scratch/Lesson 78/~90 min
ZeRO Optimizer State Sharding
Adam stores two moment estimates per parameter, both in float32, so a 7B-parameter model carries 56 GB of optimiser state (7 × 10⁹ parameters × 2 moments × 4 bytes). ZeRO stage 1 shards that state across N ranks, so each rank holds only 1/N of it. After the local step the updated paramete...
Build · Python · No prerequisites
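The arithmetic and the sharding idea in the summary can be sketched in plain Python. This is a minimal single-process illustration, not the lesson's implementation: `adam_state_bytes`, `shard_bounds`, `ShardedAdam`, and `simulated_step` are hypothetical names, the "ranks" are just objects in one process, and the concatenation at the end stands in for the collective that a real multi-rank setup would use to share the updated parameter shards (assumed here to be an all-gather, as in standard ZeRO stage 1).

```python
import math

BYTES_FP32 = 4  # float32 is 4 bytes


def adam_state_bytes(num_params, moments=2, bytes_per_value=BYTES_FP32):
    """Bytes of Adam state: `moments` float32 values per parameter."""
    return num_params * moments * bytes_per_value


def shard_bounds(num_params, world_size, rank):
    """Half-open [start, end) slice owned by `rank`; the last shard may be smaller."""
    shard = -(-num_params // world_size)  # integer ceiling division
    start = min(rank * shard, num_params)
    end = min(start + shard, num_params)
    return start, end


class ShardedAdam:
    """Adam that keeps moments only for the slice of parameters this rank owns."""

    def __init__(self, num_params, world_size, rank,
                 lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.start, self.end = shard_bounds(num_params, world_size, rank)
        owned = self.end - self.start
        self.m = [0.0] * owned  # first moment, 1/N of the full state
        self.v = [0.0] * owned  # second moment, 1/N of the full state
        self.lr, self.betas, self.eps = lr, betas, eps
        self.t = 0

    def step(self, params, grads):
        """Return updated values for this rank's slice only."""
        self.t += 1
        b1, b2 = self.betas
        updated = []
        for i in range(self.start, self.end):
            j = i - self.start
            g = grads[i]
            self.m[j] = b1 * self.m[j] + (1 - b1) * g
            self.v[j] = b2 * self.v[j] + (1 - b2) * g * g
            m_hat = self.m[j] / (1 - b1 ** self.t)  # bias correction
            v_hat = self.v[j] / (1 - b2 ** self.t)
            updated.append(params[i] - self.lr * m_hat / (math.sqrt(v_hat) + self.eps))
        return updated


def simulated_step(params, grads, optimizers):
    """Each rank updates its shard; concatenating in rank order stands in
    for the all-gather assumed above."""
    new_params = []
    for opt in optimizers:
        new_params.extend(opt.step(params, grads))
    return new_params


if __name__ == "__main__":
    total = adam_state_bytes(7_000_000_000)
    print(f"full Adam state: {total / 1e9:.0f} GB")
    print(f"per rank with N=8: {total / 8 / 1e9:.0f} GB")

    params = [1.0, -2.0, 3.0, 0.5, -0.25]
    grads = [0.1, -0.2, 0.3, 0.0, 0.05]
    ranks = [ShardedAdam(len(params), world_size=2, rank=r) for r in range(2)]
    print(simulated_step(params, grads, ranks))
```

Because Adam's update is element-wise, splitting the moments by index range changes where the state lives but not the result: the sharded step produces the same parameters as a single optimiser holding the full state.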