AI From Scratch/Lesson 34/~70 minutes
Gradient Checkpointing and Activation Recomputation
Backprop keeps every intermediate activation. At 70B parameters and 128K context that is 3 TB of activations per rank. Checkpointing trades FLOPs for memory: recompute instead of save. The question is which segments to drop, and the answer...
Build | Python (with numpy, optional torch) | No prerequisites
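To make the "recompute instead of save" trade concrete, here is a minimal numpy sketch, not the lesson's own implementation. It assumes a toy chain of tanh layers and a loss equal to the sum of the outputs; the names `layer`, `layer_backward`, `backward_full`, `backward_checkpointed`, and the `segment` parameter are all hypothetical. The baseline keeps the input of every layer, while the checkpointed version keeps only the input of each segment and rebuilds the rest during the backward pass.

```python
import numpy as np

def layer(x, w):
    # One toy layer: linear map followed by tanh.
    return np.tanh(x @ w)

def layer_backward(x, w, grad_out):
    # Rebuild the layer output from its input, then apply the chain rule.
    y = np.tanh(x @ w)
    grad_pre = grad_out * (1.0 - y ** 2)
    return grad_pre @ w.T, x.T @ grad_pre  # (grad wrt input, grad wrt weight)

def backward_full(x, weights):
    # Baseline: store the input of every layer during the forward pass.
    saved, h = [], x
    for w in weights:
        saved.append(h)
        h = layer(h, w)
    grad = np.ones_like(h)  # assumed loss: sum of the final outputs
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        grad, grads[i] = layer_backward(saved[i], weights[i], grad)
    return grads, len(saved)

def backward_checkpointed(x, weights, segment):
    # Store only the input of each segment; recompute the rest in backward.
    n = len(weights)
    checkpoints, h = {}, x
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = h
        h = layer(h, w)
    grad = np.ones_like(h)
    grads = [None] * n
    peak, recomputed = len(checkpoints), 0
    for start in reversed(range(0, n, segment)):
        end = min(start + segment, n)
        # Rebuild this segment's layer inputs from its checkpoint.
        local = [checkpoints[start]]
        for i in range(start, end - 1):
            local.append(layer(local[-1], weights[i]))
            recomputed += 1
        # Live tensors: all checkpoints plus the rebuilt inputs of one segment.
        peak = max(peak, len(checkpoints) + len(local) - 1)
        for i in reversed(range(start, end)):
            grad, grads[i] = layer_backward(local[i - start], weights[i], grad)
    return grads, peak, recomputed

rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(8, 8)) for _ in range(16)]
x = rng.normal(size=(4, 8))

grads_full, n_full = backward_full(x, weights)
grads_ckpt, n_ckpt, n_extra = backward_checkpointed(x, weights, segment=4)

print("stored activations, full:", n_full)
print("peak stored activations, checkpointed:", n_ckpt)
print("extra layer forward passes:", n_extra)
print("gradients match:",
      all(np.allclose(a, b) for a, b in zip(grads_full, grads_ckpt)))
```

In this toy setup the checkpointed pass holds fewer layer inputs at once and pays for it with extra forward evaluations, while producing the same gradients. The counts are tensors rather than bytes, and for simplicity the sketch keeps every checkpoint alive until the end instead of freeing each one after its segment is processed.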