AI From Scratch/Lesson 34/~70 minutes

Gradient Checkpointing and Activation Recomputation

Backprop keeps every intermediate activation. At 70B parameters and 128K context that is 3 TB of activations per rank. Checkpointing trades FLOPs for memory: recompute instead of save. The question is which segments to drop, and the answer...

BuildPython (with numpyoptional torch)No prerequisites

Loading lesson page...