AI From Scratch/Lesson 16/~60 minutes
Differential Attention (V2)
Softmax attention assigns a small but nonzero probability to every non-matching token. Across a 100k-token context, that noise adds up and drowns out the signal. Differential Transformer (Ye et al., ICLR 2025) fixes it by computing attention as the differ...
Build / Python (stdlib) / No prerequisites
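To make the noise claim concrete, here is a minimal stdlib sketch. It is a toy setup rather than the paper's implementation: one matching token with a hypothetical score of 5.0, every other token at 0.0, and a second softmax map that is assumed to see only the noise. The names `match_weight`, `diff_attention`, and `signal_to_noise`, and the fixed `lam=0.9`, are illustrative assumptions; in the Differential Transformer the subtraction weight is learned, and the two maps come from separate query/key projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def match_weight(n_tokens, match_score=5.0, other_score=0.0):
    """Attention weight on the single matching token (index 0)."""
    scores = [match_score] + [other_score] * (n_tokens - 1)
    return softmax(scores)[0]

def diff_attention(scores_1, scores_2, lam=0.9):
    """Toy differential map: softmax(scores_1) - lam * softmax(scores_2).

    lam is a fixed, hypothetical value here; the paper learns it.
    """
    a1 = softmax(scores_1)
    a2 = softmax(scores_2)
    return [x - lam * y for x, y in zip(a1, a2)]

def signal_to_noise(weights):
    """Weight on the match (index 0) over total |weight| on all other tokens."""
    noise = sum(abs(w) for w in weights[1:])
    return weights[0] / noise

if __name__ == "__main__":
    # Plain softmax: the match keeps losing mass as the context grows.
    for n in (100, 10_000, 100_000):
        w = match_weight(n)
        print(f"n={n:>7}  match={w:.6f}  noise={1 - w:.6f}")

    # Toy differential map: the second map (all-zero scores) models pure noise.
    n = 100_000
    signal_scores = [5.0] + [0.0] * (n - 1)
    noise_scores = [0.0] * n
    plain = softmax(signal_scores)
    diff = diff_attention(signal_scores, noise_scores)
    print("plain signal/noise:", signal_to_noise(plain))
    print("diff  signal/noise:", signal_to_noise(diff))
```

In this toy case the subtraction cancels most of the near-uniform noise floor while leaving the match weight almost unchanged, so the signal-to-noise ratio rises. How well that carries over to a real model depends on the second map actually learning the noise pattern, which this sketch simply assumes.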