AI From Scratch/Lesson 16/~60 minutes

Differential Attention (V2)

Softmax attention spreads a small amount of probability over every non-matching token. Over 100k tokens that noise adds up and drowns the signal. Differential Transformer (Ye et al., ICLR 2025) fixes it by computing attention as the differ...
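The noise argument above can be checked with a few lines of stdlib Python. This is a minimal sketch, not the lesson's own code: the scores (5.0 for the matching token, 0.0 for every other token), the weaker second score map, and `lam = 0.8` are all made-up illustrative values, and the helper names are hypothetical. The subtraction step assumes the truncated summary refers to taking the difference of two softmax attention maps, as the Differential Transformer paper describes.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def noise_mass(n_tokens, match_score=5.0):
    """Probability mass on non-matching tokens when token 0 is the only match.

    Assumption: every non-matching token scores 0.0 (illustrative only).
    """
    scores = [match_score] + [0.0] * (n_tokens - 1)
    return 1.0 - softmax(scores)[0]

def differential(scores_a, scores_b, lam):
    """Difference of two softmax maps: softmax(a) - lam * softmax(b)."""
    a, b = softmax(scores_a), softmax(scores_b)
    return [x - lam * y for x, y in zip(a, b)]

def signal_to_noise(weights):
    """Weight on the matching token relative to one non-matching token."""
    return weights[0] / abs(weights[1])

# Noise mass grows with context length even though each distractor gets little.
for size in (10, 1_000, 100_000):
    print(size, round(noise_mass(size), 4))

# Hypothetical second map: same background scores, weaker response to the match.
n = 100_000
plain = softmax([5.0] + [0.0] * (n - 1))
diff = differential([5.0] + [0.0] * (n - 1), [1.0] + [0.0] * (n - 1), lam=0.8)

print("plain signal/noise:", round(signal_to_noise(plain), 1))
print("diff  signal/noise:", round(signal_to_noise(diff), 1))
```

In this toy setup the background weight that both maps share mostly cancels in the subtraction, so the matching token stands out more against each distractor. In a trained model the two maps and the scaling factor are learned, so the numbers here only illustrate the mechanism.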

Build / Python (stdlib) / No prerequisites
