AI From Scratch/Lesson 33/~90 minutes

Multi-Head Self-Attention

One linear projection, three views, H parallel heads, one mask. The attention block as the model actually uses it.
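As a rough illustration of that summary, the sketch below runs one fused projection, slices it into the three views Q, K and V, splits each view into H heads, and applies a single mask shared by all heads. It is a minimal pure-Python sketch, not the lesson's own code. The causal (lower-triangular) form of the mask, the output projection `w_out`, the random weights, and every name used here are assumptions made for the example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    """x: T x D, w_qkv: D x 3D (fused Q/K/V weights), w_out: D x D.

    Returns the T x D output and the per-head T x T attention weights.
    Assumes a causal mask: position i attends only to positions j <= i.
    """
    t, d = len(x), len(x[0])
    assert d % num_heads == 0, "model width must divide evenly across heads"
    d_head = d // num_heads
    scale = 1.0 / math.sqrt(d_head)

    # One linear projection, then three views of its result.
    qkv = matmul(x, w_qkv)
    q = [row[:d] for row in qkv]
    k = [row[d:2 * d] for row in qkv]
    v = [row[2 * d:] for row in qkv]

    head_outputs, head_weights = [], []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        qh = [row[lo:hi] for row in q]
        kh = [row[lo:hi] for row in k]
        vh = [row[lo:hi] for row in v]

        # Scaled dot-product scores, T x T.
        scores = matmul(qh, [list(col) for col in zip(*kh)])
        # The same mask is applied to every head.
        masked = [
            [scores[i][j] * scale if j <= i else -math.inf for j in range(t)]
            for i in range(t)
        ]
        weights = [softmax(row) for row in masked]
        head_outputs.append(matmul(weights, vh))
        head_weights.append(weights)

    # Concatenate the heads along the feature axis, then mix them.
    concat = [
        [val for h in range(num_heads) for val in head_outputs[h][i]]
        for i in range(t)
    ]
    return matmul(concat, w_out), head_weights


# Tiny demo with random (untrained) weights.
random.seed(0)
seq_len, d_model, num_heads = 4, 8, 2
x = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]
w_qkv = [[random.gauss(0, 0.1) for _ in range(3 * d_model)] for _ in range(d_model)]
w_out = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_model)]
out, attn = multi_head_self_attention(x, w_qkv, w_out, num_heads)
```

Here `out` has one row of width `d_model` per token, and `attn[h][i]` is the distribution that head `h` assigns from position `i` over positions `0..i`. A real implementation would typically batch the heads into a single tensor operation rather than loop over them; the loop is kept only to make the per-head split visible.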

Build · Python
