AI From Scratch/Lesson 59/~90 minutes
Vision Transformer Encoder
Patches alone do not see. A 12-layer pre-LN transformer with 12 attention heads turns the sequence of patch tokens into a sequence of contextual tokens, with the CLS token pooling whole-image features in its final hidden state. This lesson...
Build/Python/No prerequisites
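As a rough picture of the encoder summarized above, here is a minimal NumPy sketch of a 12-layer, 12-head pre-LN stack that maps a sequence of tokens to contextual tokens and reads the CLS token from the final hidden state. Only the depth, head count, pre-LN ordering, and CLS pooling come from the lesson summary. Everything else is an assumption made for illustration: the toy width of 48, the 4x MLP expansion, the GELU activation, the omission of biases and learned LayerNorm parameters, the final LayerNorm, and the CLS token sitting at index 0. The function names (`init_layer`, `encoder_layer`, `vit_encoder`) are hypothetical and the weights are random, so this shows data flow rather than a trained model.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature axis (no learned scale/shift in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU (assumed activation).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_layer(rng, dim, mlp_dim, scale=0.02):
    # Random weights for one layer; biases are omitted to keep the sketch short.
    return {
        "w_qkv": rng.normal(0.0, scale, (dim, 3 * dim)),
        "w_out": rng.normal(0.0, scale, (dim, dim)),
        "w_fc1": rng.normal(0.0, scale, (dim, mlp_dim)),
        "w_fc2": rng.normal(0.0, scale, (mlp_dim, dim)),
    }

def attention(x, p, num_heads):
    # Multi-head self-attention over a single sequence of shape (n, dim).
    n, dim = x.shape
    head_dim = dim // num_heads
    q, k, v = np.split(x @ p["w_qkv"], 3, axis=-1)  # each (n, dim)

    def split_heads(t):
        return t.reshape(n, num_heads, head_dim).transpose(1, 0, 2)  # (heads, n, head_dim)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    out = softmax(scores) @ v                               # (heads, n, head_dim)
    out = out.transpose(1, 0, 2).reshape(n, dim)            # merge heads back
    return out @ p["w_out"]

def encoder_layer(x, p, num_heads):
    # Pre-LN: normalize before each sublayer, then add the residual.
    x = x + attention(layer_norm(x), p, num_heads)
    x = x + gelu(layer_norm(x) @ p["w_fc1"]) @ p["w_fc2"]
    return x

def vit_encoder(tokens, layers, num_heads):
    x = tokens
    for p in layers:
        x = encoder_layer(x, p, num_heads)
    x = layer_norm(x)   # final norm (assumed, common in pre-LN stacks)
    return x, x[0]      # contextual tokens, and the CLS token (assumed at index 0)

rng = np.random.default_rng(0)
depth, num_heads = 12, 12          # from the lesson summary
dim, mlp_dim = 48, 192             # toy sizes, assumed
num_patches = 16                   # e.g. a 4x4 patch grid, assumed

layers = [init_layer(rng, dim, mlp_dim) for _ in range(depth)]
tokens = rng.normal(size=(1 + num_patches, dim))  # CLS token followed by patch tokens
hidden, cls = vit_encoder(tokens, layers, num_heads)
print(hidden.shape, cls.shape)
```

With these toy sizes the script prints `(17, 48) (48,)`: one contextual vector per input token, plus the CLS vector that a classification head would consume.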