AI From Scratch / Lesson 59 / ~90 minutes

Vision Transformer Encoder

Patches alone do not see. A 12-layer pre-LN transformer with 12 attention heads turns the sequence of patch tokens into a sequence of contextual tokens, with the CLS token pooling whole-image features in its final hidden state. This lesson...

Build · Python · No prerequisites
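
To make the description above concrete, here is a minimal NumPy sketch of a pre-LN encoder: each block normalizes before attention and before the MLP, then adds the result back to the residual stream, and the CLS token's final hidden state is read off as the pooled image feature. The 12 layers and 12 heads come from the lesson summary; everything else is an assumption made for illustration, including the toy width of 48, the 4x MLP expansion, the tanh GELU approximation, the final layer norm, the random weights, and all function names. Position embeddings, biases, and the learned layer-norm scale and shift are left out to keep it short.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature axis (no learned scale/shift here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_block(rng, dim, mlp_dim, scale=0.02):
    # Hypothetical parameter layout: one fused QKV matrix, one output
    # projection, and a two-layer MLP.
    return {
        "w_qkv": rng.normal(0.0, scale, (dim, 3 * dim)),
        "w_out": rng.normal(0.0, scale, (dim, dim)),
        "w_fc1": rng.normal(0.0, scale, (dim, mlp_dim)),
        "w_fc2": rng.normal(0.0, scale, (mlp_dim, dim)),
    }

def attention(x, p, heads):
    n, dim = x.shape
    head_dim = dim // heads
    q, k, v = np.split(x @ p["w_qkv"], 3, axis=-1)  # each (n, dim)
    # Split features into heads: (n, dim) -> (heads, n, head_dim)
    q, k, v = (t.reshape(n, heads, head_dim).transpose(1, 0, 2) for t in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, n, n)
    out = softmax(scores) @ v                               # (heads, n, head_dim)
    out = out.transpose(1, 0, 2).reshape(n, dim)            # merge heads
    return out @ p["w_out"]

def block(x, p, heads):
    # Pre-LN: normalize first, then add the sublayer output to the residual.
    x = x + attention(layer_norm(x), p, heads)
    x = x + gelu(layer_norm(x) @ p["w_fc1"]) @ p["w_fc2"]
    return x

def encode(patch_tokens, cls_token, blocks, heads):
    # Prepend the CLS token so it can attend to every patch token.
    x = np.concatenate([cls_token[None, :], patch_tokens], axis=0)
    for p in blocks:
        x = block(x, p, heads)
    x = layer_norm(x)  # final norm (an assumption of this sketch)
    return x, x[0]     # contextual tokens, CLS feature

rng = np.random.default_rng(0)
DIM, HEADS, LAYERS, NUM_PATCHES = 48, 12, 12, 16  # toy sizes except heads/layers
blocks = [init_block(rng, DIM, 4 * DIM) for _ in range(LAYERS)]
patch_tokens = rng.normal(size=(NUM_PATCHES, DIM))  # stand-in for embedded patches
cls_token = np.zeros(DIM)                           # would be learned in practice

tokens, cls_out = encode(patch_tokens, cls_token, blocks, HEADS)
print(tokens.shape, cls_out.shape)  # (17, 48) (48,)
```

The output keeps one row per input token plus one for CLS, so 16 patch tokens become 17 contextual tokens. Because the CLS row starts as zeros here, everything in `cls_out` arrives through attention over the patch tokens, which is the pooling behavior the summary describes.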
