Loading lesson page...
AI From Scratch/Lesson 01/~120 minutes
Vision Transformers and the Patch-Token Primitive
Before anything multimodal, an image has to become a sequence of tokens a transformer can eat. The 2020 ViT paper answered this with 16x16 pixel patches, a linear projection, and a position embedding. Five years later every 2026 frontier m...
LearnPython (stdlibpatch tokenizer + geometry calculator)