Phase 12: Multimodal AI
AI From Scratch/Lesson 01/~120 minutes

Vision Transformers and the Patch-Token Primitive

Before anything multimodal, an image has to become a sequence of tokens a transformer can eat. The 2020 ViT paper answered this with 16x16 pixel patches, a linear projection, and a position embedding. Five years later every 2026 frontier m...

LearnPython (stdlibpatch tokenizer + geometry calculator)
Loading lesson page...