Phase 12: Multimodal AI
AI From Scratch/Lesson 11/~180 minutes

Chameleon and Early-Fusion Token-Only Multimodal Models

Every VLM we have seen so far keeps images and text separate. Visual tokens come from a vision encoder, flow into a projector, then meet text inside the LLM. The vision and text vocabularies never overlap. Chameleon (Meta, May 2024) asked:...
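To make the separation described above concrete, here is a minimal sketch of the adapter-style pipeline: vision encoder, then projector, then concatenation with text embeddings inside the LLM's input. All sizes, the toy patch-based `vision_encoder`, and the random weights are assumptions chosen for illustration; they are not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
TEXT_VOCAB = 1000   # number of entries in the text vocabulary
D_VISION = 64       # width of the vision encoder's output features
D_MODEL = 128       # width of the LLM's embedding space
N_PATCHES = 16      # visual tokens produced per image

# Random stand-ins for learned weights.
W_proj = rng.normal(size=(D_VISION, D_MODEL))            # projector
text_embedding = rng.normal(size=(TEXT_VOCAB, D_MODEL))  # text embedding table


def vision_encoder(image):
    """Toy stand-in for a vision encoder.

    Splits a 32x32 image into a 4x4 grid of 8x8 patches and flattens each
    patch, giving an array of shape (N_PATCHES, D_VISION).
    """
    patches = image.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3)
    return patches.reshape(N_PATCHES, D_VISION)


def build_llm_input(image, text_ids):
    """Assemble the sequence the LLM sees in an adapter-style VLM."""
    # Visual path: continuous features, projected into the LLM's space.
    # These rows never receive an ID from the text vocabulary.
    visual_tokens = vision_encoder(image) @ W_proj
    # Text path: integer IDs looked up in the text embedding table.
    text_tokens = text_embedding[text_ids]
    # The two modalities first meet here, as rows of one sequence.
    return np.concatenate([visual_tokens, text_tokens], axis=0)


image = rng.normal(size=(32, 32))
text_ids = np.array([5, 42, 7])
sequence = build_llm_input(image, text_ids)
print(sequence.shape)
```

In this sketch only the text path goes through a vocabulary lookup; the image path is a chain of continuous transforms, which is the separation the paragraph above describes.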
