Phase 12: Multimodal AI
AI From Scratch / Lesson 02 / ~180 minutes

CLIP and Contrastive Vision-Language Pretraining

OpenAI's CLIP (2021) proved a single idea big enough to power the next five years: align an image encoder and a text encoder in the same vector space using only noisy web image-caption pairs and a contrastive loss. Zero supervised labels....
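To make the contrastive objective concrete, here is a minimal sketch of a symmetric InfoNCE loss over a batch of paired embeddings, written with only the standard library. The function names (`l2_normalize`, `log_softmax`, `clip_infonce_loss`) are hypothetical, and the fixed `temperature=0.07` is an assumption for illustration; CLIP itself learns the temperature during training. The embeddings are plain lists of floats standing in for encoder outputs.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def clip_infonce_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: image i should match text i and no other text in the batch.

    temperature is fixed here as an illustrative assumption.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    # logits[i][j] = cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]

    # Image-to-text direction: each row is a classification over the batch's texts.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: each column is a classification over the batch's images.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", clip_infonce_loss(images, matched))
    print("swapped pairs:", clip_infonce_loss(images, swapped))
```

Correctly paired embeddings should give a loss near zero, while swapped pairs give a large loss. No labels appear anywhere: the only supervision is which caption came with which image.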

Build: Python (stdlib), InfoNCE + sigmoid loss implementations
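For the sigmoid loss mentioned above, a minimal stdlib sketch might look like the following. It treats every image-text pair in the batch as an independent binary classification (matching or not) instead of a softmax over the batch. The names `log_sigmoid` and `sigmoid_pair_loss` are hypothetical, and the `scale=10.0` and `bias=-10.0` defaults are illustrative assumptions; in practice both would typically be learned parameters.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pair_loss(image_embs, text_embs, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss: label +1 for matching pairs (i == j), -1 otherwise.

    scale and bias are fixed here as illustrative assumptions.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)

    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            sim = sum(a * b for a, b in zip(img, txt))
            label = 1.0 if i == j else -1.0
            # Each pair contributes its own binary log-likelihood term.
            total -= log_sigmoid(label * (scale * sim + bias))

    # Sum over all n*n pairs, averaged over the batch size.
    return total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print("matched pairs:", sigmoid_pair_loss(images, matched))
    print("swapped pairs:", sigmoid_pair_loss(images, swapped))
```

Because each pair is scored on its own, this loss needs no normalization across the batch, which is the main design difference from the softmax-based InfoNCE objective.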