Loading lesson page...
AI From Scratch/Lesson 62/~90 minutes
Vision-Language Pretraining
The encoder, projection, and decoder are wired. Now train them together. Two objectives drive learning: a contrastive image-text loss (InfoNCE) that pulls matching pairs together in the joint embedding space, and a language modeling loss t...
BuildPythonNo prerequisites