AI From Scratch/Lesson 62/~90 minutes

Vision-Language Pretraining

The encoder, projection, and decoder are wired. Now train them together. Two objectives drive learning: a contrastive image-text loss (InfoNCE) that pulls matching pairs together in the joint embedding space, and a language modeling loss t...

BuildPythonNo prerequisites

Loading lesson page...