AI From Scratch/Lesson 60/~90 minutes

Projection Layer for Modality Alignment

A vision encoder produces image tokens. A text decoder consumes text tokens. The two live in different vector spaces. A small two-layer MLP projects image tokens into the text embedding space, and a cosine alignment loss against a paired c...
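The projection and loss described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the lesson's own implementation: the dimensions (8, 16, 4), the GELU activation, the helper names (`make_linear`, `project`, `cosine_alignment_loss`), and the random stand-in data are all hypothetical. The summary is cut off before it names what the image tokens are paired with, so the targets here are assumed to be vectors that already live in the text embedding space. The sketch covers only the forward pass and the loss, with no gradient step.

```python
import math
import random


def make_linear(n_in, n_out, rng):
    """Create a linear layer as (weights, bias); weights has n_out rows of n_in values."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b to a single vector x."""
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def gelu(v):
    """Elementwise GELU (tanh approximation); the activation choice is an assumption."""
    c = math.sqrt(2.0 / math.pi)
    return [0.5 * t * (1.0 + math.tanh(c * (t + 0.044715 * t ** 3))) for t in v]


def project(image_token, fc1, fc2):
    """Two-layer MLP: image space -> hidden -> text embedding space."""
    return linear(gelu(linear(image_token, fc1)), fc2)


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between a and b, guarded against zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def cosine_alignment_loss(projected, targets):
    """Mean of (1 - cosine similarity) over pairs: 0 when every pair points the same way."""
    losses = [1.0 - cosine_similarity(p, t) for p, t in zip(projected, targets)]
    return sum(losses) / len(losses)


# Hypothetical sizes: 8-dim image tokens, 16 hidden units, 4-dim text embeddings.
IMAGE_DIM, HIDDEN_DIM, TEXT_DIM = 8, 16, 4
rng = random.Random(0)

fc1 = make_linear(IMAGE_DIM, HIDDEN_DIM, rng)
fc2 = make_linear(HIDDEN_DIM, TEXT_DIM, rng)

# Random stand-ins for encoder outputs and their paired text-space targets.
image_tokens = [[rng.gauss(0.0, 1.0) for _ in range(IMAGE_DIM)] for _ in range(3)]
text_targets = [[rng.gauss(0.0, 1.0) for _ in range(TEXT_DIM)] for _ in range(3)]

projected = [project(tok, fc1, fc2) for tok in image_tokens]
loss = cosine_alignment_loss(projected, text_targets)
print(f"alignment loss before any training: {loss:.4f}")
```

Because the loss depends only on direction, it ignores the scale of the projected vectors; a training loop would adjust `fc1` and `fc2` to push the loss toward 0.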

Build/Python/No prerequisites
