AI From Scratch/Lesson 60/~90 minutes
Projection Layer for Modality Alignment
A vision encoder produces image tokens. A text decoder consumes text tokens. The two live in different vector spaces. A small two-layer MLP projects image tokens into the text embedding space, and a cosine alignment loss against a paired c...
Build · Python · No prerequisites
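The description above names two pieces: a two-layer MLP that maps image tokens into the text embedding space, and a cosine alignment loss. The following is a minimal sketch of the forward pass and the loss only, in plain Python with no training loop. The dimensions, the GELU activation, the random initialization, and the helper names (`make_linear`, `project`, `cosine_loss`) are all assumptions made for illustration. The source description is truncated, so the paired target is represented here simply as a vector in the text embedding space.

```python
import math
import random

def make_linear(n_in, n_out, rng):
    """Return (weights, bias); weights[j][i] maps input i to output j."""
    scale = 1.0 / math.sqrt(n_in)
    w = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b

def linear(x, layer):
    """Affine map y = W x + b for a single vector x."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bj for row, bj in zip(w, b)]

def gelu(v):
    """Tanh approximation of GELU (the activation choice is an assumption)."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def project(image_token, fc1, fc2):
    """Two-layer MLP: vision dim -> hidden dim -> text embedding dim."""
    hidden = [gelu(h) for h in linear(image_token, fc1)]
    return linear(hidden, fc2)

def cosine_loss(pred, target, eps=1e-8):
    """1 - cosine similarity: near 0 when aligned, near 2 when opposite."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm_p * norm_t + eps)

# Hypothetical sizes, chosen small so the sketch runs instantly.
VISION_DIM, HIDDEN_DIM, TEXT_DIM = 8, 16, 4
rng = random.Random(0)
fc1 = make_linear(VISION_DIM, HIDDEN_DIM, rng)
fc2 = make_linear(HIDDEN_DIM, TEXT_DIM, rng)

# Stand-ins for vision encoder outputs and their paired text-side targets.
image_tokens = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(3)]
text_targets = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(3)]

projected = [project(tok, fc1, fc2) for tok in image_tokens]
loss = sum(cosine_loss(p, t) for p, t in zip(projected, text_targets)) / len(projected)
print(f"projected: {len(projected)} tokens of dim {len(projected[0])}, mean loss {loss:.4f}")
```

Because the loss depends only on the angle between the projected token and its target, it ignores vector magnitude. That makes it a reasonable choice when the two encoders produce embeddings at different scales, though it leaves the norm of the projected tokens unconstrained.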