AI From Scratch/Lesson 61/~90 minutes
Cross-Attention Fusion
The projection layer aligns one image vector with one caption vector. A real vision-language decoder needs every text token to attend to every patch token, so the model can ground each word in a region. Cross-attention is how that groundin...
Build / Python / No prerequisites
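To make the summary concrete, here is a minimal sketch of single-head cross-attention in plain Python, where each text token forms a query and each image patch supplies a key and a value. It is an illustration under stated assumptions, not the lesson's own implementation: the names `cross_attention`, `matmul`, `softmax`, and `random_matrix` are hypothetical, the weights are random rather than trained, and the output projection, residual connection, and multi-head split that a full decoder would likely use are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_tokens, patch_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: text tokens query, image patches answer.

    text_tokens:  (T, d_model) rows, one per text token
    patch_tokens: (P, d_model) rows, one per image patch
    Returns (fused, weights) with shapes (T, d_head) and (T, P).
    """
    q = matmul(text_tokens, w_q)    # queries come from the text side
    k = matmul(patch_tokens, w_k)   # keys come from the image side
    v = matmul(patch_tokens, w_v)   # values come from the image side

    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # (T, P): every token scored against every patch

    # Each row is one text token's distribution over all patches.
    weights = [softmax([s * scale for s in row]) for row in scores]
    fused = matmul(weights, v)        # weighted mix of patch values per text token
    return fused, weights


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small Gaussian entries standing in for trained weights."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(cols) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for illustration only: 3 text tokens, 4 patches.
rng = random.Random(0)
d_model, d_head = 8, 4
text_tokens = random_matrix(3, d_model, rng)
patch_tokens = random_matrix(4, d_model, rng)
w_q = random_matrix(d_model, d_head, rng)
w_k = random_matrix(d_model, d_head, rng)
w_v = random_matrix(d_model, d_head, rng)

fused, weights = cross_attention(text_tokens, patch_tokens, w_q, w_k, w_v)
print(len(weights), len(weights[0]))  # one row per text token, one column per patch
```

Row `i` of `weights` shows how strongly text token `i` attends to each patch, which is the sense in which a word is grounded in a region. With trained projections those rows would concentrate on relevant patches; with the random weights used here they carry no meaning beyond demonstrating the shapes.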