AI From Scratch/Lesson 61/~90 minutes

Cross-Attention Fusion

The projection layer aligns one image vector with one caption vector. A real vision-language decoder needs every text token to attend to every patch token, so the model can ground each word in a region. Cross-attention is how that groundin...

BuildPythonNo prerequisites

Loading lesson page...