AI From Scratch/Lesson 03/~180 minutes
From CLIP to BLIP-2 — Q-Former as Modality Bridge
CLIP aligns image and text but cannot generate captions, answer questions, or hold a conversation. BLIP-2 (Salesforce, 2023) solved that with a small trainable bridge: 32 learnable query vectors attend over a frozen ViT's features via cros...
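The bridging idea can be sketched in a few lines of NumPy: a fixed set of 32 query vectors reads from a larger set of frozen image features through cross-attention. This is a minimal single-head sketch, not BLIP-2's implementation. The patch count, feature widths, random weights, and the helper name `cross_attend` are all illustrative assumptions, and the real Q-Former's other components are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, image_feats, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper).

    Queries come from the learnable query vectors; keys and values
    come from the image features.
    """
    q = queries @ w_q        # (num_queries, d_attn)
    k = image_feats @ w_k    # (num_patches, d_attn)
    v = image_feats @ w_v    # (num_patches, d_attn)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (num_queries, num_patches)
    weights = softmax(scores, axis=-1)        # each query's weights sum to 1
    return weights @ v, weights


rng = np.random.default_rng(0)

num_queries = 32  # the query count named in the text
# Everything below is an assumed, illustrative size, not BLIP-2's real config.
num_patches, d_img, d_query, d_attn = 257, 64, 48, 16

# Stand-in for the frozen ViT's output; in training this encoder gets no updates.
image_feats = rng.normal(size=(num_patches, d_img))

# The trainable parts of the bridge: the queries and the projections.
queries = rng.normal(size=(num_queries, d_query))
w_q = rng.normal(size=(d_query, d_attn)) / np.sqrt(d_query)
w_k = rng.normal(size=(d_img, d_attn)) / np.sqrt(d_img)
w_v = rng.normal(size=(d_img, d_attn)) / np.sqrt(d_img)

out, weights = cross_attend(queries, image_feats, w_q, w_k, w_v)
print(out.shape)      # one output vector per query
print(weights.shape)  # one attention distribution over patches per query
```

The point of the sketch is the shape change: however many patch features the image encoder emits, the output always has exactly one vector per query, so whatever consumes it sees a fixed-length sequence.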
Build