Phase 12: Multimodal AI
AI From Scratch/Lesson 03/~180 minutes

From CLIP to BLIP-2 — Q-Former as Modality Bridge

CLIP aligns image and text but cannot generate captions, answer questions, or hold a conversation. BLIP-2 (Salesforce, 2023) solved that with a small trainable bridge: 32 learnable query vectors attend over a frozen ViT's features via cros...
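To make the bridge idea concrete, here is a minimal sketch of a small set of learnable queries attending over frozen image features, written as single-head cross-attention in plain Python. Only the count of 32 queries comes from the text above. The patch count, the feature widths, the random stand-in for ViT output, and the helper names (`matmul`, `softmax`, `rand_matrix`, `cross_attention`) are illustrative assumptions, not BLIP-2's actual configuration or code.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng, scale=0.1):
    """Random Gaussian matrix, standing in for learned or precomputed values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def cross_attention(queries, image_feats, w_q, w_k, w_v):
    """Queries come from the bridge; keys and values come from the image features."""
    q = matmul(queries, w_q)          # (num_queries, d)
    k = matmul(image_feats, w_k)      # (num_patches, d)
    v = matmul(image_feats, w_v)      # (num_patches, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]   # (d, num_patches)
    scores = matmul(q, k_t)                # (num_queries, num_patches)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights

rng = random.Random(0)
NUM_QUERIES = 32   # from the text above
NUM_PATCHES = 50   # hypothetical: number of ViT patch tokens
VIT_DIM = 24       # hypothetical: width of the frozen ViT features
QUERY_DIM = 16     # hypothetical: width of the query vectors

# Stand-in for the frozen ViT output; in training these values are never updated.
image_feats = rand_matrix(NUM_PATCHES, VIT_DIM, rng, scale=1.0)

# Trainable pieces of the bridge: the queries and the projection matrices.
queries = rand_matrix(NUM_QUERIES, QUERY_DIM, rng)
w_q = rand_matrix(QUERY_DIM, QUERY_DIM, rng)
w_k = rand_matrix(VIT_DIM, QUERY_DIM, rng)
w_v = rand_matrix(VIT_DIM, QUERY_DIM, rng)

out, weights = cross_attention(queries, image_feats, w_q, w_k, w_v)
print(len(out), len(out[0]))  # one output vector per query
```

Whatever the number of image patches, the output has one vector per query, so the downstream model always receives a fixed-length summary of the image. Each row of `weights` shows how one query distributes its attention across the patches.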
