Phase 12: Multimodal AI
AI From Scratch / Lesson 05 / ~180 minutes

LLaVA and Visual Instruction Tuning

LLaVA (April 2023) is among the most widely copied multimodal architectures. It replaced BLIP-2's Q-Former with a simple projector (a single linear layer in the original release, a 2-layer MLP from LLaVA-1.5 onward), replaced Flamingo's gated cross-attention with naive token concatenation, and trained on 158k visual-instruction...
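The two design choices named above, an MLP projector and plain token concatenation, can be sketched in a few lines of dependency-free Python. This is a toy illustration rather than the LLaVA codebase. The dimensions are tiny, the weights are random, the GELU between the two layers follows the LLaVA-1.5 style, and every name (`project_patches`, `build_input_sequence`, `D_VISION`, `D_LLM`) is hypothetical. The patch features stand in for the output of a frozen vision encoder, which is not modeled here.

```python
import math
import random


def linear(x, w, b):
    """Compute y = x @ w + b for one vector x; w has shape d_in x d_out."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


def gelu(v):
    """Exact GELU applied elementwise."""
    return [0.5 * t * (1.0 + math.erf(t / math.sqrt(2.0))) for t in v]


def init_linear(rng, d_in, d_out):
    """Random small weights and zero bias (illustrative only, not trained)."""
    w = [[rng.gauss(0.0, 0.02) for _ in range(d_out)] for _ in range(d_in)]
    b = [0.0] * d_out
    return w, b


def project_patches(patches, layer1, layer2):
    """2-layer MLP projector: vision dim -> LLM dim -> LLM dim, per patch."""
    return [linear(gelu(linear(p, *layer1)), *layer2) for p in patches]


def build_input_sequence(text_before, visual_tokens, text_after):
    """Naive concatenation: visual tokens sit in the sequence like text tokens."""
    return text_before + visual_tokens + text_after


rng = random.Random(0)
D_VISION, D_LLM = 8, 16   # toy sizes; real models use far larger dimensions
N_PATCHES = 4             # toy patch count

# Stand-in for vision-encoder patch features (assumption: random values).
patches = [[rng.gauss(0.0, 1.0) for _ in range(D_VISION)]
           for _ in range(N_PATCHES)]

layer1 = init_linear(rng, D_VISION, D_LLM)
layer2 = init_linear(rng, D_LLM, D_LLM)
visual_tokens = project_patches(patches, layer1, layer2)

# Stand-ins for text-token embeddings before and after the image slot.
prompt_prefix = [[0.0] * D_LLM for _ in range(3)]
prompt_suffix = [[0.0] * D_LLM for _ in range(5)]

sequence = build_input_sequence(prompt_prefix, visual_tokens, prompt_suffix)
print(len(sequence), len(sequence[0]))  # prints: 12 16
```

The point of the sketch is that nothing inside the language model changes. The projector only maps each patch feature into the LLM's embedding width, and the resulting vectors are spliced into the input sequence, so ordinary self-attention handles text and image tokens together.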

Build · No prerequisites