AI From Scratch / Lesson 05 / ~180 minutes
LLaVA and Visual Instruction Tuning
LLaVA (April 2023) is one of the most widely copied multimodal architectures. It replaced BLIP-2's Q-Former with a simple projection (a single linear layer in the original release, a 2-layer MLP from LLaVA-1.5 onward), replaced Flamingo's gated cross-attention with naive token concatenation, and trained on 158k visual-instruction...
Build / No prerequisites
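The two design choices named in the summary above, a 2-layer MLP projector and naive token concatenation, can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not LLaVA's actual code: the dimensions are toy values, the random arrays stand in for a frozen vision encoder's patch features and the language model's text embeddings, and the helper names `project_patches` and `build_llm_input` are hypothetical.

```python
import numpy as np


def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def project_patches(patch_feats, w1, b1, w2, b2):
    """2-layer MLP mapping vision features to the LLM embedding width."""
    hidden = gelu(patch_feats @ w1 + b1)
    return hidden @ w2 + b2


def build_llm_input(visual_tokens, text_embeds):
    """Naive concatenation: projected image tokens first, then text embeddings."""
    return np.concatenate([visual_tokens, text_embeds], axis=0)


rng = np.random.default_rng(0)

# Toy sizes chosen for illustration only (assumptions, not LLaVA's real dimensions).
num_patches, vision_dim, llm_dim, num_text = 16, 32, 64, 5

# Stand-ins for the frozen vision encoder output and the LLM's token embeddings.
patch_feats = rng.normal(size=(num_patches, vision_dim))
text_embeds = rng.normal(size=(num_text, llm_dim))

# Projector parameters: in this sketch they are the only new weights introduced.
w1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b1 = np.zeros(llm_dim)
w2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
b2 = np.zeros(llm_dim)

visual_tokens = project_patches(patch_feats, w1, b1, w2, b2)
llm_input = build_llm_input(visual_tokens, text_embeds)

print(llm_input.shape)  # (21, 64): 16 image tokens followed by 5 text tokens
```

The point of the sketch is that the language model needs no architectural change: it sees one longer sequence in which the first rows happen to come from an image.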