Phase 12: Multimodal AI
AI From Scratch/Lesson 08/~180 minutes

LLaVA-OneVision: Single-Image, Multi-Image, Video in One Model

Before LLaVA-OneVision (Li et al., August 2024), the open-VLM world had separate lineages: LLaVA-1.5 for single images, multi-image models like Mantis and VILA, and video models like Video-LLaVA and Video-LLaMA. Each won its benchmark and faile...

Build/Python (stdlib token budget solver + curriculum planner)/No prerequisites