Phase 12: Multimodal AI
AI From Scratch/Lesson 17/~180 minutes

Video-Language Models: Temporal Tokens and Grounding

Video is not a stack of photos. A 5-second clip has causal ordering, action verbs, and event timing that an image model cannot represent. Video-LLaMA (Zhang et al., June 2023) shipped the first open video-LLM with audio-visual grounding. V...

Build / No prerequisites