Phase 12: Multimodal AI
AI From Scratch/Lesson 18/~180 minutes

Long-Video Understanding at Million-Token Context

A 1-hour 4K video at 24 FPS, patched and embedded, produces on the order of 60 million tokens. By contrast, the transcript of a 2-hour podcast episode is roughly 30,000 tokens. A full Blu-ray feature film, even compressed with aggressive pooling, is hundreds of thou...
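As a rough sanity check on these figures, the arithmetic can be sketched in a few lines of Python. The function names are hypothetical, and the tokens-per-frame value is inferred from the 60 million estimate above rather than stated by the lesson; real encoders vary widely with patch size and pooling.

```python
HOUR_S = 3600  # seconds in one hour


def frame_count(duration_s: float, fps: float) -> int:
    """Number of frames in a clip of the given duration and frame rate."""
    return round(duration_s * fps)


def implied_tokens_per_frame(total_tokens: float, duration_s: float, fps: float) -> float:
    """Tokens per frame implied by a total token budget (an inferred value)."""
    return total_tokens / frame_count(duration_s, fps)


def seconds_that_fit(context_tokens: float, tokens_per_frame: float, fps: float) -> float:
    """Seconds of video that fit in a context window if every frame is kept."""
    return context_tokens / tokens_per_frame / fps


# Figures taken from the paragraph above.
frames = frame_count(HOUR_S, 24)                           # 86,400 frames in one hour
per_frame = implied_tokens_per_frame(60e6, HOUR_S, 24)     # about 694 tokens per frame
podcast_rate = 30_000 / (2 * HOUR_S)                       # about 4.2 transcript tokens per second

# Assumption: a 1,000,000-token context window, as in the lesson title.
video_seconds = seconds_that_fit(1_000_000, per_frame, 24)  # about 60 seconds

print(f"{frames} frames, {per_frame:.0f} tokens/frame")
print(f"1M tokens holds about {video_seconds:.0f} s of video")
print(f"podcast transcript: {podcast_rate:.2f} tokens/s")
```

Under these assumptions, a million-token window holds only about one minute of the unpooled video, while the same window would hold many hours of transcript. That gap is why long-video models rely on frame sampling and token compression rather than encoding every frame.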
