Phase 12: Multimodal AI
AI From Scratch/Lesson 12/~120 minutes

Emu3: Next-Token Prediction for Image and Video Generation

BAAI's Emu3 (Wang et al., September 2024) is the 2024 result that should have ended the diffusion-versus-autoregressive debate. A single Llama-style decoder-only transformer, trained only on the next-token-prediction objective, across a un...

LearnPython (stdlib3D video tokenizer math + autoregressive sampler skeleton)
Loading lesson page...