AI From Scratch/Lesson 18/~60 minutes
Multi-Token Prediction (MTP)
Every autoregressive LLM from GPT-2 to Llama 3 trains on one loss per position: predict the next token. DeepSeek-V3 added a second loss per position: predict the token after that one. The extra 14B parameters (on a 671B model) got distilled...
Build/Python (stdlib)/No prerequisites
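To make the "second loss per position" idea concrete, here is a minimal stdlib-only sketch. It assumes the model has already produced two probability distributions at each position, one for the next token and one for the token after that. The function names, the argument layout, and the weight `lam` are illustrative assumptions, not DeepSeek-V3's actual code or settings.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of `target` under one probability distribution."""
    return -math.log(probs[target])


def mtp_loss(tokens, next_probs, next2_probs, lam=0.3):
    """Combine the usual next-token loss with a loss on the token after that.

    tokens:      list of token ids, length T (T >= 3)
    next_probs:  next_probs[i] is the predicted distribution for tokens[i + 1]
    next2_probs: next2_probs[i] is the predicted distribution for tokens[i + 2]
    lam:         weight on the extra loss (hypothetical value, for illustration)
    """
    T = len(tokens)
    # Positions 0..T-2 have a next token to predict.
    main_terms = [cross_entropy(next_probs[i], tokens[i + 1]) for i in range(T - 1)]
    # Positions 0..T-3 have a token two steps ahead to predict.
    extra_terms = [cross_entropy(next2_probs[i], tokens[i + 2]) for i in range(T - 2)]
    main = sum(main_terms) / len(main_terms)
    extra = sum(extra_terms) / len(extra_terms)
    return main + lam * extra, main, extra


# Toy usage: a 3-token vocabulary and a "model" that always predicts uniformly.
tokens = [0, 1, 2, 1]
uniform = [1 / 3, 1 / 3, 1 / 3]
total, main, extra = mtp_loss(tokens, [uniform] * 4, [uniform] * 4)
print(total, main, extra)
```

With uniform predictions over three tokens, both loss terms equal ln 3, so the combined loss is (1 + `lam`) times ln 3. The point of the sketch is only the bookkeeping: the second loss reuses the same positions but shifts the target by one more step, so it has one fewer valid position than the first.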