AI From Scratch/Lesson 18/~60 minutes

Multi-Token Prediction (MTP)

Every autoregressive LLM from GPT-2 to Llama 3 trains on one loss per position: predict the next token. DeepSeek-V3 added a second loss per position: predict the token after that. The extra 14B of parameters (on a 671B model) got distilled...
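The two losses per position described above can be sketched in plain Python. This is a minimal illustration, not DeepSeek-V3's implementation: it assumes the model has already produced two probability distributions at each position (one for the next token, one for the token after that) and ignores how the second one is computed. The function names `cross_entropy` and `mtp_loss` and the weight `lam` are hypothetical, chosen for this example.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability list."""
    return -math.log(probs[target])

def mtp_loss(next_probs, after_probs, tokens, lam=0.3):
    """Combine the next-token loss with a token-after-next loss.

    next_probs[i]  : distribution predicted at position i for tokens[i + 1]
    after_probs[i] : distribution predicted at position i for tokens[i + 2]
    lam            : illustrative weight on the extra loss (an assumption)
    """
    n = len(tokens)
    # Standard objective: position i is scored against the token at i + 1.
    main = [cross_entropy(next_probs[i], tokens[i + 1]) for i in range(n - 1)]
    # Extra objective: position i is also scored against the token at i + 2.
    extra = [cross_entropy(after_probs[i], tokens[i + 2]) for i in range(n - 2)]
    main_loss = sum(main) / len(main)
    extra_loss = sum(extra) / len(extra)
    return main_loss, extra_loss, main_loss + lam * extra_loss

# Toy data: a 3-token vocabulary and a model that predicts uniformly everywhere.
vocab = 3
tokens = [0, 1, 2, 0]
uniform = [1.0 / vocab] * vocab
next_probs = [uniform] * len(tokens)
after_probs = [uniform] * len(tokens)

main_loss, extra_loss, total = mtp_loss(next_probs, after_probs, tokens)
print(main_loss, extra_loss, total)
```

With a uniform model both losses equal log(3), so the total is (1 + lam) times log(3). Note that the extra loss has one fewer term than the main loss, because the last position that has a next token has no token after it.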

Build/Python (stdlib)/No prerequisites
