AI From Scratch/Lesson 11/~45 minutes
Mixture of Experts (MoE)
A dense 70B transformer activates every parameter for every token. A 671B MoE activates only about 37B per token, roughly 5.5% of its weights, and can still match or beat the dense model on many benchmarks. Sparsity is the most important scaling idea of the decade.
Build/Python
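To make the "only a fraction of the parameters are active per token" idea concrete, here is a minimal sketch of top-k expert routing in plain Python. Everything in it is an illustrative assumption rather than the lesson's own implementation: the function names (`top_k_route`, `moe_forward`), the toy experts that just scale their input, the hand-picked router weights, and the choice of k = 2 out of 4 experts. Real MoE layers use learned feed-forward experts and usually add load-balancing terms, which are omitted here.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are the softmax probabilities of the chosen experts,
    renormalized so they sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_weights, k):
    """Run one token vector x through a toy MoE layer.

    Only the k selected experts are evaluated; the rest are skipped,
    which is where the compute saving comes from.
    """
    # One router logit per expert: dot product of that expert's row with x.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # evaluate only the chosen expert
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, [idx for idx, _ in routes]

# Hypothetical toy experts: expert i multiplies its input by (i + 1).
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(4)]

# Hypothetical hand-picked router weights (4 experts, input dimension 2).
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

routes = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
out, used = moe_forward([1.0, 0.5], experts, router_weights, k=2)

# Share of parameters active per token for the figures quoted in the summary.
active_fraction = 37e9 / 671e9

print("routes:", routes)
print("experts used:", used)
print(f"active fraction: {active_fraction:.1%}")
```

In this sketch two of the four experts run for each token, so half of the expert compute is skipped. The quoted 37B-of-671B configuration is far sparser, at roughly 5.5% of the weights per token.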