Phase 10: LLMs from Scratch
AI From Scratch/Lesson 10/~90 minutes

Evaluation: Benchmarks, Evals, LM Harness

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Every frontier lab games benchmarks. MMLU scores go up while models still can't reliably count the number of R's in "strawberry." The only eval that matters i...

BuildPythonNo prerequisites
Loading lesson page...