Loading lesson page...
AI From Scratch/Lesson 10/~90 minutes
Evaluation: Benchmarks, Evals, LM Harness
Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Every frontier lab games benchmarks. MMLU scores go up while models still can't reliably count the number of R's in "strawberry." The only eval that matters i...
BuildPythonNo prerequisites