AI From Scratch / Lesson 10 / ~90 minutes

Evaluation: Benchmarks, Evals, LM Harness

Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. Every frontier lab games benchmarks. MMLU scores go up while models still can't reliably count the number of R's in "strawberry." The only eval that matters i...

Build / Python / No prerequisites
