AI From Scratch/Lesson 19/~60 minutes
Benchmarks: SWE-bench, GAIA, AgentBench
Three benchmarks anchor agent evaluation in 2026. SWE-bench tests code patching. GAIA tests generalist tool use. AgentBench tests multi-environment reasoning. For each one, know what it is composed of, how contamination affects its scores, and what it does not measure.
Learn / Python (stdlib)
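As a rough orientation for the three benchmarks named above, each one ultimately reports something like a per-task success rate. The sketch below shows that bookkeeping in stdlib Python. The `TaskResult` record, the `success_rates` helper, the task IDs, and the pass/fail values are all hypothetical placeholders invented for illustration. They are not real benchmark data, and the official harnesses for each benchmark define their own formats and metrics.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One agent attempt at one task (hypothetical record shape)."""
    benchmark: str  # e.g. "SWE-bench", "GAIA", "AgentBench"
    task_id: str    # placeholder identifier, not a real task ID
    passed: bool    # whether the attempt counted as a success


def success_rates(results):
    """Return the fraction of passed tasks for each benchmark name."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.benchmark] += 1
        passes[r.benchmark] += int(r.passed)
    return {name: passes[name] / totals[name] for name in totals}


# Invented placeholder outcomes, used only to exercise the helper.
results = [
    TaskResult("SWE-bench", "task-1", True),
    TaskResult("SWE-bench", "task-2", False),
    TaskResult("GAIA", "task-1", True),
    TaskResult("AgentBench", "task-1", False),
]

rates = success_rates(results)
for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

A single aggregate number like this hides exactly the things the lesson asks about: which tasks make up the set, whether any of them leaked into training data, and which capabilities are never exercised.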