AI From Scratch/Lesson 19/~60 minutes

Benchmarks: SWE-bench, GAIA, AgentBench

Three benchmarks anchor agent evaluation in 2026. SWE-bench tests code patching. GAIA tests generalist tool use. AgentBench tests multi-environment reasoning. Know their composition, their contamination story, and what they do not measure.
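All three benchmarks ultimately report some fraction of tasks completed, so a useful first step is tallying per-benchmark success rates from a run log. The sketch below is a minimal illustration using only the standard library. The record layout, the task ids, and the outcomes are all hypothetical, and it does not reproduce any benchmark's official scoring harness.

```python
from collections import defaultdict

# Hypothetical per-task outcomes: (benchmark name, task id, solved?).
# Task ids and outcomes are invented for illustration only; they are
# not real benchmark data or real results.
RESULTS = [
    ("SWE-bench", "task-1", True),
    ("SWE-bench", "task-2", False),
    ("GAIA", "task-1", True),
    ("GAIA", "task-2", True),
    ("AgentBench", "task-1", False),
    ("AgentBench", "task-2", True),
    ("AgentBench", "task-3", False),
]


def success_rates(results):
    """Return {benchmark: fraction of its tasks marked solved}."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for benchmark, _task_id, ok in results:
        total[benchmark] += 1
        solved[benchmark] += int(ok)
    return {name: solved[name] / total[name] for name in total}


if __name__ == "__main__":
    for name, rate in success_rates(RESULTS).items():
        print(f"{name}: {rate:.0%}")
```

A single headline rate like this hides both composition and contamination effects, which is why the per-benchmark breakdown is kept separate rather than averaged into one number.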

Learn/Python (stdlib)
