17:["$","article",null,{"className":"min-w-0 min-h-0","children":["$","$L19",null,{"sections":[{"key":"overview","title":"Overview","markdown":"> A coding agent is only as good as the suite of tasks you measure it against. This lesson builds an evaluation harness that takes a folder of fixture tasks, runs each through a candidate agent, scores pass or fail through a deterministic verifier, and aggregates the results into pass@1, pass@k, mean latency, and mean cost. The harness is the source of truth that lets you tell a regression from a refactor."},{"key":"objectives","title":"Learning Objectives","markdown":"- Define a fixture task as a triple of goal, setup, and verifier.\n- Score multiple sample runs per task and compute pass@1 and pass@k.\n- Aggregate latency and cost into mean and 95th-percentile metrics.\n- Wire deterministic verifiers (file diff, exit code, regex match) into reusable functions.\n- Emit a structured JSON report a regression-tracking script can ingest."},{"key":"problem","title":"The Problem","markdown":"Three failure modes plague agent benchmarks built without an eval harness.\n\nThe first is unverified pass. The agent says it fixed the bug, the human glances at the diff, the suite is marked green, and three weeks later the regression test surfaces the same bug. The agent had reasoned plausibly without actually fixing anything.\n\nThe second is undetected regression. A change to the prompt template makes the agent 4% better on the loud task and 14% worse on the quiet one. Without a goldset and a per-task score, the regression rides into main and surfaces only when a customer complains.\n\nThe third is per-task drift. The eval was run on Monday with 100 tasks and on Friday with 95 of them, because somebody renamed five fixtures. The pass rate looks like a 5% improvement. It isn't.\n\nThe harness is the program that turns these failures into facts. 
It runs every fixture, every time, in a reproducible order, against a verifier that returns true or false on a deterministic check."},{"key":"concept","title":"The Concept","markdown":"$1a"},{"key":"architecture","title":"Architecture","markdown":"```mermaid\nflowchart TD\n Harness[EvalHarness] -->|load| Task[FixtureTask<br/>goal / setup / verifier]\n Harness --> Loop[for each task:<br/>prepare scratch dir from setup<br/>for sample in range k:<br/>run candidate task, scratch_dir -> SampleResult<br/>verify sample, task -> bool<br/>record per-task aggregate]\n Loop --> TaskReport[TaskReport<br/>task_id / k / passes / pass_rate<br/>mean_latency / mean_cost]\n TaskReport -->|aggregate| EvalReport[EvalReport<br/>total tasks / pass@1 / pass@k / p95 latency]\n```\n\nThe candidate is a callable: `Callable[[FixtureTask, str], SampleResult]`. The harness creates the scratch directory via `tempfile.mkdtemp()` and passes its path as a plain string. The harness does not care how the candidate works. The candidate could be a deterministic patch applier (useful for harness self-tests), a real LLM agent, or a fuzzer. The contract is the SampleResult."},{"key":"what-you-will-build","title":"What you will build","markdown":"`main.py` ships:\n\n1. `FixtureTask` dataclass.\n2. `SampleResult` dataclass: success_self_reported, latency_ms, cost_units, edits.\n3. `TaskReport`, `EvalReport` dataclasses with `to_dict()`.\n4. `VerifierRegistry` mapping verifier name to function. Built-in verifiers: file_equals, regex_match, shell_exit_zero.\n5. `EvalHarness` class. Runs a directory of tasks against a candidate. Returns EvalReport.\n6. Five fixture tasks bundled in `tasks/`:\n - off-by-one in `fizzbuzz`\n - missing return in `factorial`\n - typo in error message\n - empty function body\n - off-by-one in linked-list traversal\n7. A deterministic reference candidate (`apply_known_fixes`) the harness uses to demonstrate a clean pass@1 of 1.0.\n8. A demo that prints the EvalReport JSON and exits with code zero.\n\nThe fixture tasks are bundled as JSON files in `tasks/` plus paired source files in `tasks//buggy/` and `tasks//expected/`. The harness copies buggy into a scratch dir, hands it to the candidate, and verifies against expected."},{"key":"why-pass-k-and-not-just-pass-1","title":"Why pass@k and not just pass@1","markdown":"Real LLM agents are stochastic. A pass@1 of 0.6 looks like a failure. A pass@5 of 0.95 says the agent gets the right answer most of the time but is choosing wrong on early samples. The fix is often sampling and ranking rather than more training.
Pass@k makes that visible.\n\nPass@1 is reported alongside pass@k because pass@k on its own papers over a real failure: if the model gets the right answer only once in twenty tries, you do not have a useful agent. The harness shows both."},{"key":"how-this-composes-with-the-rest-of-track-a","title":"How this composes with the rest of Track A","markdown":"Lesson 25 produced the gate chain. Lesson 26 produced the sandbox. The harness uses the sandbox for any `shell_exit_zero` verifier. Lesson 28 wraps each harness run in an OTel trace. Lesson 29 runs the end-to-end demo against one of the bundled fixtures and asserts pass@1 = 1.0 for the reference candidate."},{"key":"running-it","title":"Running it","markdown":"```bash\ncd phases/19-capstone-projects/27-eval-harness-fixture-tasks\npython3 code/main.py\npython3 -m pytest code/tests/ -v\n```\n\nThe demo prints the EvalReport in JSON, including pass@1, pass@5, mean latency, and a per-task breakdown. The exit code is zero. The tests cover the verifier functions, the pass@k math, fixture loading, and the harness end-to-end against the bundled reference candidate."}],"quiz":[{"id":"pre-1","stage":"pre","question":"Before reading the lesson: which property does a good agent eval verifier need above all others?","options":["Speed in the millisecond range.","Determinism across runs.","A natural-language explanation of the verdict.","Coverage of every line of the candidate's code."],"correct":1,"explanation":"Determinism is what makes a regression detectable. A non-deterministic verifier turns every red into a coin flip."},{"id":"check-2","stage":"check","question":"A candidate scores 3 passes in 5 samples on a single task. What value does the lesson's pass@5 formula return (rounded to two decimals)?","options":["0.60","0.99","0.75","1.00"],"correct":1,"explanation":"pass@k = 1 - (1 - p)^k = 1 - (1 - 0.6)^5 = 1 - 0.4^5 = 1 - 0.01024 = 0.98976, which rounds to 0.99.
The raw per-sample rate (0.6) is the value option (a) names, but the formula amplifies it across k samples."},{"id":"check-3","stage":"check","question":"Why does the lesson include both pass@1 and pass@k in the report?","options":["pass@1 is the value users see in the dashboard; pass@k is for internal debugging.","pass@k can hide a model that is right only one in many samples, so pass@1 anchors the result to first-attempt quality.","pass@k and pass@1 are mathematically identical for k > 1.","pass@1 is the version supported by the verifier registry; pass@k is reserved for future verifiers."],"correct":1,"explanation":"pass@k makes models that get the answer once in many tries look strong; pass@1 is the first-attempt floor. Reporting both prevents over-claiming."},{"id":"check-4","stage":"check","question":"The harness calls _prepare_scratch before every sample. Why is that important for pass@k > 1?","options":["It avoids race conditions on shared temp directories.","It ensures each sample starts from the buggy setup, not the previous sample's output.","It is purely a performance optimisation.","It is only relevant for shell verifiers."],"correct":1,"explanation":"Without a fresh scratch, sample N+1 starts from sample N's output. The harness would measure 'second-attempt cleanup' rather than first-attempt success."},{"id":"post-5","stage":"post","question":"You add a new verifier 'pytest_passes' that runs pytest in the scratch dir. Which existing verifier should you base it on?","options":["file_equals, because pytest writes a results file.","regex_match, because pytest output contains the word 'passed'.","shell_exit_zero, because pytest signals pass via exit code 0.","It is impossible to express pytest as a verifier."],"correct":2,"explanation":"pytest exits 0 on pass and non-zero on failure. shell_exit_zero is the right primitive. 
Production wiring routes the call through the sandbox from lesson 26."},{"id":"post-6","stage":"post","question":"A reviewer asks: 'is this regression in the model or in the harness?' Which artifact answers that question?","options":["The agent's chain-of-thought transcript.","The per-task, per-sample SampleResult plus the verifier detail message.","The verifier registry source code.","The fixture JSON files."],"correct":1,"explanation":"Per-sample latency, verifier detail, and pass/fail let you triage: same fixtures, different verdicts, so the model changed; same verdicts but different latency, so something else moved."}],"sourcePath":"phases/19-capstone-projects/27-eval-harness-fixture-tasks"}]}]
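The pass@k arithmetic quoted in the quiz explanation, pass@k = 1 - (1 - p)^k with p the per-sample pass rate, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code shipped in `main.py`: the function name `pass_at_k` and its parameters are hypothetical, and samples are treated as independent.

```python
def pass_at_k(passes: int, samples: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    Hypothetical helper (not taken from the lesson's main.py). It applies
    the formula from the quiz explanation, 1 - (1 - p)^k, where p is the
    observed per-sample pass rate passes / samples.
    """
    if samples <= 0:
        raise ValueError("samples must be positive")
    if not 0 <= passes <= samples:
        raise ValueError("passes must be between 0 and samples")
    if k <= 0:
        raise ValueError("k must be positive")
    p = passes / samples          # per-sample pass rate; this is also pass@1
    return 1.0 - (1.0 - p) ** k   # probability that not all k samples fail


if __name__ == "__main__":
    # The quiz scenario: 3 passes in 5 samples, evaluated at k = 1 and k = 5.
    print(round(pass_at_k(3, 5, 1), 2))
    print(round(pass_at_k(3, 5, 5), 2))
```

With k = 1 the function returns the raw per-sample rate, which is why reporting pass@1 next to pass@k keeps a weak first-attempt rate from hiding behind a large k.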