How-to
How to write evals for an AI coding agent
You handed an agent your repo and it opened a pull request. How do you know the change is good — not just that the agent said it fixed the issue? Evals are how. For coding agents the answer is unusually clean: the tests are the spec, and an eval is a task the agent must make pass.
Step 1 — Start from real failures, not imagined ones
The single highest-leverage move is error analysis: look at what your agent actually gets wrong. Pull 20–50 real tasks from your bug tracker, recent pull requests, and the agent's own transcripts, then categorize the failures. Those failures are your first eval set: the spec is discovered by reading outputs, not invented at a whiteboard.
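As a rough illustration of the categorizing step, the sketch below tallies hand-labeled failures so the most common failure mode surfaces first. The task IDs and category names are hypothetical; real categories should come out of reading the transcripts, not from a fixed taxonomy.

```python
from collections import Counter

# Hypothetical hand-labeled failures: (task_id, failure_category).
# Both the IDs and the category names are made up for illustration.
labeled_failures = [
    ("bug-101", "wrong_file_edited"),
    ("bug-102", "broke_existing_test"),
    ("pr-230", "wrong_file_edited"),
    ("bug-117", "incomplete_fix"),
    ("pr-244", "wrong_file_edited"),
]


def rank_failure_categories(failures):
    """Return (category, count) pairs, most frequent first."""
    counts = Counter(category for _, category in failures)
    return counts.most_common()


ranked = rank_failure_categories(labeled_failures)
for category, count in ranked:
    print(f"{category}: {count}")
```

The top categories are a reasonable place to draw the first batch of eval tasks from.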
Step 2 — Make each task executable
An agent eval is a task with three parts: a starting repo state, an instruction, and a grader made of tests. Borrow the pattern that the SWE-bench family standardized: the fix's tests must pass (fail_to_pass) and the existing tests must stay green (pass_to_pass). Run it in a container so the result is reproducible.

    task: "fix-pagination-off-by-one"
    repo_state: "your-repo at the commit before the fix"
    instruction: |
      Users report the last item on each page is missing.
      Fix the pagination bug in src/list.py.
    grade:
      fail_to_pass:   # must PASS after the change (proves the fix)
        - tests/test_list.py::test_last_item_visible
      pass_to_pass:   # must STILL pass (no regressions)
        - tests/test_list.py::test_first_page
        - tests/test_list.py::test_empty_list
    trials: 5         # run 5x; report pass^5, not a lucky single run

That is the whole idea: apply the agent's diff, run the prescribed tests, mark the task resolved only if the fix tests pass and nothing regressed. It is unit testing pointed at a patch instead of a function.
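The resolve rule described above can be sketched in a few lines of Python. This is an illustration, not a real harness: `is_resolved` and the shape of `test_results` (a dict from test ID to pass/fail) are assumptions, and a real setup would fill that dict by running the tests inside the container.

```python
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    """Return True only if every fix test and every regression test passed.

    test_results maps a test ID to True (passed) or False (failed).
    A test missing from the results counts as a failure.
    """
    fix_ok = all(test_results.get(t, False) for t in fail_to_pass)
    no_regressions = all(test_results.get(t, False) for t in pass_to_pass)
    return fix_ok and no_regressions


fail_to_pass = ["tests/test_list.py::test_last_item_visible"]
pass_to_pass = [
    "tests/test_list.py::test_first_page",
    "tests/test_list.py::test_empty_list",
]

# Hypothetical results after applying the agent's diff and running the tests.
good_run = {
    "tests/test_list.py::test_last_item_visible": True,
    "tests/test_list.py::test_first_page": True,
    "tests/test_list.py::test_empty_list": True,
}
# Same run, except the patch broke an existing test.
regressed_run = {**good_run, "tests/test_list.py::test_empty_list": False}

print(is_resolved(good_run, fail_to_pass, pass_to_pass))       # True
print(is_resolved(regressed_run, fail_to_pass, pass_to_pass))  # False
```

Treating a missing test as a failure is a deliberate choice here: a patch that deletes or skips a test should not count as passing it.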
Step 3 — Grade the outcome, not the path
Resist the urge to assert an exact sequence of tool calls. Agents routinely find valid approaches you didn't anticipate, so step-by-step matching produces brittle evals that fail on good work. Grade what the agent produced — the tests pass, the final state is correct — and reserve trajectory checks for cases where the process genuinely matters (e.g. "must not touch the payments module"). For long tasks, allow partial credit: an agent that localizes the bug but botches the fix is further along than one that flails immediately, and your eval should be able to see that.
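One possible way to express partial credit is as a fraction of ordered, outcome-based milestones. The sketch below is an assumption about how that might look, not a standard scoring scheme; the milestone names are hypothetical, and each one is checked against what the agent produced, not against the sequence of tool calls it made.

```python
def partial_credit(milestones):
    """Score a run as the fraction of ordered milestones reached in sequence.

    milestones is a list of (name, reached) pairs in the order a successful
    run would hit them. Counting stops at the first miss, so a later
    milestone cannot mask an earlier failure.
    """
    if not milestones:
        return 0.0
    reached = 0
    for _, ok in milestones:
        if not ok:
            break
        reached += 1
    return reached / len(milestones)


# Hypothetical milestones for a bug-fix task.
localized_but_botched = [
    ("diff_touches_buggy_file", True),
    ("fail_to_pass_green", False),
    ("pass_to_pass_green", True),
]
flailed = [
    ("diff_touches_buggy_file", False),
    ("fail_to_pass_green", False),
    ("pass_to_pass_green", True),
]

print(partial_credit(localized_but_botched))
print(partial_credit(flailed))
```

The run that localized the bug scores above the one that never found it, even though neither is resolved.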
Step 4 — Measure reliability, not a lucky run
Agents are non-deterministic, so a single green run is weak evidence. Run each task several times and report pass^k — the probability it passes every time — not just pass@k, the probability it passes at least once. A 70%-reliable agent looks like ~97% at pass@3 and ~34% at pass^3; for anything you'd let run unattended, the second number is the one that predicts your review burden.
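The two numbers above follow from simple arithmetic, assuming each trial is independent and passes with the same probability p. The function names below are illustrative.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent trials passes."""
    return 1 - (1 - p) ** k


def pass_hat_k(p, k):
    """Probability that all k independent trials pass (pass^k)."""
    return p ** k


def estimate_pass_rate(trial_outcomes):
    """Estimate the per-trial pass probability from observed True/False outcomes."""
    return sum(trial_outcomes) / len(trial_outcomes)


p = 0.7
print(f"pass@3 = {pass_at_k(p, 3):.1%}")   # 97.3%
print(f"pass^3 = {pass_hat_k(p, 3):.1%}")  # 34.3%
```

With only a handful of trials per task, the estimate of p is itself noisy, so treat the resulting pass^k as approximate.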
Step 5 — Add a judge only for what code can't grade
Most of what matters about a code change is verifiable, so keep it on code-based graders. Where you genuinely can't grade with code ("is the PR description accurate?", "is this refactor readable?"), an LLM-as-judge can help, but treat it as code you have to test: pin its temperature, give it a concrete rubric, randomize the order of anything it compares, and validate its verdicts against your own labels. Let it gate anything only once the two agree consistently. Never use a judge for something a test could decide.
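Validating a judge against your own labels can be as simple as measuring how often the two agree. The sketch below computes raw agreement and, as an added technique not mentioned above, Cohen's kappa, which corrects for agreement expected by chance. The labels and the 0.9 threshold are hypothetical.

```python
def agreement_rate(judge, human):
    """Fraction of items where the judge's verdict matches the human label."""
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(human)


def cohens_kappa(judge, human):
    """Chance-corrected agreement between two label sequences."""
    n = len(human)
    observed = agreement_rate(judge, human)
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on the same eight PR descriptions.
human_labels = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

MIN_AGREEMENT = 0.9  # assumed threshold; pick one that fits your risk tolerance

rate = agreement_rate(judge_labels, human_labels)
kappa = cohens_kappa(judge_labels, human_labels)
judge_may_gate = rate >= MIN_AGREEMENT
print(f"agreement={rate:.2f} kappa={kappa:.2f} may_gate={judge_may_gate}")
```

Here the judge agrees on six of eight items, which falls short of the assumed threshold, so it stays advisory until the rubric or prompt is improved.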
Step 6 — Gate in CI, and split regression from capability
Wire the suite into the pipeline so it runs on every agent change and model upgrade. Keep two kinds of evals with opposite targets:
- Regression evals — things that already work; keep them near 100% pass and block any change that breaks them.
- Capability evals — harder tasks you can't pass yet; let them start low as bets on what's becoming possible, and watch the number climb.
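The split between the two kinds of evals might look like this in a CI script: only the regression suite can block, while the capability suite is reported and tracked. The threshold, task names, and function names are assumptions for illustration.

```python
REGRESSION_THRESHOLD = 1.0  # assumption: any regression failure blocks the change


def pass_rate(results):
    """Fraction of tasks that passed; results maps task ID -> bool.

    An empty suite scores 0.0 so that a misconfigured run cannot pass the gate.
    """
    return sum(results.values()) / len(results) if results else 0.0


def ci_gate(regression, capability, threshold=REGRESSION_THRESHOLD):
    """Return (exit_code, report). Only the regression suite can block."""
    regression_rate = pass_rate(regression)
    capability_rate = pass_rate(capability)
    blocked = regression_rate < threshold
    report = (
        f"regression: {regression_rate:.0%} (gate at {threshold:.0%}), "
        f"capability: {capability_rate:.0%} (tracked, not gated)"
    )
    return (1 if blocked else 0), report


# Hypothetical results for one agent change.
regression = {"fix-pagination-off-by-one": True, "rename-config-key": True}
capability = {"migrate-orm": False, "cross-service-refactor": True}

exit_code, report = ci_gate(regression, capability)
print(exit_code, report)
```

A CI job would pass the returned exit code to `sys.exit` so that a nonzero value fails the pipeline.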
Step 7 — Read the transcripts
The eval harness is the most common failure point — graders are wrong more often than you'd think. Real example from the field: a capable model scored 42% on a benchmark until the grading bugs were fixed, then jumped to 95%. Read the transcripts of passes and failures alike, and specifically watch for two traps:
- Solution leakage. If the answer is sitting in the issue text, you're grading reading comprehension, not engineering. Audits of public benchmarks found a large share of "solved" tasks had the fix in the prompt.
- Weak tests and reward hacking. If the tests are too loose, a plausible-but-wrong patch passes; if the agent can edit the tests or the environment, it will. Harden the grader and isolate the sandbox.
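One cheap guard against the second trap is to reject any patch that touches the grader itself. The sketch below assumes you already have the list of files the agent changed (for example from `git diff --name-only`); the protected patterns are hypothetical and should be adapted to your repo.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for files the agent must not modify.
# Note that fnmatch's "*" also matches "/" so "tests/*" covers subdirectories.
PROTECTED_PATTERNS = ["tests/*", "conftest.py", ".github/*", "Dockerfile"]


def touched_protected_paths(changed_files, patterns=PROTECTED_PATTERNS):
    """Return the changed files that match any protected pattern."""
    return [
        path
        for path in changed_files
        if any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


honest_patch = ["src/list.py"]
suspicious_patch = ["src/list.py", "tests/test_list.py"]

print(touched_protected_paths(honest_patch))      # []
print(touched_protected_paths(suspicious_patch))  # ['tests/test_list.py']
```

A non-empty result is a reason to fail the task or flag the transcript for review. It complements sandbox isolation; it does not replace it.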
Checklist
- 20–50 tasks from real failures, containerized and reproducible.
- Each task = repo state + instruction + fail_to_pass & pass_to_pass tests.
- Grade the outcome; allow partial credit; avoid exact-path matching.
- Run multiple trials; report pass^k reliability.
- LLM-judge only for the unverifiable, and validate it first.
- CI gate; regression evals ~100%, capability evals start low.
- Read transcripts; check for leakage, weak tests, and reward hacking.
Where this goes next
Once the harness exists, it becomes the substrate for everything else: regression evals that catch agent drift after a model upgrade, capability evals that tell you when to raise the agent's autonomy, and the safety property that lets an agent change your codebase without breaking it. That last one is the real payoff — evals are what make a codebase safely modifiable by AI.
See also: EDD vs TDD for how this relates to testing, and the codex (Parts III, IV, and VI) for the benchmarks, agent-eval methods, and evidence behind each step above.
Grounded in the EDD codex: Part III (execution-based grading, fail_to_pass/pass_to_pass, contamination), Part IV (agent trajectories, harness pitfalls, reliability), and Part VI (error analysis, CI gates, reading transcripts).