Free kit

The eval-driven development kit

Three copy-paste artifacts to start practising eval-driven development (EDD) today: a one-page checklist, a starter eval suite, and an LLM-as-judge rubric. Steal them, and fill them in before you let an agent run. Each is free to download, with no gate.

Newsletter

Get the kit and new essays by email

Grab the templates below, and get each new how-to, comparison, and checklist as it ships. No spam, unsubscribe anytime.

1 · The one-page checklist

A pre-flight check to run before you ship an AI feature or let an agent change code.

Build the eval set

Cases come from real failures (bug tracker, support queue, prod traces) — not imagined ones.
You did error analysis until no new failure type appeared; each recurring failure became a case.

Grade with the cheapest tool that fits

Verifiable things use code / execution graders (the tests are the spec).
Subjective quality uses an LLM-judge — only after it's validated against human labels.
No judge grades anything a deterministic check could decide.
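As a rough illustration of what a deterministic grader can look like, here is a minimal Python sketch of two code graders that mirror the starter suite's "valid, parseable JSON" and "no personal data is leaked" checks. The function names and the email-only regex are assumptions made for illustration; a real personal-data check would need to cover far more than email addresses.

```python
import json
import re

# Hypothetical pattern: catches only email-shaped strings, nothing else.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_json(output: str) -> bool:
    """Code grader: pass if the output parses as JSON."""
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False
    return True


def leaks_no_email(output: str) -> bool:
    """Code grader: pass if no email-shaped string appears in the output."""
    return EMAIL_PATTERN.search(output) is None


if __name__ == "__main__":
    print(is_valid_json('{"answer": 42}'))            # well-formed JSON
    print(is_valid_json("{answer: 42}"))              # unquoted key, not JSON
    print(leaks_no_email("Contact: jane@example.com"))
```

Checks like these are cheap and repeatable, which is why the checklist reserves the LLM-judge for what they cannot decide.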

Read it like a statistician, then gate it

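Reliability is measured with pass^k (every one of k trials passes), not the more flattering pass@k (at least one of k passes), using multiple samples per case.
Runs in CI on every change and model upgrade; regression evals, which should pass at or near 100%, block the build when they break, while capability evals are tracked as a trend.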
Online evals sample production; new failures feed back into the golden set.
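To make the pass^k versus pass@k distinction concrete, here is a small Python sketch. It assumes the commonly used definitions (pass@k: at least one of k sampled trials passes; pass^k: all k sampled trials pass) and the standard combinatorial estimators from n recorded trials with c passes. The function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k trials passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k trials pass (pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # One case, 5 trials, 4 of them passed, k = 3.
    print(pass_at_k(5, 4, 3))   # 1.0
    print(pass_hat_k(5, 4, 3))  # 0.4
```

The same case looks perfect under pass@3 and passes only 40% of the time under pass^3, which is the gap the checklist is warning about.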

Stay honest

Watch for contamination, saturation, gaming, and style-over-substance judges.
Green means "no known regressions," not "correct." Pair evals with held-out tests and real feedback.

2 · A starter eval suite

The shape that matters: dataset + criteria + graders, read statistically. Adapt it to your harness (see how to build one).

# Starter eval suite — tool-agnostic. Fill in and wire into CI.
suite: my-feature

dataset:
  # Build this from REAL failures and production traces, not imagined cases.
  cases:
    - id: case-001
      input: "…the real input that failed…"
      context: "…retrieved docs / state, if any…"
      criteria:
        - check: "output is valid, parseable JSON"
          grader: code              # cheapest, most reliable
        - check: "answer is faithful to the provided context"
          grader: llm-judge         # validate vs human labels first
        - check: "no personal data is leaked"
          grader: code

run:
  trials: 5                          # run each case repeatedly…
  report: pass^k                     # …and report reliability, not a lucky run

gate:
  regression: "100% pass"            # already-working behaviours: block on any break
  capability: track                  # harder bets — measure the trend, don't block

online:
  sample: 0.05                       # also score ~5% of live traffic
  feed_failures_back_into: dataset   # prod failures become golden-set cases
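One way the `run` and `gate` sections above might be enforced in CI is sketched below in Python. This is an illustration under assumptions, not part of the template: the `CaseResult` type, its `kind` field, and the `gate` function are hypothetical, and the suite itself does not say how cases are tagged as regression or capability.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    kind: str            # hypothetical tag: "regression" or "capability"
    trials: list[bool]   # one entry per repeated trial of the case


def gate(results: list[CaseResult]):
    """Return (build_ok, broken_regressions, capability_rate)."""
    # A regression case counts as passing only if every trial passed.
    broken = [
        r.case_id
        for r in results
        if r.kind == "regression" and not all(r.trials)
    ]
    # Capability cases are only measured, never used to block.
    capability = [r for r in results if r.kind == "capability"]
    rate = (
        sum(all(r.trials) for r in capability) / len(capability)
        if capability
        else None
    )
    return (not broken), broken, rate


if __name__ == "__main__":
    results = [
        CaseResult("case-001", "regression", [True] * 5),
        CaseResult("case-002", "regression", [True, True, False, True, True]),
        CaseResult("case-003", "capability", [True] * 5),
        CaseResult("case-004", "capability", [False] * 5),
    ]
    build_ok, broken, rate = gate(results)
    print(build_ok, broken, rate)  # False ['case-002'] 0.5
```

A CI step could exit non-zero when `build_ok` is false and log the capability rate so its trend can be charted over time.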

3 · An LLM-as-judge rubric

For the subjective quality a deterministic check can't decide. Pin the judge's temperature to 0, randomize the order in which candidate answers are shown, and validate the judge against human labels before you trust it.

ROLE
You are evaluating the output of <task>. First reason briefly, then give a verdict.
Do not reward longer answers or the answer shown first.

REFERENCE (when available)
<the known-good answer, or the source the output must be faithful to>

CRITERIA — judge each PASS or FAIL with a one-line reason:
1. <criterion> — PASS if <concrete, observable anchor>; otherwise FAIL.
2. <criterion> — PASS if <concrete, observable anchor>; otherwise FAIL.

OUTPUT
reasoning: <2–3 sentence critique>
verdict: <pass | fail>   # pass only if all required criteria pass

VALIDATE before automating: have one domain expert grade 30–50 outputs;
iterate this prompt until judge-vs-expert agreement is high (measured with an
agreement statistic such as kappa); then re-check periodically for drift.
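The validation step can be scored with a few lines of Python. The sketch below assumes "kappa" means Cohen's kappa for two raters (the expert and the judge) and uses made-up labels purely for illustration; the function name and the convention of returning 1.0 when both raters use a single identical label are assumptions.

```python
from collections import Counter


def cohens_kappa(expert: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(expert) != len(judge) or not expert:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(expert)
    observed = sum(e == j for e, j in zip(expert, judge)) / n
    expert_counts, judge_counts = Counter(expert), Counter(judge)
    labels = set(expert) | set(judge)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum(expert_counts[l] * judge_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # assumed convention: both raters always gave the same label
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Illustrative labels only, not real data.
    expert = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
    judge = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
    print(round(cohens_kappa(expert, judge), 3))  # 0.467
```

Here raw agreement is 75%, but kappa is about 0.47 once chance agreement is removed, which is why the rubric asks for kappa rather than a simple match rate.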

How to use the kit

Start with the definition if EDD is new to you. After that, the checklist is your map; the suite is what you wire into CI, following "how to write evals for a coding agent"; and the rubric covers the parts only a judge can grade (see "writing grading rubrics for agent behavior"). Assess where you stand with the maturity scorecard.

The reasoning behind every line here — and 130+ cited sources — is in the EDD codex.