Eval-Driven Development

Free kit

The eval-driven development kit

Three copy-paste artifacts to start practising EDD today: a one-page checklist, a starter eval suite, and an LLM-as-judge rubric. Steal them and fill them in before you let an agent run. Each is free to download, with no gate.

1 · The one-page checklist

A pre-flight before you ship an AI feature or let an agent change code. Download .md →

2 · A starter eval suite

The shape that matters — dataset + criteria + graders, read statistically. Adapt it to your harness (how to build one). Download .yaml →

# Starter eval suite — tool-agnostic. Fill in and wire into CI.
suite: my-feature

dataset:
  # Build this from REAL failures and production traces, not imagined cases.
  cases:
    - id: case-001
      input: "…the real input that failed…"
      context: "…retrieved docs / state, if any…"
      criteria:
        - check: "output is valid, parseable JSON"
          grader: code              # cheapest, most reliable
        - check: "answer is faithful to the provided context"
          grader: llm-judge         # validate vs human labels first
        - check: "no personal data is leaked"
          grader: code

run:
  trials: 5                          # run each case repeatedly…
  report: pass^k                     # …and report reliability, not a lucky run

gate:
  regression: ">= 100% pass"         # already-working behaviours — block on break
  capability: track                  # harder bets — measure the trend, don't block

online:
  sample: 0.05                       # also score ~5% of live traffic
  feed_failures_back_into: dataset   # prod failures become golden-set cases
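The `run` and `gate` sections above can be sketched in a few lines of Python. This is a minimal illustration, not the kit's own harness: the function names `pass_hat_k` and `blocking_cases` are hypothetical, and the estimator used for `pass^k` (the fraction of size-k subsets of the recorded trials in which every trial passed) is one common choice, assumed here rather than specified by the suite.

```python
from math import comb


def pass_hat_k(trials: list[bool], k: int) -> float:
    """Estimate pass^k: the chance that k trials of one case ALL pass.

    With n recorded trials of which c passed, this is C(c, k) / C(n, k).
    """
    n = len(trials)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of trials")
    c = sum(trials)
    # math.comb returns 0 when c < k, so a single failure gives 0.0 at k == n.
    return comb(c, k) / comb(n, k)


def blocking_cases(
    results: dict[str, list[bool]],
    regression_ids: set[str],
    k: int,
) -> list[str]:
    """Return the regression case ids whose pass^k is below 100%.

    Capability cases (ids not in regression_ids) are scored elsewhere
    and never block. A regression id missing from results raises KeyError.
    """
    return [
        case_id
        for case_id in sorted(regression_ids)
        if pass_hat_k(results[case_id], k) < 1.0
    ]


if __name__ == "__main__":
    # Hypothetical results: five trials per case, True means all criteria passed.
    results = {
        "case-001": [True, True, True, True, True],
        "case-002": [True, False, True, True, True],
    }
    failing = blocking_cases(results, regression_ids={"case-001"}, k=5)
    print("blocked by:", failing)
```

Treating only the regression set as blocking keeps the gate strict where behaviour already works, while harder capability cases can fluctuate without stopping a merge.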

3 · An LLM-as-judge rubric

For the subjective quality a deterministic check can't decide. Pin the judge's temperature to 0, randomize the order in which candidate answers are shown, and validate the judge against human labels before you trust it. Download .md →

ROLE
You are evaluating the output of <task>. First reason briefly, then give a verdict.
Do not reward longer answers or the answer shown first.

REFERENCE (when available)
<the known-good answer, or the source the output must be faithful to>

CRITERIA — judge each PASS or FAIL with a one-line reason:
1. <criterion> — PASS if <concrete, observable anchor>; otherwise FAIL.
2. <criterion> — PASS if <concrete, observable anchor>; otherwise FAIL.

OUTPUT
reasoning: <2–3 sentence critique>
verdict: <pass | fail>   # pass only if all required criteria pass

VALIDATE before automating: have one domain expert grade 30–50 outputs;
iterate this prompt until judge-vs-expert agreement is high (e.g. Cohen's kappa);
re-check periodically for drift.
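To make the VALIDATE step concrete, here is a small sketch of measuring judge-vs-expert agreement with Cohen's kappa, which corrects raw agreement for the agreement two graders would reach by chance. The function name `cohens_kappa` and the sample labels are illustrative assumptions; the rubric itself does not prescribe a specific statistic or threshold.

```python
from collections import Counter


def cohens_kappa(judge: list[str], expert: list[str]) -> float:
    """Cohen's kappa between two graders' labels on the same outputs."""
    if not judge or len(judge) != len(expert):
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)

    # Observed agreement: share of outputs where both graders gave the same label.
    observed = sum(j == e for j, e in zip(judge, expert)) / n

    # Chance agreement: for each label, the product of the two graders'
    # marginal rates of using it, summed over labels.
    judge_freq = Counter(judge)
    expert_freq = Counter(expert)
    expected = sum(judge_freq[label] * expert_freq[label] for label in judge_freq) / (n * n)

    if expected == 1.0:
        # Both graders used one identical label throughout; kappa is
        # undefined (0/0), so report perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts on the same four outputs.
    judge_labels = ["pass", "pass", "fail", "fail"]
    expert_labels = ["pass", "fail", "fail", "fail"]
    print(f"kappa = {cohens_kappa(judge_labels, expert_labels):.2f}")
```

Raw percent agreement can look high when almost every output passes; kappa discounts that, which is why it is a safer number to track as the prompt is iterated.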

How to use the kit

Start with the definition if EDD is new to you. The checklist is your map. The suite is what you wire into CI (see how to write evals for a coding agent). The rubric is for the parts only a judge can grade (see writing grading rubrics for agent behavior). To assess where you stand, use the maturity scorecard.

The reasoning behind every line here — and 130+ cited sources — is in the EDD codex.