# The one-page eval-driven development checklist

A pre-flight checklist to complete before you ship an AI agent or LLM feature. Print it and fill it in.

## Build the eval set
- [ ] Cases come from **real failures** (bug tracker, support queue, production traces) — not imagined ones.
- [ ] You did error analysis: read ~20–50 real outputs and categorized how they failed, until no new failure type appeared.
- [ ] Each recurring failure became a case in the suite.

## Choose graders (cheapest that fits)
- [ ] Verifiable things use **code / execution** graders (the tests are the spec).
- [ ] Subjective quality uses an **LLM-as-judge** — and only after it's validated against human labels.
- [ ] No judge is grading something a deterministic check could decide.
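
As an illustration of a check that needs no judge, here is a minimal sketch of a deterministic grader. The function name `grade_json_output` and the required keys are hypothetical, chosen only for the example; the point is that "is the output valid JSON with the expected fields?" is decided by code, not by a model.

```python
import json


def grade_json_output(output: str, required_keys: set[str]) -> bool:
    """Deterministic grader: pass only if `output` is a JSON object
    containing every key in `required_keys`."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    # Must be an object (dict), not a list or scalar, with all required keys.
    return isinstance(data, dict) and required_keys.issubset(data)


# Hypothetical model outputs, for illustration only.
print(grade_json_output('{"answer": "42", "sources": []}', {"answer", "sources"}))
print(grade_json_output("Sure! Here is the answer: 42", {"answer", "sources"}))
```

The same pattern extends to running a unit-test suite or a schema validator; a judge is only worth its cost for what such checks cannot decide.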

## Make the judge trustworthy (if you use one)
- [ ] Score anchors are concrete; binary pass/fail with a written critique is preferred over a 1–5 Likert scale.
- [ ] Reason-before-score; reference answer supplied when available.
- [ ] Position randomized (swap-and-average); length controlled; temperature pinned.
- [ ] Agreement with a human expert measured (precision/recall, Cohen's kappa) before automating.
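
One way to measure judge-versus-human agreement is sketched below in plain Python. It assumes binary labels where `True` means "pass" and treats the human labels as ground truth; the function name `judge_agreement` and the sample labels are made up for illustration.

```python
def judge_agreement(human: list[bool], judge: list[bool]) -> dict[str, float]:
    """Precision, recall, and Cohen's kappa of judge labels vs. human labels.
    Assumes True = pass and that the human labels are the reference."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    fp = sum((not h) and j for h, j in zip(human, judge))
    fn = sum(h and (not j) for h, j in zip(human, judge))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    p_observed = (tp + tn) / n
    # Chance agreement: both say pass by chance, plus both say fail by chance.
    p_expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    # Kappa is undefined when chance agreement is already total.
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else float("nan")
    return {"precision": precision, "recall": recall, "kappa": kappa}


human = [True, True, True, True, False, False, False, False]
judge = [True, True, True, False, False, False, False, True]
print(judge_agreement(human, judge))
# {'precision': 0.75, 'recall': 0.75, 'kappa': 0.5}
```

Raw agreement here is 75%, but kappa is only 0.5 because half of that agreement would be expected by chance on a balanced set. What counts as "good enough" agreement is a call for your team, not something this sketch decides.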

## Read the result like a statistician
- [ ] Reliability measured with **pass^k** (all k attempts pass), not the more flattering pass@k (at least one of k attempts passes).
- [ ] Multiple samples per case; error bars / a range across prompt formats reported.
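
To see how far the two metrics can diverge, here is a small sketch using the standard combinatorial estimators, assuming a case was sampled `n` times and passed `c` times. The function names are ours; pass@k estimates the chance that at least one of k draws passes, pass^k the chance that all k draws pass.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples passes,
    given c passes observed in n samples."""
    _check(n, c, k)
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated probability that all k samples pass (pass^k),
    given c passes observed in n samples."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


# A case that passes 8 out of 10 runs:
print(f"pass@3 = {pass_at_k(10, 8, 3):.3f}")   # pass@3 = 1.000
print(f"pass^3 = {pass_hat_k(10, 8, 3):.3f}")  # pass^3 = 0.467
```

The same 80% per-run success rate reads as "always works" under pass@3 and as "works less than half the time when it must succeed three times in a row" under pass^3.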

## Gate it
- [ ] Suite runs in **CI** on every change and every model upgrade.
- [ ] **Regression** evals kept near 100% pass and block the build.
- [ ] **Capability** evals start low and are tracked over time; they do not block the build.
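
The split between blocking and tracked evals could be wired up roughly as follows. This is a sketch under assumptions: the `EvalResult` record, the `kind` labels, and the 100% default threshold are illustrative, not a prescribed format.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    name: str
    kind: str      # "regression" or "capability" (labels assumed for this sketch)
    passed: bool


def pass_rate(results: list[EvalResult], kind: str) -> float:
    """Fraction of cases of the given kind that passed.
    An empty suite scores 0.0, so a missing regression set fails the gate
    instead of passing vacuously."""
    outcomes = [r.passed for r in results if r.kind == kind]
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def gate(results: list[EvalResult], regression_threshold: float = 1.0):
    """Return (ok, rates). Only the regression rate decides `ok`;
    the capability rate is reported for tracking."""
    rates = {
        "regression": pass_rate(results, "regression"),
        "capability": pass_rate(results, "capability"),
    }
    return rates["regression"] >= regression_threshold, rates


results = [
    EvalResult("refund-policy", "regression", True),
    EvalResult("date-parsing", "regression", True),
    EvalResult("multi-step-plan", "capability", False),
    EvalResult("long-context-qa", "capability", True),
]
ok, rates = gate(results)
print(ok, rates)  # True {'regression': 1.0, 'capability': 0.5}
```

In a CI job, the script would exit non-zero when `ok` is false (for example `sys.exit(0 if ok else 1)`), while the capability rate is logged to a dashboard.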

## Close the loop
- [ ] **Online** evals score a sample of production traffic.
- [ ] New production failures get added back into the golden set.
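
One possible way to pick the production sample is deterministic hashing of a trace identifier, sketched below. The `trace_id` field and the 10% rate are assumptions for the example; hashing instead of calling a random generator means the same trace is always in or out of the sample, which makes scores reproducible.

```python
import hashlib


def in_online_sample(trace_id: str, rate: float) -> bool:
    """Deterministically decide whether a trace is scored by online evals.
    `rate` is the target fraction of traffic, between 0.0 and 1.0."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical trace IDs, for illustration only.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if in_online_sample(t, 0.10)]
print(f"sampled {len(sampled)} of {len(trace_ids)} traces")
```

Sampled traces that fail their graders are the natural candidates to review and add back into the golden set.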

## Stay honest
- [ ] Watched for contamination, saturation, gaming/reward-hacking, and style-over-substance judges.
- [ ] Treat green as "no known regressions," not "correct." Pair evals with held-out tests and real feedback.

---
From evaldrivendevelopment.dev · the reasoning and 130+ cited sources are in the codex: https://evaldrivendevelopment.dev/codex