Eval-Driven Development

Eval-driven development vs. TDD and BDD

Eval-driven development is the third entry in a family of practices that all make the same move: write the check first, then build until it passes. Test-driven development put that check in code. Behavior-driven development made it readable and shared. Eval-driven development extends it to the one thing the first two assumed away — output that isn't deterministic.

The move all three share

TDD, BDD, and EDD are all forms of executable specification. In each, you state what "done" means as something you can run, before or as you build, and you let that artifact drive the work and guard against regressions. The differences are about who writes the check, in what language, and what kind of system it grades.

Test-driven development (TDD)

Kent Beck's TDD is the tight red-green-refactor loop: write a failing unit test, make it pass with the simplest change, refactor. The test is an executable spec written by a developer, at the level of a unit, and graded by exact match — the function returns 4 or it doesn't. It is fast, precise, and unambiguous, and it is the bedrock the other two build on.

Behavior-driven development (BDD)

Dan North's BDD grew out of TDD to answer two questions TDD left open: what should you test, and how do you keep the spec aligned with what the business actually wants. BDD shifts the vocabulary from "tests" to behavior, written in a shared, near-natural language so that developers, QA, and business stakeholders — the "three amigos" — agree on it together. Its signature form is specification by example: Given-When-Then scenarios (Gherkin, Cucumber) that are both human-readable and executable. BDD is outside-in (start from the behavior a user wants) where TDD is inside-out (start from the unit). Under the hood the assertions are still deterministic; what BDD adds is readability, a behavioral frame, and collaboration.

Eval-driven development (EDD)

EDD keeps the family's spine and extends it to AI-assisted and agent software, where the output is non-deterministic. You define evals — a dataset of real inputs, a success criterion, and a grader — and the AI iterates until they pass. The leap is that the thing under test can be probabilistic (the same input can pass once and fail the next run), and the criterion is often not exact-matchable. So EDD adds two things neither predecessor needed: graders that can be code, an LLM-as-judge, or a human, and results read as statistics rather than a single green tick.
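
To make the three parts of an eval concrete, here is a minimal sketch of one eval case with two code graders. Every name in it (EvalCase, Grader, Verdict, noRefundPromise, noEmailLeak, gradeCase) is hypothetical, and the regular expressions are illustrative assumptions rather than a complete policy or PII check. An LLM-judge or a human reviewer would sit behind the same Grader signature, which is not shown here.

```typescript
// Hypothetical types for illustration; not taken from any eval library.
type Verdict = { pass: boolean; reason: string };

// A grader takes the system's output and returns a verdict.
type Grader = (output: string) => Verdict;

interface EvalCase {
  id: string;
  input: string;                    // a real input drawn from the dataset
  graders: Record<string, Grader>;  // criterion name -> grader
}

// Code grader: the reply must not promise a refund.
// The pattern is an assumption and will miss many phrasings.
const noRefundPromise: Grader = (output) => {
  const promised = /\b(we will|we'll|i will|i'll)\s+(refund|issue a refund)\b/i.test(output);
  return { pass: !promised, reason: promised ? "promises a refund" : "no refund promise" };
};

// Code grader: no email address appears in the reply (a crude stand-in for a PII check).
const noEmailLeak: Grader = (output) => {
  const leaked = /[\w.+-]+@[\w-]+\.[\w.]+/.test(output);
  return { pass: !leaked, reason: leaked ? "contains an email address" : "no email found" };
};

const refundCase: EvalCase = {
  id: "refund-email",
  input: "Customer email asking for a refund",
  graders: { noRefundPromise, noEmailLeak },
};

// Grade one output against every criterion attached to the case.
function gradeCase(c: EvalCase, output: string): Record<string, Verdict> {
  const results: Record<string, Verdict> = {};
  for (const [name, grader] of Object.entries(c.graders)) {
    results[name] = grader(output);
  }
  return results;
}
```

Keeping each criterion as its own named grader means a failing case reports which criterion failed, which is what error analysis needs.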

Side by side

Aspect            | TDD                           | BDD                                 | EDD
Originated        | Kent Beck (~2003)             | Dan North (~2006)                   | Emerging (2024–2026)
Drives            | Deterministic code            | Behavior + shared understanding     | AI-assisted and agent behavior
The "spec" is     | A unit test                   | A Given-When-Then scenario          | An eval (dataset + criterion + grader)
Written by        | Developers                    | Devs + business + QA                | Devs + domain experts (from real failures)
Expressed in      | Code                          | Near-natural language (Gherkin)     | Examples + a rubric (code or judge)
System under test | Deterministic units           | Deterministic behavior, end-to-end  | Non-deterministic model / agent
Grader            | Code assertion                | Step definitions (code)             | Code or LLM-judge or human
Result            | Binary, exact match           | Binary, scenario pass               | Graded; statistical (pass^k, error bars)
Spec is set       | Up front (test-first)         | Up front (with stakeholders)        | Discovered via error analysis (criteria drift)
Typical failure   | Brittle, over-specified tests | Scenario bloat / "Cucumber theater" | Contamination, judge bias, Goodharting

The same feature, three ways

The family resemblance is clearest when you write the same intent in each style.

TDD — a developer-level assertion, graded by exact match:

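// TDD: a deterministic unit test, written first (Jest-style globals assumed).
import { add } from "./add"; // hypothetical module; until it exists, the test fails (red)

test("adds two numbers", () => {
  expect(add(2, 2)).toBe(4);
});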

BDD — the behavior, in language everyone can read:

# BDD — a behavior scenario in shared, near-natural language.
Feature: Shopping cart
  Scenario: Adding a second item
    Given a cart containing 1 item
    When I add another item
    Then the cart shows 2 items

EDD — the same Given-When-Then shape, but the "Then" is graded, and a single run isn't enough:

# EDD — an eval case. Same Given/When/Then shape,
# but the "Then" is graded, and you read it statistically.
Given:  a customer email asking for a refund (case #214)
When:   the support agent drafts a reply
Then:   - reply is on-topic and polite     [grader: LLM-judge + rubric]
        - reply never promises a refund     [grader: code / regex]
        - no personal data is leaked        [grader: code]
Run it 5 times; require pass^5 (passes every time), not a lucky run.
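
Reading "Run it 5 times; require pass^5" as a number can be sketched as follows. The function names are hypothetical. The sketch assumes you have recorded n pass/fail trials for one case, of which c passed, and uses the standard combinatorial estimators: pass^k = C(c, k) / C(n, k) for "all k runs pass", and pass@k = 1 - C(n - c, k) / C(n, k) for "at least one of k runs passes".

```typescript
// Sketch only: function names are hypothetical, not from any eval library.

// Binomial coefficient C(n, k); returns 0 when k is out of range.
function choose(n: number, k: number): number {
  if (k < 0 || k > n) return 0;
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

function checkK(n: number, k: number): void {
  if (!Number.isInteger(k) || k < 1 || k > n) {
    throw new RangeError("k must be an integer between 1 and the number of trials");
  }
}

// pass^k: estimated chance that k runs ALL pass.
function passHatK(trials: boolean[], k: number): number {
  checkK(trials.length, k);
  const c = trials.filter(Boolean).length;
  return choose(c, k) / choose(trials.length, k);
}

// pass@k: estimated chance that AT LEAST ONE of k runs passes.
function passAtK(trials: boolean[], k: number): number {
  checkK(trials.length, k);
  const c = trials.filter(Boolean).length;
  return 1 - choose(trials.length - c, k) / choose(trials.length, k);
}

// Five recorded runs of one case: four passes, one failure.
const runs = [true, true, true, true, false];
const allFive = passHatK(runs, 5); // 0: one failure is enough to miss pass^5
const anyOfFive = passAtK(runs, 5); // 1: at least one run passed
const anyTwo = passHatK(runs, 2);   // 0.6: chance two sampled runs both pass
```

The same five runs score 1 on pass@5 and 0 on pass^5, which is why a reliability gate uses pass^k: pass@k rewards a lucky run, pass^k penalizes an unlucky one.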

EDD is closest to BDD — with one big twist

It's tempting to frame EDD as "TDD for AI," but it is spiritually nearer to BDD. Both specify behavior by example; both are outside-in (start from what a user should get); both depend on a shared understanding built with domain experts — BDD's ubiquitous language has a direct echo in EDD's rubric, hammered out with a "principal domain expert" who decides what good output looks like. An eval case even fits the Given-When-Then mould: Given an input, When the model or agent acts, Then the result satisfies a criterion.

The twist is everything that follows from non-determinism:

  1. The "Then" is graded, not asserted. Where BDD checks an exact outcome, EDD often grades quality, behavior, or faithfulness — sometimes with an LLM-judge, which is itself fallible and must be validated against human labels.
  2. A pass is statistical. The same case can pass once and fail next time, so you run it repeatedly and report reliability (pass^k, the chance it passes every time) and error bars — not a single green run.
  3. The spec is discovered. BDD writes scenarios up front with stakeholders; EDD leans on "criteria drift" — you learn your real criteria by grading actual outputs, so the eval set grows from error analysis, not an imagined list.
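
Point 1 says an LLM-judge must be validated against human labels. One common way to check that, sketched here under the assumption that both the judge and the human give binary pass/fail labels for the same outputs, is to report raw agreement alongside Cohen's kappa, which discounts the agreement two raters would reach by chance. The helper name judgeAgreement is hypothetical.

```typescript
// Hypothetical helper: compare an LLM-judge's pass/fail labels with human labels.
function judgeAgreement(
  judge: boolean[],
  human: boolean[],
): { agreement: number; kappa: number } {
  if (judge.length !== human.length || judge.length === 0) {
    throw new RangeError("label arrays must be non-empty and the same length");
  }
  const n = judge.length;
  let agree = 0;
  let judgePass = 0;
  let humanPass = 0;
  for (let i = 0; i < n; i++) {
    if (judge[i] === human[i]) agree++;
    if (judge[i]) judgePass++;
    if (human[i]) humanPass++;
  }
  // Observed agreement: fraction of items where both raters gave the same label.
  const observed = agree / n;
  // Chance agreement: both say pass, or both say fail, if they labeled independently.
  const pj = judgePass / n;
  const ph = humanPass / n;
  const expected = pj * ph + (1 - pj) * (1 - ph);
  // Kappa is undefined when chance agreement is already 1 (both raters use a single label).
  const kappa = expected === 1 ? NaN : (observed - expected) / (1 - expected);
  return { agreement: observed, kappa };
}

// Four outputs labeled by both raters; they disagree on the second one.
const report = judgeAgreement(
  [true, true, false, false],  // judge
  [true, false, false, false], // human
);
// report.agreement is 0.75 and report.kappa is 0.5
```

Raw agreement alone can look high when almost every output passes; kappa stays low in that situation, so it is the more honest number to watch before trusting a judge as a grader.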

Do they replace each other? No — they layer

EDD does not retire TDD or BDD; it extends the family to the parts of a system that classical tests can't express. In a modern AI product all three coexist.

Reach for the cheapest one that fits: a deterministic test if the answer is verifiable, a behavior scenario if a stakeholder needs to read it, and an eval only when the output is probabilistic enough that exact-match would lie.

The through-line

TDD made tests the spec. BDD made the spec readable and behavioral and shared. EDD makes the spec able to grade things that aren't deterministic (probabilistic output and agent behavior) by adding model and human graders and statistical reading. Same move, harder target.

Bottom line

Eval-driven development is the AI-era heir to a thirty-year lineage, not a break from it. If you've done TDD or BDD, you already know the rhythm: write the check first, build to pass it, gate it in CI, guard against regressions. EDD asks you to add three things the probabilistic world demands — graders that can be models, results read as statistics, and a spec you discover by looking at real failures.

For the focused head-to-head, see eval-driven development vs. test-driven development. Then start with the definition, learn how to write evals for a coding agent, or dig into the evidence in the codex.

Grounded in the EDD codex — Part VI (the practice and the TDD analogy), Part I (evals as experiments, pass@k vs pass^k), Part II (LLM-as-judge and its biases). TDD (Kent Beck) and BDD (Dan North; Gherkin/Cucumber; specification by example) are the analogy anchors.