Eval-Driven Development

The eval tooling landscape

There is no single "eval tool." The space splits into four overlapping layers, and most teams need at least two of them. Picking one product to do everything is the common mistake. This is a neutral survey of what the major tools do and how to choose — drawn entirely from Part VII of the EDD codex. The space moves fast, so treat everything here as a snapshot as of mid-2026.

The four layers

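Practitioner comparisons converge on the same shape: a lightweight CI/test framework plus an observability platform, with two more specialized layers around the edges. In the order used below, the layers are (1) CI/offline eval frameworks, (2) observability and eval platforms, (3) RAG-specialized libraries, and (4) research benchmark harnesses.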

The tools, by layer

OSS-versus-commercial is genuinely mixed here, so read the license rather than the marketing. "Open core" frequently means the useful collaboration features (dashboards, regression tracking) live in a paid tier even when the framework itself is permissively licensed.

Tool | Layer | License / model | Best fit
--- | --- | --- | ---
Promptfoo | CI / offline framework | OSS, MIT (now part of OpenAI; remains MIT) | Config-as-spec CI runner; red-teaming; RAG
Inspect AI | CI / offline framework | OSS (UK AISI + Meridian Labs) | Agentic & safety evals; sandboxing untrusted model code; can drive external agents
DeepEval | CI / offline framework | OSS Python (open-core; paid Confident AI tier) | pytest-native regression gating; 50+ metrics; RAG
Ragas | RAG-specialized library | OSS, Apache-2.0 | Reference-free retrieval quality: faithfulness, answer relevancy, context precision/recall
Braintrust | Observability + eval platform | Commercial / SaaS-only core (OSS scorer lib: autoevals) | Experiment tracking + release gating; online scoring
Langfuse | Observability + eval platform | OSS, MIT core (enterprise ee folders); cloud + self-host | Portable tracing + dataset experiments backbone
Arize Phoenix | Observability + eval platform | Source-available, Elastic License 2.0 (not OSI) | OTEL tracing + datasets/experiments in one stack
W&B Weave | Observability + eval platform | SDK OSS, Apache-2.0; value in hosted platform | Agent-native traces + scored experiments + guardrail scorers
LangSmith | Observability + eval platform | Commercial / proprietary (LangChain/LangGraph are MIT) | Eval + tracing if already on LangChain
Helicone | Observability / AI gateway | OSS, Apache-2.0; self-hostable + cloud | Trace/score sink; does not run evals itself
TruLens | Observability + eval library | OSS, MIT (TruEra, now Snowflake) | Portable OTEL traces + feedback scoring; agent evaluators
Patronus AI | Eval + observability + guardrails | Primarily commercial / hosted (some OSS, e.g. Lynx) | RAG hallucination & agent failure detection
OpenAI Evals | Offline / research harness | OSS, MIT (light maintenance) | Baseline framework; closest to OpenAI models
lm-evaluation-harness | Research benchmark harness | OSS, MIT (EleutherAI) | Model capability benchmarking / model selection
HELM | Research benchmark harness | OSS, Apache-2.0 (maintenance mode since Jun 2026) | Holistic multi-metric model selection

What each layer is actually for

Layer 1 carries the load for EDD. Promptfoo evaluates and red-teams via a declarative config (a matrix of prompts × providers × test cases), with deterministic assertions (contains, regex, latency, cost) plus model-assisted ones, and runs locally with strong CI/CD integration. DeepEval is pytest-integrated, ships research-backed metrics, and is designed for regression gating, so it feels like unit testing for LLMs. Inspect AI is the standout for agentic and safety work: it sandboxes untrusted model code in Docker/Kubernetes, has built-in tool and MCP support and model-graded scorers, and can drive external agents like Claude Code and Gemini CLI.
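To make the matrix-plus-assertions idea concrete, here is a minimal, tool-agnostic sketch in plain Python. It is not Promptfoo's, DeepEval's, or Inspect AI's API: the `Output` type, the stub providers, and the helpers `contains`, `matches`, `max_latency`, `max_cost`, and `run_matrix` are all hypothetical names invented for illustration.

```python
import re
from dataclasses import dataclass
from itertools import product


@dataclass
class Output:
    text: str
    latency_ms: float
    cost_usd: float


# Deterministic assertion factories (hypothetical helpers, not a real tool's API).
def contains(needle):
    return lambda out: needle in out.text

def matches(pattern):
    return lambda out: re.search(pattern, out.text) is not None

def max_latency(ms):
    return lambda out: out.latency_ms <= ms

def max_cost(usd):
    return lambda out: out.cost_usd <= usd


# Stub providers standing in for real model calls.
def terse(prompt):
    return Output("Paris", latency_ms=80.0, cost_usd=0.0001)

def verbose(prompt):
    return Output("The capital of France is Paris.", latency_ms=900.0, cost_usd=0.002)


def run_matrix(prompts, providers, cases):
    """Run every prompt x provider x test case and collect failed assertions."""
    results = []
    for prompt, (name, provider), case in product(prompts, providers.items(), cases):
        out = provider(prompt.format(**case["vars"]))
        failed = [label for label, check in case["asserts"] if not check(out)]
        results.append({"prompt": prompt, "provider": name, "failed": failed})
    return results


prompts = ["Q: {question}", "Answer in one word: {question}"]
providers = {"terse": terse, "verbose": verbose}
cases = [{
    "vars": {"question": "What is the capital of France?"},
    "asserts": [
        ("contains Paris", contains("Paris")),
        ("single word", matches(r"^\w+$")),
        ("latency <= 500 ms", max_latency(500)),
        ("cost <= $0.001", max_cost(0.001)),
    ],
}]

results = run_matrix(prompts, providers, cases)
for r in results:
    print(r["provider"], "PASS" if not r["failed"] else f"FAIL {r['failed']}")
```

Every check here is a pure function of the output, so the same inputs always produce the same verdict. That is the property that makes this layer suitable for CI; model-assisted assertions would sit alongside these and need the extra care described for LLM-as-judge.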

Layer 2 captures what production actually does. Langfuse, Phoenix, Weave, LangSmith, and Helicone capture traces and run online scoring. They are where drift and novel inputs surface. But their offline-eval depth varies and is generally shallower than a dedicated framework — and Helicone is explicit in its own docs that it does not run evaluations for you; it reports scores computed elsewhere. Prefer OTEL-based tracing (Phoenix, TruLens, Langfuse, Weave) where portability matters.

Layer 3 grades retrieval reference-free. Ragas decomposes RAG into faithfulness, answer relevancy, and context precision/recall — gradable without gold answers. It is a library, not a platform, so you bring your own orchestration and CI.
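As a rough illustration of why faithfulness needs no gold answer, the sketch below scores an answer as the fraction of its claims that the retrieved context supports, which is how the metric is commonly described. This is not Ragas's implementation: in practice both claim extraction and the support check are done by a model, whereas `naive_supported` here is a hypothetical word-overlap stand-in so the example runs offline.

```python
def naive_supported(claim: str, context: str) -> bool:
    """Toy support check: every word of the claim appears in the context.

    Assumption for illustration only; a real pipeline would use an LLM or NLI model.
    """
    claim_words = {w.strip(".,").lower() for w in claim.split()}
    context_words = {w.strip(".,").lower() for w in context.split()}
    return claim_words <= context_words


def faithfulness(claims, contexts, is_supported=naive_supported) -> float:
    """Fraction of answer claims supported by at least one retrieved context."""
    if not claims:
        return 0.0
    supported = sum(
        1 for claim in claims if any(is_supported(claim, ctx) for ctx in contexts)
    )
    return supported / len(claims)


contexts = ["Albert Einstein was born in 1879 in Ulm, Germany."]
claims = [
    "Einstein was born in Germany",  # supported by the context
    "Einstein was born in 1900",     # not supported: the year is wrong
]

score = faithfulness(claims, contexts)
print(score)
```

The score depends only on the answer and the retrieved context, never on a reference answer, which is what "reference-free" means in this layer.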

Layer 4 measures the model, not your app. EleutherAI's lm-evaluation-harness (the backend for HuggingFace's Open LLM Leaderboard) and Stanford CRFM's HELM measure capability on academic benchmarks. Use them to choose a model; a high benchmark score is not your app passing its evals.

LLM-as-judge is the common scoring mechanism, and it is fallible

Nearly every tool above leans on LLM-as-judge scoring, and it is empirically unreliable if used naively. A 15-judge study over roughly 150k instances found systematic position bias (judges favoring answers by placement, not quality) driven by identifiable judge and task factors, not random noise. Treat judge prompts as code: pin temperature to 0, randomize ordering, and validate against human labels before trusting a green run.
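One way to act on the position-bias warning is sketched below, under stated assumptions. Instead of randomizing order once, it asks the judge in both orders and only accepts a verdict that survives the swap, then measures agreement with human labels. The judges are hypothetical stubs so the example runs offline; a real judge would call a model with temperature pinned to 0.

```python
def position_biased_judge(question, first, second):
    """Stub judge that always prefers whatever is shown first (pure position bias)."""
    return "first"


def length_judge(question, first, second):
    """Stub judge whose preference depends on content (length), not position."""
    return "first" if len(first) >= len(second) else "second"


def judge_pair(judge, question, a, b):
    """Judge (a, b) in both orders; return 'a', 'b', or 'inconsistent'."""
    verdict_ab = judge(question, a, b)  # a shown first
    verdict_ba = judge(question, b, a)  # b shown first
    winner_ab = "a" if verdict_ab == "first" else "b"
    winner_ba = "b" if verdict_ba == "first" else "a"
    return winner_ab if winner_ab == winner_ba else "inconsistent"


def agreement(verdicts, human_labels):
    """Share of items where the judge's verdict matches the human label."""
    hits = sum(v == h for v, h in zip(verdicts, human_labels))
    return hits / len(human_labels)


pairs = [
    ("q1", "short", "a much longer answer"),
    ("q2", "detailed explanation here", "meh"),
]
human_labels = ["b", "a"]  # hypothetical human preferences

biased_verdicts = [judge_pair(position_biased_judge, q, a, b) for q, a, b in pairs]
stable_verdicts = [judge_pair(length_judge, q, a, b) for q, a, b in pairs]

print(biased_verdicts, agreement(biased_verdicts, human_labels))
print(stable_verdicts, agreement(stable_verdicts, human_labels))
```

The swap exposes the biased judge immediately: every verdict flips with the ordering and is marked inconsistent. Agreement with human labels is a separate check, since a judge can be perfectly consistent and still disagree with people.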

Three pitfalls to price in

  1. Lock-in and migration cost. LangSmith's tight LangChain/LangGraph coupling becomes a liability if you change frameworks; Braintrust and LangSmith are largely SaaS-only with no self-hosting on lower tiers; free trace tiers (LangSmith's 5k base traces, for example) can run dry within a week of real usage.
  2. License is not a marketing detail. Permissive OSS includes Promptfoo, OpenAI Evals, lm-evaluation-harness, and TruLens (all MIT), plus Ragas, Helicone, HELM, and the W&B Weave SDK (all Apache-2.0). But Arize Phoenix is source-available under Elastic License 2.0, which is not OSI-approved and restricts offering it as a managed service. And open-core tools (DeepEval's Confident AI tier, Langfuse's enterprise ee folders, Weave's hosted platform) put the useful collaboration features behind a paywall.
  3. Maintenance risk is real even for "official" tools. OpenAI Evals is in light maintenance — a Sep-2024-to-Nov-2025 activity gap, recent commits dominated by housekeeping — and HELM entered maintenance mode on June 1, 2026 per its own README. Do not assume a famous harness is actively developed. Consolidation adds another wrinkle: OpenAI acquired Promptfoo (announced March 2026, still MIT), and Snowflake backs TruLens — vendor independence today does not guarantee it tomorrow.

How to choose for EDD

For eval-driven development specifically, the load-bearing capability is a versioned dataset plus a scorer suite that runs deterministically in CI and fails the build on regression. DeepEval, Promptfoo, Inspect AI, and the experiment features of Braintrust and Langfuse all support this pattern; pure observability tools like Helicone explicitly do not run evals themselves and instead ingest scores from elsewhere.
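The pattern described above (versioned dataset, scorer suite, build failure on regression) can be sketched in a few lines of plain Python. Everything here is a hypothetical illustration rather than any listed tool's API: the dataset layout, the stub `app`, the scorers, and the `run_suite` and `find_regressions` helpers are assumed names.

```python
from statistics import mean

# Versioned dataset: the version string travels with the cases (assumed layout).
DATASET = {
    "version": "2026-06-01",
    "cases": [
        {"input": "hello", "expected": "HELLO"},
        {"input": "goodbye", "expected": "GOODBYE"},
    ],
}

# Scores recorded from the last accepted run on the same dataset version.
BASELINE = {"exact_match": 1.0, "non_empty": 1.0}


def app(text: str) -> str:
    """Stub app under test; deliberately wrong on inputs of 6+ characters."""
    return text.upper() if len(text) < 6 else text


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


def non_empty(output: str, expected: str) -> float:
    return 1.0 if output.strip() else 0.0


def run_suite(app_fn, dataset, scorers):
    """Return the mean score per scorer over every case in the dataset."""
    return {
        name: mean(scorer(app_fn(c["input"]), c["expected"]) for c in dataset["cases"])
        for name, scorer in scorers.items()
    }


def find_regressions(current, baseline, tolerance=0.0):
    """List every scorer whose mean dropped below baseline minus tolerance."""
    return [
        f"{name}: {current[name]:.2f} < baseline {base:.2f}"
        for name, base in baseline.items()
        if current[name] < base - tolerance
    ]


scores = run_suite(app, DATASET, {"exact_match": exact_match, "non_empty": non_empty})
regressions = find_regressions(scores, BASELINE)
exit_code = 1 if regressions else 0  # a CI script would end with sys.exit(exit_code)

print(f"dataset {DATASET['version']}: {scores}")
print("regressions:", regressions, "exit code:", exit_code)
```

The nonzero exit code is what actually fails the build. Keeping the baseline keyed to a dataset version matters because scores from different dataset versions are not comparable.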

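The practical rule of thumb the comparisons converge on: pair a lightweight CI/test framework with an observability platform, and add a RAG-specialized library or a research benchmark harness only where your app calls for one.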

No tool here is "the best"; the right stack depends on your app, your framework commitments, and your appetite for lock-in.

Grounded in the EDD codex — Part VII, the vendor-neutral eval tooling landscape (tool licenses, fit, pitfalls, and the position-bias study behind the LLM-as-judge caveat).