The eval tooling landscape
There is no single "eval tool." The space splits into four overlapping layers, and most teams need at least two of them. Picking one product to do everything is the common mistake. This is a neutral survey of what the major tools do and how to choose — drawn entirely from Part VII of the EDD codex. The space moves fast, so treat everything here as a snapshot as of mid-2026.
The four layers
Practitioner comparisons converge on the same shape: a lightweight CI/test framework plus an observability platform, with two more specialized layers around the edges.
- Layer 1 — CI/offline test frameworks. Run a versioned dataset against a scorer suite and gate a build. This is the core mechanic of eval-driven development. Promptfoo, DeepEval, Inspect AI, Ragas.
- Layer 2 — tracing/observability platforms with evals bolted on. Capture production traces and run online scoring; their offline-eval depth is generally shallower than dedicated frameworks. Langfuse, Arize Phoenix, W&B Weave, LangSmith, Helicone.
- Layer 3 — RAG-specialized evals. Purpose-built reference-free metrics for retrieval pipelines. Ragas.
- Layer 4 — research-grade benchmark harnesses. Measure model capability on academic benchmarks — useful for model selection, not for evaluating your app's behavior. lm-evaluation-harness, HELM.
The tools, by layer
OSS-versus-commercial is genuinely mixed here, so read the license rather than the marketing. "Open core" frequently means the useful collaboration features (dashboards, regression tracking) live in a paid tier even when the framework itself is permissively licensed.
| Tool | Layer | License / model | Best fit |
|---|---|---|---|
| Promptfoo | CI / offline framework | OSS, MIT (now part of OpenAI; remains MIT) | Config-as-spec CI runner; red-teaming; RAG |
| Inspect AI | CI / offline framework | OSS (UK AISI + Meridian Labs) | Agentic & safety evals; sandboxing untrusted model code; can drive external agents |
| DeepEval | CI / offline framework | OSS Python (open-core; paid Confident AI tier) | pytest-native regression gating; 50+ metrics; RAG |
| Ragas | RAG-specialized library | OSS, Apache-2.0 | Reference-free retrieval quality: faithfulness, answer relevancy, context precision/recall |
| Braintrust | Observability + eval platform | Commercial / SaaS-only core (OSS scorer lib: autoevals) | Experiment tracking + release gating; online scoring |
| Langfuse | Observability + eval platform | OSS, MIT core (enterprise ee folders); cloud + self-host | Portable tracing + dataset experiments backbone |
| Arize Phoenix | Observability + eval platform | Source-available, Elastic License 2.0 (not OSI) | OTEL tracing + datasets/experiments in one stack |
| W&B Weave | Observability + eval platform | SDK OSS, Apache-2.0; value in hosted platform | Agent-native traces + scored experiments + guardrail scorers |
| LangSmith | Observability + eval platform | Commercial / proprietary (LangChain/LangGraph are MIT) | Eval + tracing if already on LangChain |
| Helicone | Observability / AI gateway | OSS, Apache-2.0; self-hostable + cloud | Trace/score sink — does not run evals itself |
| TruLens | Observability + eval library | OSS, MIT (TruEra, now Snowflake) | Portable OTEL traces + feedback scoring; agent evaluators |
| Patronus AI | Eval + observability + guardrails | Primarily commercial / hosted (some OSS, e.g. Lynx) | RAG hallucination & agent failure detection |
| OpenAI Evals | Offline / research harness | OSS, MIT (light maintenance) | Baseline framework; closest to OpenAI models |
| lm-evaluation-harness | Research benchmark harness | OSS, MIT (EleutherAI) | Model capability benchmarking / model selection |
| HELM | Research benchmark harness | OSS, Apache-2.0 (maintenance mode since Jun 2026) | Holistic multi-metric model selection |
What each layer is actually for
Layer 1 carries the load for EDD. Promptfoo evaluates and red-teams via a declarative config (a matrix of prompts × providers × test cases), combining deterministic assertions (contains, regex, latency, cost) with model-assisted ones, and runs locally with strong CI/CD integration. DeepEval is pytest-integrated, ships research-backed metrics, and is designed for regression gating; it feels like unit testing for LLMs. Inspect AI is the standout for agentic and safety work: it sandboxes untrusted model code in Docker/Kubernetes, has built-in tool and MCP support and model-graded scorers, and can drive external agents like Claude Code and Gemini CLI.
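To make the "deterministic assertions" idea concrete, here is a minimal sketch in plain Python. It does not use Promptfoo's or DeepEval's actual APIs; the `Output` record, the assertion helpers, and the sample values are all hypothetical, chosen only to mirror the four assertion kinds named above (contains, regex, latency, cost).

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Output:
    """Hypothetical record of one model response plus its run metadata."""
    text: str
    latency_ms: float
    cost_usd: float


# An assertion is any function that maps an Output to pass/fail.
Assertion = Callable[[Output], bool]


def contains(needle: str) -> Assertion:
    return lambda out: needle in out.text


def matches(pattern: str) -> Assertion:
    compiled = re.compile(pattern)
    return lambda out: compiled.search(out.text) is not None


def latency_under(limit_ms: float) -> Assertion:
    return lambda out: out.latency_ms <= limit_ms


def cost_under(limit_usd: float) -> Assertion:
    return lambda out: out.cost_usd <= limit_usd


def run_case(out: Output, assertions: Dict[str, Assertion]) -> Dict[str, bool]:
    """Apply every named assertion to a single output."""
    return {name: check(out) for name, check in assertions.items()}


# Illustrative data only; a real run would come from calling the app.
example = Output(
    text="Refund issued. Ticket #4821 closed.",
    latency_ms=640.0,
    cost_usd=0.0021,
)
checks = {
    "mentions_refund": contains("Refund"),
    "has_ticket_id": matches(r"#\d{4}"),
    "fast_enough": latency_under(1000),
    "cheap_enough": cost_under(0.001),
}
results = run_case(example, checks)
print(results)
```

Because none of these checks call a model, they give the same verdict on every run, which is what makes them safe to gate a build on. Model-assisted assertions would sit alongside them and need separate validation.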
Layer 2 captures what production actually does. Langfuse, Phoenix, Weave, and LangSmith capture traces and run online scoring; this is where drift and novel inputs surface. Their offline-eval depth varies and is generally shallower than a dedicated framework. Helicone is the outlier: its own docs are explicit that it does not run evaluations for you and only reports scores computed elsewhere. Prefer OTEL-based tracing (Phoenix, TruLens, Langfuse, Weave) where portability matters.
Layer 3 grades retrieval reference-free. Ragas decomposes RAG into faithfulness, answer relevancy, and context precision/recall — gradable without gold answers. It is a library, not a platform, so you bring your own orchestration and CI.
Layer 4 measures the model, not your app. EleutherAI's lm-evaluation-harness (the backend for HuggingFace's Open LLM Leaderboard) and Stanford CRFM's HELM measure capability on academic benchmarks. Use them to choose a model; a high benchmark score is not your app passing its evals.
Three pitfalls to price in
- Lock-in and migration cost. LangSmith's tight LangChain/LangGraph coupling becomes a liability if you change frameworks; Braintrust and LangSmith are largely SaaS-only with no self-hosting on lower tiers; free trace tiers (LangSmith's 5k base traces, for example) can run dry within a week of real usage.
- License is not a marketing detail. Permissive OSS includes Promptfoo, OpenAI Evals, lm-evaluation-harness, and TruLens (all MIT), plus Ragas, Helicone, HELM, and the W&B Weave SDK (all Apache-2.0). But Arize Phoenix is source-available under Elastic License 2.0, which is not OSI-approved and restricts offering it as a managed service. And open-core tools (DeepEval's Confident AI tier, Langfuse's enterprise ee folders, Weave's hosted platform) put the useful collaboration features behind a paywall.
- Maintenance risk is real even for "official" tools. OpenAI Evals is in light maintenance — a Sep-2024-to-Nov-2025 activity gap, recent commits dominated by housekeeping — and HELM entered maintenance mode on June 1, 2026 per its own README. Do not assume a famous harness is actively developed. Consolidation adds another wrinkle: OpenAI acquired Promptfoo (announced March 2026, still MIT), and Snowflake backs TruLens — vendor independence today does not guarantee it tomorrow.
How to choose for EDD
For eval-driven development specifically, the load-bearing capability is a versioned dataset plus a scorer suite that runs deterministically in CI and fails the build on regression. DeepEval, Promptfoo, Inspect AI, and the experiment features of Braintrust and Langfuse all support this pattern; pure observability tools like Helicone explicitly do not run evals themselves and instead ingest scores from elsewhere.
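The load-bearing pattern (versioned dataset, scorer suite, build fails on regression) can be sketched without any of the tools above. Everything in this example is an assumption made for illustration: the inline `DATASET`, the stand-in `app_under_test`, the `exact_match` scorer, and the stored `BASELINE_SCORE` are hypothetical, and a real setup would load the dataset and baseline from version control.

```python
import hashlib
import json

# Hypothetical inline dataset; in practice this would be a file under version control.
DATASET = [
    {"id": "case-1", "input": "hello", "expected": "HELLO"},
    {"id": "case-2", "input": "Eval", "expected": "EVAL"},
    {"id": "case-3", "input": "ci gate", "expected": "CI GATE"},
]

# Hypothetical score recorded from the last accepted build.
BASELINE_SCORE = 1.0


def dataset_version(rows) -> str:
    """Content hash of the dataset, so a score is always tied to the exact rows."""
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]


def app_under_test(text: str) -> str:
    """Stand-in for the real application call (assumed deterministic here)."""
    return text.upper()


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


def evaluate(rows, app, scorer) -> float:
    """Mean score of the app over every row in the dataset."""
    scores = [scorer(app(row["input"]), row["expected"]) for row in rows]
    return sum(scores) / len(scores)


def gate(score: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Pass only if the score has not dropped below baseline minus tolerance."""
    return score >= baseline - tolerance


def main() -> int:
    score = evaluate(DATASET, app_under_test, exact_match)
    print(
        f"dataset={dataset_version(DATASET)} "
        f"score={score:.2f} baseline={BASELINE_SCORE:.2f}"
    )
    return 0 if gate(score, BASELINE_SCORE) else 1


exit_code = main()
```

In a CI job the script would finish with `raise SystemExit(exit_code)`, so a non-zero code fails the build. Logging the dataset hash next to the score is one way to keep runs comparable: a score change against a different hash is a dataset change, not necessarily a regression.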
The practical rule of thumb the comparisons converge on:
- Start with a Layer-1 framework for the CI gate — Promptfoo or DeepEval for app behavior, Inspect AI when agents or untrusted code are involved.
- Add a Layer-2 platform for production traces and online scoring, favoring OTEL-based options if you want to keep the door open to switching.
- Don't expect one tool to be the best — combine by category, vet the license and maintenance status, and treat any judge-based scorer as code you must validate.
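One way to treat a judge-based scorer as code to validate is a position-swap check: ask a pairwise judge the same question twice with the candidates in opposite order, and count how often the same answer wins. The sketch below is a minimal illustration of that idea, not any tool's built-in feature; the `Judge` signature, the two toy judges, and `PAIRS` are hypothetical stand-ins for a real LLM judge and real data.

```python
from typing import Callable, List, Tuple

# Hypothetical judge signature: (question, first, second) -> "first" or "second".
Judge = Callable[[str, str, str], str]


def position_consistency(judge: Judge, pairs: List[Tuple[str, str, str]]) -> float:
    """Fraction of pairs where the same answer wins in both presentation orders."""
    consistent = 0
    for question, a, b in pairs:
        forward = judge(question, a, b)  # a is shown first
        swapped = judge(question, b, a)  # b is shown first
        winner_forward = a if forward == "first" else b
        winner_swapped = b if swapped == "first" else a
        if winner_forward == winner_swapped:
            consistent += 1
    return consistent / len(pairs)


def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias: it always prefers whatever is shown first."""
    return "first"


def longer_wins(question: str, first: str, second: str) -> str:
    """Toy judge that ignores position but is biased toward length."""
    return "first" if len(first) >= len(second) else "second"


# Illustrative pairs only.
PAIRS = [
    ("What is 2+2?", "4", "It is 5, obviously."),
    ("Capital of France?", "Paris", "Lyon is the capital."),
]

biased_rate = position_consistency(always_first, PAIRS)
stable_rate = position_consistency(longer_wins, PAIRS)
print(f"always_first consistency: {biased_rate:.2f}")
print(f"longer_wins consistency: {stable_rate:.2f}")
```

Note that consistency is not correctness: the `longer_wins` judge is perfectly stable under swapping and still picks the wrong answer in both pairs. A swap check catches position bias only, so it complements rather than replaces agreement checks against human labels.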
No tool here is "the best"; the right stack depends on your app, your framework commitments, and your appetite for lock-in.
Grounded in the EDD codex — Part VII, the vendor-neutral eval tooling landscape (tool licenses, fit, pitfalls, and the position-bias study behind the LLM-as-judge caveat).