
The eval tooling landscape

By Brenn Hill · Updated June 2026

There is no single "eval tool." The space splits into four overlapping layers, and most teams need at least two of them. Picking one product to do everything is the common mistake. This is a neutral survey of what the major tools do and how to choose — drawn entirely from Part VII of the EDD codex. The space moves fast, so treat everything here as a snapshot as of mid-2026.

The four layers

Practitioner comparisons converge on the same shape: a lightweight CI/test framework plus an observability platform, with two more specialized layers around the edges.

Layer 1 — CI/offline test frameworks. Run a versioned dataset against a scorer suite and gate a build. This is the core mechanic of eval-driven development. Promptfoo, DeepEval, Inspect AI, Ragas.
Layer 2 — tracing/observability platforms with evals bolted on. Capture production traces and run online scoring; their offline-eval depth is generally shallower than dedicated frameworks. Langfuse, Arize Phoenix, W&B Weave, LangSmith, Helicone.
Layer 3 — RAG-specialized evals. Purpose-built reference-free metrics for retrieval pipelines. Ragas.
Layer 4 — research-grade benchmark harnesses. Measure model capability on academic benchmarks — useful for model selection, not for evaluating your app's behavior. lm-evaluation-harness, HELM.

The tools, by layer

OSS-versus-commercial is genuinely mixed here, so read the license rather than the marketing. "Open core" frequently means the useful collaboration features (dashboards, regression tracking) live in a paid tier even when the framework itself is permissively licensed.

Tool	Layer	License / model	Best fit
Promptfoo	CI / offline framework	OSS, MIT (now part of OpenAI; remains MIT)	Config-as-spec CI runner; red-teaming; RAG
Inspect AI	CI / offline framework	OSS (UK AISI + Meridian Labs)	Agentic & safety evals; sandboxing untrusted model code; can drive external agents
DeepEval	CI / offline framework	OSS Python (open-core; paid Confident AI tier)	pytest-native regression gating; 50+ metrics; RAG
Ragas	RAG-specialized library	OSS, Apache-2.0	Reference-free retrieval quality: faithfulness, answer relevancy, context precision/recall
Braintrust	Observability + eval platform	Commercial / SaaS-only core (OSS scorer lib: autoevals)	Experiment tracking + release gating; online scoring
Langfuse	Observability + eval platform	OSS, MIT core (enterprise ee folders); cloud + self-host	Portable tracing + dataset experiments backbone
Arize Phoenix	Observability + eval platform	Source-available, Elastic License 2.0 (not OSI)	OTEL tracing + datasets/experiments in one stack
W&B Weave	Observability + eval platform	SDK OSS, Apache-2.0; value in hosted platform	Agent-native traces + scored experiments + guardrail scorers
LangSmith	Observability + eval platform	Commercial / proprietary (LangChain/LangGraph are MIT)	Eval + tracing if already on LangChain
Helicone	Observability / AI gateway	OSS, Apache-2.0; self-hostable + cloud	Trace/score sink — does not run evals itself
TruLens	Observability + eval library	OSS, MIT (TruEra, now Snowflake)	Portable OTEL traces + feedback scoring; agent evaluators
Patronus AI	Eval + observability + guardrails	Primarily commercial / hosted (some OSS, e.g. Lynx)	RAG hallucination & agent failure detection
OpenAI Evals	Offline / research harness	OSS, MIT (light maintenance)	Baseline framework; closest to OpenAI models
lm-evaluation-harness	Research benchmark harness	OSS, MIT (EleutherAI)	Model capability benchmarking / model selection
HELM	Research benchmark harness	OSS, Apache-2.0 (maintenance mode since Jun 2026)	Holistic multi-metric model selection

What each layer is actually for

Layer 1 carries the load for EDD. Promptfoo evaluates and red-teams via a declarative config (a matrix of prompts × providers × test cases), with deterministic assertions (contains, regex, latency, cost) plus model-assisted ones, and runs locally with strong CI/CD integration. DeepEval is pytest-integrated, ships research-backed metrics, and is designed for regression gating; it feels like unit testing for LLMs. Inspect AI is the standout for agentic and safety work: it sandboxes untrusted model code in Docker/Kubernetes, has built-in tool and MCP support and model-graded scorers, and can drive external agents like Claude Code and Gemini CLI.
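To make "deterministic assertions" concrete, here is a minimal sketch in plain Python of the four check types named above (contains, regex, latency, cost). It is not Promptfoo's or DeepEval's API; the `Output` record, the check functions, and the thresholds are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Output:
    """One model response plus metadata a runner might record (hypothetical shape)."""
    text: str
    latency_ms: float
    cost_usd: float


def check_contains(out: Output, needle: str) -> bool:
    return needle in out.text


def check_regex(out: Output, pattern: str) -> bool:
    return re.search(pattern, out.text) is not None


def check_latency(out: Output, max_ms: float) -> bool:
    return out.latency_ms <= max_ms


def check_cost(out: Output, max_usd: float) -> bool:
    return out.cost_usd <= max_usd


def run_assertions(out, assertions):
    """Return the names of the assertions that failed; an empty list means pass."""
    return [name for name, check in assertions if not check(out)]


# Illustrative sample output and thresholds (assumed values, not real data).
sample = Output(text="Refund issued. Reference: RF-20418", latency_ms=840.0, cost_usd=0.0021)
assertions = [
    ("contains:Refund", lambda o: check_contains(o, "Refund")),
    ("regex:reference-id", lambda o: check_regex(o, r"RF-\d{5}")),
    ("latency<=1000ms", lambda o: check_latency(o, 1000)),
    ("cost<=0.001usd", lambda o: check_cost(o, 0.001)),
]

failures = run_assertions(sample, assertions)
print(failures)  # ['cost<=0.001usd']
```

None of these checks call a model, so the same output always yields the same verdict. That is what makes them safe to use as a hard CI gate, with model-assisted scorers layered on top.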

Layer 2 captures what production actually does. Langfuse, Phoenix, Weave, LangSmith, and Helicone capture traces and run online scoring. They are where drift and novel inputs surface. But their offline-eval depth varies and is generally shallower than a dedicated framework — and Helicone is explicit in its own docs that it does not run evaluations for you; it reports scores computed elsewhere. Prefer OTEL-based tracing (Phoenix, TruLens, Langfuse, Weave) where portability matters.

Layer 3 grades retrieval reference-free. Ragas decomposes RAG into faithfulness, answer relevancy, and context precision/recall — gradable without gold answers. It is a library, not a platform, so you bring your own orchestration and CI.

Layer 4 measures the model, not your app. EleutherAI's lm-evaluation-harness (the backend for HuggingFace's Open LLM Leaderboard) and Stanford CRFM's HELM measure capability on academic benchmarks. Use them to choose a model; a high benchmark score is not your app passing its evals.

LLM-as-judge is the common scoring mechanism, and it is fallible

Nearly every tool above leans on LLM-as-judge scoring, and it is empirically unreliable if used naively. A 15-judge study over roughly 150k instances found systematic position bias (judges favoring answers by placement, not quality) driven by identifiable judge and task factors, not random noise. Treat judge prompts as code: pin temperature to 0, randomize ordering, and validate against human labels before trusting a green run.
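One way to act on the position-bias caveat is sketched below: judge each pair in both orderings, keep the verdict only when the two runs agree, and then measure agreement against human labels. This is an assumption-laden illustration, not a method from the study. The `Judge` signature, `judge_pair`, `agreement_rate`, and both stand-in judges are hypothetical; a real judge would wrap a model call with temperature pinned to 0.

```python
import random
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def judge_pair(judge: Judge, question: str, a: str, b: str, rng: random.Random) -> str:
    """Judge both orderings; return 'a', 'b', or 'inconsistent' if the verdict flips."""
    orders = [(a, b), (b, a)]
    rng.shuffle(orders)  # randomize which ordering the judge sees first
    verdicts = set()
    for first, second in orders:
        pick = judge(question, first, second)
        winner = first if pick == "first" else second
        verdicts.add("a" if winner == a else "b")
    return verdicts.pop() if len(verdicts) == 1 else "inconsistent"


def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge verdict matches the human label."""
    pairs = list(zip(judge_labels, human_labels))
    return sum(j == h for j, h in pairs) / len(pairs)


# Stand-in judges for demonstration only; neither calls a model.
def position_biased_judge(question, first, second):
    return "first"  # always prefers whatever is shown first


def length_judge(question, first, second):
    return "first" if len(first) >= len(second) else "second"


rng = random.Random(0)
cases = [
    ("Q1", "short", "a much longer answer"),
    ("Q2", "detailed reply here", "ok"),
]
human = ["b", "a"]  # illustrative human labels, not real data

biased = [judge_pair(position_biased_judge, q, a, b, rng) for q, a, b in cases]
by_length = [judge_pair(length_judge, q, a, b, rng) for q, a, b in cases]

print(biased)                            # ['inconsistent', 'inconsistent']
print(agreement_rate(by_length, human))  # 1.0
```

A judge that only tracks position flips its verdict when the order is swapped, so every case comes back `inconsistent` and the bias is visible before it contaminates a score. Agreement with human labels is a separate check: a judge can be perfectly order-consistent and still wrong.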

Three pitfalls to price in

Lock-in and migration cost. LangSmith's tight LangChain/LangGraph coupling becomes a liability if you change frameworks; Braintrust and LangSmith are largely SaaS-only with no self-hosting on lower tiers; free trace tiers (LangSmith's 5k base traces, for example) can run dry within a week of real usage.
License is not a marketing detail. Permissive OSS includes Promptfoo, OpenAI Evals, lm-evaluation-harness, and TruLens (all MIT), plus Ragas, Helicone, HELM, and the W&B Weave SDK (all Apache-2.0). But Arize Phoenix is source-available under Elastic License 2.0, which is not OSI-approved and restricts offering it as a managed service. And open-core tools (DeepEval's Confident AI tier, Langfuse's enterprise ee folders, Weave's hosted platform) put the useful collaboration features behind a paywall.
Maintenance risk is real even for "official" tools. OpenAI Evals is in light maintenance — a Sep-2024-to-Nov-2025 activity gap, recent commits dominated by housekeeping — and HELM entered maintenance mode on June 1, 2026 per its own README. Do not assume a famous harness is actively developed. Consolidation adds another wrinkle: OpenAI acquired Promptfoo (announced March 2026, still MIT), and Snowflake backs TruLens — vendor independence today does not guarantee it tomorrow.

How to choose for EDD

For eval-driven development specifically, the load-bearing capability is a versioned dataset plus a scorer suite that runs deterministically in CI and fails the build on regression. DeepEval, Promptfoo, Inspect AI, and the experiment features of Braintrust and Langfuse all support this pattern; pure observability tools like Helicone explicitly do not run evals themselves and instead ingest scores from elsewhere.
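The pattern described above (versioned dataset, scorer suite, fail the build on regression) can be sketched without any particular framework. Everything here is a hypothetical stand-in: `toy_system` plays the app under test, `CASES` is an inline dataset, and `BASELINE` is an assumed previously recorded score. None of it reflects a specific tool's API.

```python
import hashlib
import json


def dataset_version(cases):
    """Content hash, so every score is tied to the exact dataset that produced it."""
    blob = json.dumps(cases, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def exact_match(output, case):
    """Deterministic scorer: 1.0 if the output equals the expected string."""
    return 1.0 if output.strip() == case["expected"] else 0.0


def run_suite(system, cases, scorers):
    """Run every case through the system and return the mean score per scorer."""
    totals = {name: 0.0 for name in scorers}
    for case in cases:
        output = system(case["input"])
        for name, scorer in scorers.items():
            totals[name] += scorer(output, case)
    return {name: total / len(cases) for name, total in totals.items()}


def gate(scores, baseline, tolerance=0.0):
    """Return the scorer names whose score fell below baseline minus tolerance."""
    return [
        name for name, base in baseline.items()
        if scores.get(name, 0.0) < base - tolerance
    ]


def toy_system(prompt):
    """Hypothetical stand-in for the app under test."""
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "unknown")


CASES = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3*3", "expected": "9"},
]
BASELINE = {"exact_match": 0.6}  # assumed score recorded from a previous accepted run

scores = run_suite(toy_system, CASES, {"exact_match": exact_match})
regressions = gate(scores, BASELINE)
exit_code = 1 if regressions else 0  # a CI job would pass this to sys.exit()

print(dataset_version(CASES), scores, regressions, exit_code)
```

The non-zero exit code is what turns a score into a gate: the CI job fails when any scorer drops below its baseline. Recording the dataset hash alongside the score keeps a changed dataset from being mistaken for a changed system.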

The practical rule of thumb the comparisons converge on:

Start with a Layer-1 framework for the CI gate — Promptfoo or DeepEval for app behavior, Inspect AI when agents or untrusted code are involved.
Add a Layer-2 platform for production traces and online scoring, favoring OTEL-based options if you want to keep the door open to switching.
Don't expect one tool to be the best — combine by category, vet the license and maintenance status, and treat any judge-based scorer as code you must validate.

No tool here is "the best"; the right stack depends on your app, your framework commitments, and your appetite for lock-in. Next: how to build an eval harness for an LLM app, how to write evals for an AI coding agent, or start from the overview.


Grounded in the EDD codex — Part VII, the vendor-neutral eval tooling landscape (tool licenses, fit, pitfalls, and the position-bias study behind the LLM-as-judge caveat).