# LLM-as-judge rubric template

Use this when quality is subjective and a deterministic check can't decide it. Pin temperature
to 0, randomize answer order when the judge compares several answers, and validate the judge
against a human expert before you trust it.
More: https://evaldrivendevelopment.dev/llm-as-judge-evals-when-and-how
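
Randomizing answer order could be sketched roughly as below. This is a minimal illustration, not part of the template: the function name `shuffle_answers`, the neutral labels A, B, ..., and the example ids `model_a` / `model_b` are all hypothetical. It assumes you pass the judge several candidate answers and map its choice back to the original id afterwards.

```python
import random


def shuffle_answers(answers: dict[str, str], seed: int) -> tuple[str, dict[str, str]]:
    """Shuffle candidate answers and hide their ids behind neutral labels.

    Returns the text block to paste into the judge prompt, plus a
    label -> original id mapping for decoding the judge's choice.
    All names here are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    ids = list(answers)
    rng.shuffle(ids)
    labels = [chr(ord("A") + i) for i in range(len(ids))]
    key = dict(zip(labels, ids))
    block = "\n\n".join(f"Answer {label}:\n{answers[key[label]]}" for label in labels)
    return block, key


block, key = shuffle_answers({"model_a": "First draft.", "model_b": "Second draft."}, seed=7)
print(block)
print(key)  # use this to translate the judge's "A"/"B" back to a model id
```

Using a different seed per example spreads each candidate across positions, so a judge that favors the first slot does not systematically favor one candidate.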

---

ROLE
You are evaluating the output of <task>. First reason briefly, then return a verdict.
Do not reward longer answers or the answer shown first.

REFERENCE (when available)
<the known-good answer or the source the output must be faithful to>

CRITERIA — judge each as PASS or FAIL with a one-line reason:
1. <criterion 1> — PASS if <concrete, observable anchor>; otherwise FAIL.
2. <criterion 2> — PASS if <concrete, observable anchor>; otherwise FAIL.
3. <criterion 3> — PASS if <concrete, observable anchor>; otherwise FAIL.

OUTPUT (structured)
reasoning: <2–3 sentence critique>
per_criterion: [ {id: 1, verdict: pass|fail, why: "…"}, … ]
verdict: pass|fail       # pass only if all required criteria pass
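
One way to consume this output is to recompute the overall verdict from the per-criterion results rather than trusting the judge's own summary line. The sketch below assumes the judge is asked to serialize the fields above as JSON and that every listed criterion is required; both are assumptions, and `overall_verdict` is a hypothetical helper name.

```python
import json


def overall_verdict(judge_json: str) -> str:
    """Derive pass/fail from per_criterion; pass only if every criterion passes.

    Assumes JSON with the same field names as the OUTPUT block
    (an illustrative choice, not something the template mandates).
    """
    data = json.loads(judge_json)
    verdicts = [str(c["verdict"]).strip().lower() for c in data["per_criterion"]]
    if not verdicts or any(v not in ("pass", "fail") for v in verdicts):
        raise ValueError("per_criterion is empty or contains an unknown verdict")
    return "pass" if all(v == "pass" for v in verdicts) else "fail"


example = json.dumps({
    "reasoning": "Accurate summary, but it omits the required citation.",
    "per_criterion": [
        {"id": 1, "verdict": "pass", "why": "Matches the reference."},
        {"id": 2, "verdict": "fail", "why": "No citation given."},
    ],
    "verdict": "pass",  # the judge's own summary disagrees with its criteria
})
print(overall_verdict(example))
```

Rejecting malformed verdicts instead of guessing keeps parsing errors from being counted as passes.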

VALIDATION (before automating)
- Have one principal domain expert grade the same 30–50 outputs the judge grades.
- Iterate this prompt until judge-vs-expert agreement is high; measure it with precision/recall and Cohen's kappa.
- Re-check periodically for drift.
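
The agreement numbers above can be computed with a few lines of plain Python. This is a sketch under stated assumptions: labels are the strings "pass" and "fail", the expert's grade is treated as ground truth, and "fail" is taken as the positive class (the thing the judge is meant to catch). The function name `agreement` and the sample labels are hypothetical.

```python
def agreement(judge: list[str], expert: list[str], positive: str = "fail") -> dict[str, float]:
    """Precision, recall, and Cohen's kappa for binary judge-vs-expert labels."""
    if not judge or len(judge) != len(expert):
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(judge, expert))
    tp = sum(j == positive and e == positive for j, e in pairs)
    fp = sum(j == positive and e != positive for j, e in pairs)
    fn = sum(j != positive and e == positive for j, e in pairs)
    tn = len(pairs) - tp - fp - fn
    n = len(pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    observed = (tp + tn) / n
    # Agreement expected by chance, from each rater's marginal label rates.
    expected = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    kappa = (observed - expected) / (1 - expected) if expected != 1 else 1.0
    return {"precision": precision, "recall": recall, "kappa": kappa}


# Made-up labels for illustration only.
judge_labels = ["pass", "pass", "fail", "fail", "pass", "fail"]
expert_labels = ["pass", "fail", "fail", "fail", "pass", "pass"]
print(agreement(judge_labels, expert_labels))
```

Kappa is worth reporting alongside raw agreement because it discounts the agreement two graders would reach by chance, which matters when most outputs pass.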

---
From evaldrivendevelopment.dev/kit
