# LLM-as-judge rubric template

Use for subjective quality that a deterministic check can't decide. Pin temperature to 0, randomize answer order, and validate against a human expert before you trust it.

More: https://evaldrivendevelopment.dev/llm-as-judge-evals-when-and-how

---

ROLE
You are evaluating the output of <task>. First reason briefly, then return a verdict.
Do not reward longer answers or the answer shown first.

REFERENCE (when available)
<reference answer>

CRITERIA: judge each as PASS or FAIL with a one-line reason:
1. <criterion>: PASS if <condition>; otherwise FAIL.
2. <criterion>: PASS if <condition>; otherwise FAIL.
3. <criterion>: PASS if <condition>; otherwise FAIL.

OUTPUT (structured)
reasoning: <2–3 sentence critique>
per_criterion: [ {id: 1, verdict: pass|fail, why: "…"}, … ]
verdict: pass|fail   # pass only if all required criteria pass

VALIDATION (before automating)
- Have one principal domain expert grade the same 30–50 outputs.
- Iterate this prompt until judge-vs-expert agreement is high (precision/recall; Cohen's kappa).
- Re-check periodically for drift.

From evaldrivendevelopment.dev/kit
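
The agreement check in the VALIDATION step can be sketched in a few lines of Python. This is a minimal illustration, not part of the kit: the function names `cohens_kappa` and `precision_recall` are hypothetical, the verdict lists are made-up sample data, and treating `fail` as the positive class (so precision and recall describe how well the judge catches bad outputs) is an assumption you may want to flip.

```python
from collections import Counter


def cohens_kappa(judge, expert):
    """Cohen's kappa between two equal-length lists of labels.

    Kappa corrects raw agreement for the agreement expected by chance,
    given how often each rater uses each label.
    """
    if len(judge) != len(expert) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    n = len(judge)
    observed = sum(j == e for j, e in zip(judge, expert)) / n
    judge_freq = Counter(judge)
    expert_freq = Counter(expert)
    labels = set(judge) | set(expert)
    expected = sum(judge_freq[l] * expert_freq[l] for l in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined here, so report full agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


def precision_recall(judge, expert, positive="fail"):
    """Precision and recall of the judge, with the expert as ground truth.

    `positive` is the label counted as a detection (assumed: "fail").
    """
    pairs = list(zip(judge, expert))
    tp = sum(j == positive and e == positive for j, e in pairs)
    fp = sum(j == positive and e != positive for j, e in pairs)
    fn = sum(j != positive and e == positive for j, e in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical sample: overall verdicts for the same 10 outputs.
expert_verdicts = ["pass", "pass", "fail", "fail", "pass",
                   "fail", "pass", "pass", "fail", "pass"]
judge_verdicts = ["pass", "pass", "fail", "pass", "pass",
                  "fail", "pass", "fail", "fail", "pass"]

kappa = cohens_kappa(judge_verdicts, expert_verdicts)
precision, recall = precision_recall(judge_verdicts, expert_verdicts)
print(f"kappa={kappa:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Raw percent agreement can look high when most outputs pass, which is why the template asks for kappa alongside precision and recall. The same functions can be run per criterion by passing the per-criterion verdicts instead of the overall ones.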