Agentic

LLM-as-Judge

Using an LLM to evaluate or score the output of another LLM (or the same one) against criteria -- automating quality control without human review in the loop.

First published April 14, 2026

LLM-as-judge is how you scale evals past the point where humans can review every sample. A strong model scores the task output against a rubric: rate factual accuracy 1-5, pick the better of two answers, detect hallucinations.

Reliable when the judge is materially stronger than the task model (GPT-5 judging Llama-3 outputs is legit; GPT-5 judging itself is noisy). Biases to watch: position bias (the first option shown wins more often), length bias (longer answers score higher), self-preference (models prefer their own style). Mitigations: randomize order, normalize lengths, use a different model family for the judge.
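The order-randomization mitigation can be sketched as a swap-and-compare check: judge the pair twice with the positions flipped, and only trust verdicts that agree across both orderings. Here `judge_fn` is a hypothetical stand-in for an actual judge-model call, not a real API:

```python
def pairwise_judge(judge_fn, answer_a, answer_b):
    """Debias a pairwise judgment against position bias.

    judge_fn(first, second) -> "first" or "second" is a hypothetical
    judge-model call; we invoke it twice with the order swapped and
    only accept a winner when both orderings agree.
    """
    v1 = judge_fn(answer_a, answer_b)  # A shown first
    v2 = judge_fn(answer_b, answer_a)  # B shown first

    # Map each verdict back to the underlying answer.
    w1 = "A" if v1 == "first" else "B"
    w2 = "B" if v2 == "first" else "A"

    if w1 == w2:
        return w1      # verdict is stable under reordering
    return "tie"       # position-dependent verdict: don't count it
```

A judge that always picks whichever answer is shown first will return "tie" every time under this scheme, which is exactly the failure you want surfaced rather than silently averaged in.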

Example Prompt

Judge prompt:

Task: Rate the following answer to a customer question.

Rubric (each scored 1-5):
- Accuracy: factually correct per the product documentation below
- Helpfulness: addresses the user's real concern
- Concision: says only what is necessary

Documentation: [...]
Question: [...]
Answer to rate: [...]

Return JSON: {accuracy, helpfulness, concision, notes}
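Parsing the judge's reply deserves the same rigor as the prompt. A minimal sketch, assuming the judge returns the four keys above; the fence-stripping and the 1-5 range check are defensive assumptions (judges sometimes wrap JSON in a markdown fence or emit out-of-range scores), not part of the prompt itself:

```python
import json

RUBRIC_KEYS = ("accuracy", "helpfulness", "concision")

def parse_judge_verdict(raw: str) -> dict:
    """Parse and validate the judge's JSON verdict.

    Raises ValueError on missing keys or out-of-range scores so a
    malformed judge run fails loudly instead of polluting the eval set.
    """
    raw = raw.strip()
    if raw.startswith("```"):
        # Strip a markdown code fence like ```json ... ```
        raw = raw.strip("`").removeprefix("json").strip()
    verdict = json.loads(raw)
    for key in RUBRIC_KEYS:
        score = verdict.get(key)
        if not isinstance(score, (int, float)) or not 1 <= score <= 5:
            raise ValueError(f"bad {key} score: {score!r}")
    return verdict
```

Failing loudly matters here: a judge that drifts into free-text answers or a 0-10 scale will otherwise corrupt downstream aggregates without anyone noticing.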

When to use it

  • Scaling evals beyond human-review capacity
  • Production quality gates (auto-reject outputs that score below threshold)
  • Comparative evals (A/B between prompt variants)
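The quality-gate use case reduces to a threshold check over the parsed rubric scores. A sketch, where the 4.0 threshold and the min-score policy (reject if any single dimension is weak) are illustrative assumptions, not prescriptions:

```python
RUBRIC_KEYS = ("accuracy", "helpfulness", "concision")

def passes_gate(verdict: dict, threshold: float = 4.0) -> bool:
    """Production quality gate: auto-reject outputs scoring below threshold.

    `verdict` is the judge's parsed JSON verdict (scores 1-5 per key).
    Gating on the minimum score means one weak dimension fails the
    output, even if the others are strong; an assumed policy choice.
    """
    return min(verdict[k] for k in RUBRIC_KEYS) >= threshold
```

For comparative A/B evals, the same scores aggregate differently: compare mean per-rubric scores (or pairwise win rates) across prompt variants rather than gating each sample.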

When NOT to use it

  • The judge is the same model as the task model -- self-preference poisons results
  • The rubric is ambiguous (judge scores vary by run)
  • High-stakes decisions that need verified human sign-off