LLM-as-judge is how you scale evals past the point where humans can review every sample. A stronger model scores the task model's output against a rubric: rate factual accuracy 1-5, pick the better of two answers, flag hallucinations.
It's reliable when the judge is materially stronger than the task model (GPT-5 judging Llama-3 outputs is legit; GPT-5 judging itself is noisy). Biases to watch: position bias (the first option shown wins more often), length bias (longer answers score higher), self-preference (models favor their own style). Mitigations: randomize option order, normalize lengths, use a judge from a different model family.
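The order-randomization mitigation can be made stricter: query the judge twice with the two options swapped and only accept a verdict that survives the swap. A minimal sketch, where `judge(first, second)` is a hypothetical callable standing in for your actual judge-model call:

```python
def debiased_pairwise(judge, answer_a, answer_b):
    """Run the judge in both orders; a verdict counts only if it is
    consistent across the swap, which filters out position-bias wins.

    `judge(first, second)` is assumed to return "first" or "second".
    """
    v1 = judge(answer_a, answer_b)   # A shown first
    v2 = judge(answer_b, answer_a)   # B shown first
    a_wins_first_pass = (v1 == "first")
    a_wins_second_pass = (v2 == "second")
    if a_wins_first_pass and a_wins_second_pass:
        return "A"
    if not a_wins_first_pass and not a_wins_second_pass:
        return "B"
    return "tie"  # verdict flipped with order: position bias, discard

# Stand-in judge with maximal position bias: always prefers the first option.
biased_judge = lambda first, second: "first"
print(debiased_pairwise(biased_judge, "answer one", "answer two"))  # tie
```

A judge that always picks whatever is shown first produces a tie here instead of a spurious win, which is exactly the failure mode you want surfaced.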
Example judge prompt
Task: Rate the following answer to a customer question.
Rubric (each scored 1-5):
- Accuracy: factually correct per the product documentation below
- Helpfulness: addresses the user's real concern
- Concision: says only what is necessary
Documentation: [...]
Question: [...]
Answer to rate: [...]
Return JSON: {accuracy, helpfulness, concision, notes}
When to use it
- Scaling evals beyond human-review capacity
- Production quality gates (auto-reject outputs that score below threshold)
- Comparative evals (A/B between prompt variants)
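A production quality gate over the judge's verdict can be a few lines. This is a sketch under assumptions: the field names match the rubric above, and the thresholds are arbitrary examples, not recommendations:

```python
import json

# Hypothetical per-rubric minimums; tune these against a labeled sample.
THRESHOLDS = {"accuracy": 4, "helpfulness": 3, "concision": 3}

def passes_gate(judge_json: str) -> bool:
    """Auto-reject an output whose judge scores fall below any threshold."""
    try:
        scores = json.loads(judge_json)
    except json.JSONDecodeError:
        return False  # unparseable verdict: fail closed, route to review
    return all(scores.get(k, 0) >= floor for k, floor in THRESHOLDS.items())

verdict = '{"accuracy": 5, "helpfulness": 4, "concision": 2, "notes": "rambling"}'
print(passes_gate(verdict))  # False: concision 2 is below the floor of 3
```

Failing closed on malformed JSON matters: judges occasionally return prose around the JSON, and silently passing those outputs defeats the gate.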
When NOT to use it
- The judge is the same model as the task model -- self-preference poisons results
- The rubric is ambiguous (judge scores vary by run)
- High-stakes decisions that need verified human sign-off
