LLM-as-judge is how you scale evals past the point where humans can review every sample. A stronger model scores the task model's output against a rubric: rate factual accuracy 1-5, pick the better of two answers, flag hallucinations.
It's reliable when the judge is materially stronger than the task model (GPT-5 judging Llama-3 outputs is legit; GPT-5 judging itself is noisy). Biases to watch: position bias (the first option shown wins more often), length bias (longer answers score higher), self-preference (models favor their own style). Mitigations: randomize option order, normalize lengths, use a judge from a different model family.
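The order-randomization mitigation can be made stricter: query the judge twice with the two options swapped and only accept a verdict that survives the swap. A minimal sketch, where `judge(first, second)` is a hypothetical callable standing in for your actual judge-model call:

```python
def debiased_pairwise(judge, answer_a, answer_b):
    """Run the judge in both orders; a verdict counts only if it is
    consistent across the swap, which filters out position-bias wins.

    `judge(first, second)` is assumed to return "first" or "second".
    """
    v1 = judge(answer_a, answer_b)   # A shown first
    v2 = judge(answer_b, answer_a)   # B shown first
    a_wins_first_pass = (v1 == "first")
    a_wins_second_pass = (v2 == "second")
    if a_wins_first_pass and a_wins_second_pass:
        return "A"
    if not a_wins_first_pass and not a_wins_second_pass:
        return "B"
    return "tie"  # verdict flipped with order: position bias, discard

# Stand-in judge with maximal position bias: always prefers the first option.
biased_judge = lambda first, second: "first"
print(debiased_pairwise(biased_judge, "answer one", "answer two"))  # tie
```

A judge that always picks whatever is shown first produces a tie here instead of a spurious win, which is exactly the failure mode you want surfaced.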
Example judge prompt
Task: Rate the following answer to a customer question.
Rubric (each scored 1-5):
- Accuracy: factually correct per the product documentation below
- Helpfulness: addresses the user's real concern
- Concision: says only what is necessary
Documentation: [...]
Question: [...]
Answer to rate: [...]
Return JSON: {accuracy, helpfulness, concision, notes}
When to use it
- Scaling evals beyond human-review capacity
- Production quality gates (auto-reject outputs that score below threshold)
- Comparative evals (A/B between prompt variants)
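A production quality gate over the judge's verdict can be a few lines. This is a sketch under assumptions: the field names match the rubric above, and the thresholds are arbitrary examples, not recommendations:

```python
import json

# Hypothetical per-rubric minimums; tune these against a labeled sample.
THRESHOLDS = {"accuracy": 4, "helpfulness": 3, "concision": 3}

def passes_gate(judge_json: str) -> bool:
    """Auto-reject an output whose judge scores fall below any threshold."""
    try:
        scores = json.loads(judge_json)
    except json.JSONDecodeError:
        return False  # unparseable verdict: fail closed, route to review
    return all(scores.get(k, 0) >= floor for k, floor in THRESHOLDS.items())

verdict = '{"accuracy": 5, "helpfulness": 4, "concision": 2, "notes": "rambling"}'
print(passes_gate(verdict))  # False: concision 2 is below the floor of 3
```

Failing closed on malformed JSON matters: judges occasionally return prose around the JSON, and silently passing those outputs defeats the gate.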
When NOT to use it
- The judge is the same model as the task model -- self-preference poisons results
- The rubric is ambiguous (judge scores vary by run)
- High-stakes decisions that need verified human sign-off
