Evals are to LLM apps what unit tests are to code, with the same non-negotiable status. Without evals, you're guessing whether a new model, prompt, or RAG config is better. With evals, you ship with confidence.
A minimum viable eval: 50-100 labeled input/expected-output pairs, an automatic scorer (exact match, fuzzy match, LLM-as-judge, or human rubric), and a runner that produces a pass rate. Run it before every prompt change, every model upgrade, and every significant retrieval tweak. Frameworks: Braintrust, LangSmith, Humanloop, Anthropic evals, or homegrown.
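The scorer can be as simple as string similarity. A minimal fuzzy-match scorer, sketched with Python's standard `difflib` (the 0.8 threshold is an assumption to tune per task, and the function name is illustrative):

```python
from difflib import SequenceMatcher

def fuzzy_score(expected: str, output: str, threshold: float = 0.8) -> int:
    """Return 1 if output is close enough to expected, else 0.

    Compares lowercased, stripped strings with difflib's ratio (0.0-1.0);
    the threshold is a starting point, not a universal constant.
    """
    ratio = SequenceMatcher(None, expected.lower().strip(),
                            output.lower().strip()).ratio()
    return 1 if ratio >= threshold else 0
```

For example, `fuzzy_score("Refund issued", "refund issued")` passes despite the case difference, while an unrelated answer scores 0.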
Example
```python
# Minimal eval runner
import json

# JSONL holds one JSON object per line, so parse line by line
with open("evals/support-tickets.jsonl") as f:
    test_cases = [json.loads(line) for line in f if line.strip()]

total = correct = 0
for case in test_cases:
    output = call_llm(case["input"])              # your model call
    score = llm_judge(case["expected"], output)   # returns 0 or 1
    total += 1
    correct += score

print(f"Pass: {correct}/{total} ({100*correct/total:.1f}%)")
# Ship-gating rule: no prompt change lands if eval score drops > 2 points.
```

When to use it
- Any LLM feature that will ship -- evals are the price of admission
- Comparing prompt variants or models
- Detecting regressions when a vendor updates their model
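The ship-gating rule can be mechanized as a CI check. A sketch, assuming pass rates are expressed as percentages and using the 2-point threshold from the rule above (the function name is illustrative):

```python
def ship_gate(baseline_pct: float, candidate_pct: float,
              max_drop: float = 2.0) -> bool:
    """Return True if the candidate change may ship.

    A change lands only if its eval pass rate hasn't dropped more than
    max_drop percentage points below the current baseline.
    """
    return (baseline_pct - candidate_pct) <= max_drop
```

In CI, a failing gate (e.g. baseline 91.0%, candidate 88.0%) blocks the merge; a small dip within the threshold (91.0% to 89.5%) is allowed through.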
When NOT to use it
- Prototype stage -- a ~10-case eval is fine, don't over-invest before product-market fit
- As a substitute for human review in high-stakes decisions
