Evals

Automated test suites for LLM outputs -- measuring accuracy, safety, or task-specific quality metrics against labeled inputs, so you know when a prompt or model change helps or regresses.

First published April 14, 2026

Evals are to LLM apps what unit tests are to code, with the same non-negotiable status. Without evals, you're guessing whether a new model, prompt, or RAG config is better. With evals, you ship with confidence.

A minimum viable eval: 50-100 labeled input/expected-output pairs, an automatic scorer (exact match, fuzzy match, LLM-as-judge, or human rubric), and a runner that produces a pass/fail score. Run it before every prompt change, every model upgrade, every significant retrieval tweak. Frameworks: Braintrust, LangSmith, Humanloop, Anthropic evals, or homegrown.
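The labeled pairs are commonly stored one JSON object per line (JSONL). A minimal sketch of building and reading such a file — the filename, field names, and case contents here are illustrative, not from any specific framework:

```python
import json

# Hypothetical labeled cases: each pairs an input with its expected output.
cases = [
    {"input": "My order #1234 never arrived.", "expected": "shipping_issue"},
    {"input": "How do I reset my password?", "expected": "account_access"},
]

# Write one JSON object per line (the JSONL convention).
with open("support-tickets.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")

# Read it back with one json.loads per line, never on the whole file.
loaded = [json.loads(line) for line in open("support-tickets.jsonl")]
print(len(loaded))
```

Keeping one object per line makes the file diff-friendly and lets you append new cases without rewriting it.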

Example

# Minimal eval runner
import json

test_cases = [json.loads(line) for line in open("evals/support-tickets.jsonl")]
total = correct = 0
for case in test_cases:
    output = call_llm(case["input"])
    score = llm_judge(case["expected"], output)  # assumed to return 1 (pass) or 0 (fail)
    total += 1
    correct += score
print(f"Pass: {correct}/{total} ({100*correct/total:.1f}%)")

# Ship-gating rule: no prompt change lands if eval score drops > 2 points.
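During development, the LLM-as-judge call can be swapped for a cheaper deterministic scorer. A fuzzy-match variant using only the standard library — the threshold value is an illustrative assumption, tune it against your own data:

```python
from difflib import SequenceMatcher

def fuzzy_score(expected: str, output: str, threshold: float = 0.8) -> int:
    """Return 1 if output is close enough to expected, else 0.

    Uses difflib's similarity ratio; the 0.8 cutoff is a guess to tune.
    """
    ratio = SequenceMatcher(None, expected.lower(), output.lower()).ratio()
    return int(ratio >= threshold)

print(fuzzy_score("shipping issue", "Shipping Issue"))  # case-insensitive match
print(fuzzy_score("shipping issue", "billing question"))
```

A deterministic scorer like this is free and reproducible, at the cost of missing semantically correct answers phrased differently — which is exactly where LLM-as-judge earns its cost.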

When to use it

  • Any LLM feature that will ship -- evals are the price of admission
  • Comparing prompt variants or models
  • Detecting regressions when a vendor updates their model
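The ship-gating rule mentioned earlier (block any change that drops the eval score by more than 2 points) can be enforced as a small CI check. A sketch, assuming scores are percentages and the function name is our own:

```python
def may_ship(baseline_pct: float, candidate_pct: float, max_drop: float = 2.0) -> bool:
    """Return True if the candidate's eval score has not regressed
    more than max_drop percentage points below the baseline."""
    return candidate_pct >= baseline_pct - max_drop

print(may_ship(91.0, 89.5))  # within tolerance
print(may_ship(91.0, 88.0))  # 3-point regression: blocked
```

Wire the boolean into your CI exit code so regressions fail the build rather than relying on someone reading the report.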

When NOT to use it

  • Prototype stage -- a ~10-case eval is fine, don't over-invest before product-market fit
  • As a substitute for human review in high-stakes decisions