agentic • May 14, 2026 • 8 min read

Agent Outcome Rubrics Are Just Prompts: How to Write Grading Criteria That Actually Fail Bad Output

Anthropic shipped the grading loop -- the rubric is now the hardest prompt you'll write

On May 6, 2026, Anthropic held a keynote with no new model announcement. Dario Amodei opened with: "No new model today. Today is about how we are making our products work better for you." What they shipped instead was Outcomes -- a grading loop for Claude Managed Agents that improved task success on .pptx generation by 10.1% and .docx by 8.4%, with zero model changes.

The architecture is simple. A task agent does the work. A separate grading agent, running in its own context window with no access to the task agent's reasoning chain, scores the output against a developer-written rubric. If it fails, the grader returns per-criterion gaps and kicks the output back. Default iteration cap is 3. Max is 20.
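Concretely, the loop has roughly the shape sketched below. This is not the Outcomes API, just a minimal reconstruction using the Anthropic Messages API; the model name, the "Overall: PASS" convention, and the helper names are placeholders you'd swap for whatever your runtime actually uses.

The Loop (Python sketch):

# Minimal sketch of an outcome-grading loop (not the Outcomes API itself).
# The grader runs as a fresh model call whose context holds only the rubric
# and the artifact -- never the task agent's reasoning.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"   # placeholder model name
MAX_ITERATIONS = 3            # Outcomes default; configurable up to 20

def run_task_agent(task_prompt: str, revision_notes: str = "") -> str:
    prompt = task_prompt
    if revision_notes:
        prompt += "\n\nA reviewer flagged these gaps. Fix them:\n" + revision_notes
    msg = client.messages.create(model=MODEL, max_tokens=4096,
                                 messages=[{"role": "user", "content": prompt}])
    return msg.content[0].text

def run_grading_agent(rubric: str, artifact: str) -> str:
    # Separate context window: only the rubric and the finished artifact go in.
    msg = client.messages.create(model=MODEL, max_tokens=1024,
                                 messages=[{"role": "user",
                                            "content": rubric + "\n\n--- ARTIFACT ---\n" + artifact}])
    return msg.content[0].text

def grade_and_retry(task_prompt: str, rubric: str) -> tuple[str, bool]:
    artifact = run_task_agent(task_prompt)
    for _ in range(MAX_ITERATIONS):
        report = run_grading_agent(rubric, artifact)
        if "Overall: PASS" in report:
            return artifact, True
        artifact = run_task_agent(task_prompt, revision_notes=report)
    return artifact, False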

That grading agent only knows what you tell it. The rubric is a prompt. And it's the most critical prompt in your entire agent workflow.

The rubber-stamp problem

The default failure mode isn't a strict grader. It's a grader that approves everything.

Write a rubric that says "verify the energy audit covers demand charges," and the grader will skim the output, find a paragraph that mentions demand charges, and approve. It won't check whether the numbers are correct, whether the methodology makes sense, or whether the section actually explains demand charges rather than just name-dropping the term.

This is the same failure mode you'd get with a vague prompt to any LLM. "Is this good?" produces "Yes, this looks good." The grading agent is an LLM. It responds to instruction quality the same way any model does.

Spivot, a marketing platform, learned this the hard way. Their AI-generated event pages had a 42% approval rate. The system was hallucinating attendee counts, assigning wrong cities, and miscategorizing events. When they added a QA grading agent with domain-specific rubrics -- including B2B event benchmarks like 2-5 attendees per exhibitor company and 10-50 attendees per speaker -- approval jumped to 95% at $0.10 per page. The rubric didn't just name what to check. It specified how to verify each criterion.

Why separation matters

You might ask: why not just have the task agent grade its own output? Research on LLM-as-a-judge shows that self-enhancement bias inflates scores by 5-7% when a model evaluates its own work. Verbosity bias adds another ~15%. GPT-4 shows up to 40% score inconsistency from position bias alone -- the order in which options appear changes the grade.

Anthropic's architecture addresses this directly. The grading agent runs in a separate context window. It never sees the task agent's chain-of-thought. It can only evaluate the artifact. This is the same principle behind code review: the person who wrote the code shouldn't be the only one who signs off on it.

Adding chain-of-thought reasoning to the grading agent doesn't help much, either. The RULERS framework (January 2026) found that CoT does not significantly improve self-consistency in LLM judges. The researchers recommend treating reasoning as "a configurable option rather than a universal default" for evaluation. What matters more is the rubric itself.

Three ways rubrics break

The RULERS framework identifies three failure modes that apply directly to Outcomes rubrics.

Rubric instability. Small changes in how you phrase a criterion produce different scores on the same output. If your rubric says "the code should be well-organized," one run might interpret that as "functions are grouped logically" and another as "files are named clearly." Lock your criteria to observable, countable properties.

Unverifiable reasoning. The grader assigns a score but can't point to specific evidence in the output that justifies it. This happens when criteria are abstract ("the document should be professional") instead of anchored to text ("the document uses formal register, avoids contractions, and includes section headers").

Scoring misalignment. Your rubric uses a 1-5 range, but the grader's confidence estimates don't map to meaningful distinctions between a 3 and a 4. Binary pass/fail per criterion is more reliable than numeric ranges for most use cases.
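One way to keep that binary discipline is to give the grader's report a fixed structure before anything downstream consumes it. The shape below is illustrative, not the Outcomes schema:

The Report Shape (Python sketch):

from dataclasses import dataclass

@dataclass
class CriterionResult:
    name: str       # e.g. "Breach Notification Accuracy"
    verdict: str    # "PASS" or "FAIL" -- no 1-5 scale to drift on
    evidence: str   # text quoted from the artifact that justifies the verdict

@dataclass
class GradeReport:
    criteria: list[CriterionResult]

    @property
    def passed(self) -> bool:
        # Overall PASS only if every criterion passes.
        return all(c.verdict == "PASS" for c in self.criteria)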

Writing rubrics that actually fail bad output

Here's a rubric for a file generation task -- say, producing a HIPAA compliance checklist. The naive version looks like this:

The Prompt (naive rubric):

Grade this HIPAA compliance checklist. Check that it covers:
- Administrative safeguards
- Physical safeguards
- Technical safeguards
- Breach notification requirements
Give a pass/fail grade.

Why This Fails: The grader will find section headers matching those four terms, confirm they exist, and pass. It won't verify completeness, accuracy, or whether the content under each header actually addresses the safeguard requirements. 73% of healthcare AI deployments fail HIPAA compliance because the checks are superficial -- they verify labels, not substance.

Here's the version that works:

The Prompt (effective rubric):

You are grading a HIPAA compliance checklist. Evaluate EACH criterion independently. For each, output PASS or FAIL with a one-sentence justification quoting the specific text that supports your judgment. If you cannot find supporting text, the criterion FAILS.

Criterion 1 - Administrative Safeguards Completeness:
The checklist MUST include specific steps for: (a) risk analysis procedures, (b) workforce training requirements with frequency, (c) contingency planning with RTO/RPO targets, (d) business associate agreement requirements. All four sub-items must be present with actionable detail, not just mentioned by name.

Criterion 2 - Technical Safeguards Specificity:
The checklist MUST specify: (a) encryption standards by name (e.g., AES-256, TLS 1.2+), (b) access control mechanisms (RBAC, MFA, session timeout values), (c) audit logging requirements including retention period. Vague statements like "use encryption" or "implement access controls" are FAIL.

Criterion 3 - Breach Notification Accuracy:
The checklist MUST state the 60-day notification window for covered entities (not 30, not 90). It must distinguish between notifications to individuals, HHS, and media (for breaches affecting 500+ individuals). Any incorrect timeframe or missing notification category is FAIL.

Criterion 4 - 2026 Regulatory Currency:
The checklist MUST reference that previously optional safeguards became mandatory in 2026, including AI-specific risk assessments. If it only covers pre-2026 requirements, FAIL.

Output format:
Criterion 1: PASS/FAIL - [justification with quoted text]
Criterion 2: PASS/FAIL - [justification with quoted text]
Criterion 3: PASS/FAIL - [justification with quoted text]
Criterion 4: PASS/FAIL - [justification with quoted text]

Overall: PASS only if ALL criteria pass. Otherwise FAIL, listing which criteria need revision.

Why This Works: Each criterion is independently evaluated with specific, verifiable conditions. The grader must quote evidence from the output -- it can't hand-wave approval. The criteria test for specific values (AES-256, 60-day window, 2026 mandate) that are either present or absent. There's no ambiguity about what a passing grade means.

Expected Output:

Criterion 1: FAIL - The checklist mentions "conduct regular risk analysis" but does not specify procedures. Workforce training section says "train employees on HIPAA" without frequency. Contingency planning is absent entirely. Business associate agreements are mentioned but without specific requirements.

Criterion 2: PASS - "All PHI at rest must use AES-256 encryption. PHI in transit requires TLS 1.2 or higher." Access controls specify "role-based access with mandatory MFA for all PHI systems, 15-minute session timeout." Audit logs "retained for minimum 6 years per federal requirement."

Criterion 3: FAIL - States "notify affected individuals within 30 days" -- the correct window is 60 days for covered entities. Does not distinguish media notification threshold for breaches affecting 500+ individuals.

Criterion 4: FAIL - No mention of 2026 regulatory changes or AI-specific risk assessment requirements.

Overall: FAIL - Criteria 1, 3, and 4 require revision.
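Locking the output format down this hard also makes the report trivially machine-readable. A few lines of parsing, assuming the exact "Criterion N: PASS/FAIL - justification" lines above, turn the grader's text into the per-criterion gaps that go back to the task agent:

The Parser (Python sketch):

import re

def parse_grade_report(report_text: str) -> list[dict]:
    # Each "Criterion N: PASS/FAIL - justification" line becomes one record.
    results = []
    for match in re.finditer(r"Criterion (\d+): (PASS|FAIL) - (.+)", report_text):
        results.append({
            "criterion": int(match.group(1)),
            "verdict": match.group(2),
            "justification": match.group(3).strip(),
        })
    return results

def failed_criteria(report_text: str) -> list[dict]:
    # Only the FAIL rows go back to the task agent as revision notes.
    return [r for r in parse_grade_report(report_text) if r["verdict"] == "FAIL"]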

Iteration caps are budget decisions

Every grading loop costs money. At 10,000 requests per day, running 10 iterations instead of 3 adds roughly $4,500 per month at mid-tier model pricing. The default cap of 3 isn't arbitrary -- for most tasks, meaningful improvement plateaus within a few cycles.
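The arithmetic is worth running against your own volumes before touching the cap. A back-of-envelope estimator, with the per-grade token count and per-token price as assumptions you'd replace with your real numbers:

The Estimator (Python sketch):

def monthly_grading_cost(requests_per_day: int, iterations: float,
                         tokens_per_grade: int = 700,
                         price_per_million_tokens: float = 3.00,  # assumed mid-tier rate
                         days: int = 30) -> float:
    # Every iteration adds one grading call; tokens_per_grade is a blended
    # input+output estimate for a single grader pass.
    calls = requests_per_day * iterations * days
    return calls * tokens_per_grade / 1_000_000 * price_per_million_tokens

extra = monthly_grading_cost(10_000, 10) - monthly_grading_cost(10_000, 3)
print(f"~${extra:,.0f}/month extra")  # roughly $4,400 with these assumptions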

Harvey, the legal AI company, saw task completion rates increase roughly 6x after implementing Outcomes with tight rubrics. But they paired this with Anthropic's dreaming feature, where agents learn from past failures. The rubric catches drift; dreaming keeps the agent from making the same mistake twice. Without both, you're paying for repeated corrections on the same class of error.

Set your iteration cap based on task complexity. A formatting check might need one retry at most. A legal document review might justify 5. If you're hitting the cap regularly, your rubric is probably too vague or your task prompt needs work -- the grading loop shouldn't be compensating for a bad initial prompt.

The eval gaming trap

If your rubric only checks word count, the task agent will chase word count. This isn't hypothetical -- there are documented cases of models gaming evaluation criteria at the expense of actual task quality.

The defense is criterion diversity. Include at least one criterion the task agent can't trivially satisfy by pattern matching. For a legal brief, that might be: "The brief must cite at least two case precedents by name, and each citation must be relevant to the specific legal question posed -- not just tangentially related to the topic." For a code generation task: "The generated function must handle the edge case where the input list is empty, returning an appropriate default rather than raising an exception."
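For criteria like that last one, you don't even have to take the grader's word for it. The no-exception half of the empty-list criterion can be checked mechanically alongside the rubric; the sketch below uses a bare exec for brevity, which you'd replace with a proper sandbox.

The Check (Python sketch):

def handles_empty_list(generated_source: str, func_name: str) -> bool:
    # Mechanically verify the "does not raise on empty input" half of the
    # criterion by actually running the generated function. Use a real
    # sandbox in production; exec on untrusted code is only for the sketch.
    namespace: dict = {}
    try:
        exec(generated_source, namespace)
        namespace[func_name]([])   # must not raise on an empty list
        return True
    except Exception:
        return False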

The Autorubric framework (March 2026) takes this further, operationalizing criteria as instance-specific checklists with categorical weights. The idea is that evaluation criteria should adapt to the specific task instance, not just the task type.
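In practice that can be as simple as deriving extra criteria from the task input itself, so every grading call gets checks specific to that instance. A hedged sketch of the idea (not the Autorubric implementation), with illustrative field names:

The Generator (Python sketch):

def instance_criteria(task_input: dict) -> list[str]:
    # Derive per-instance checks from the request, then append them to the
    # static rubric before the grading call. Field names are illustrative.
    criteria = []
    for precedent in task_input.get("required_precedents", []):
        criteria.append(
            f"The brief cites {precedent} by name and explains its relevance "
            f"to the specific legal question posed."
        )
    if jurisdiction := task_input.get("jurisdiction"):
        criteria.append(f"All statutory references are to {jurisdiction} law.")
    return criteria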

What this means for your work

The grading loop is infrastructure. Anthropic provides it. The rubric is your IP -- it encodes your domain expertise, your quality bar, your edge case knowledge. A rubric that says "check that it's good" is worth nothing. A rubric that specifies exactly what good means, with verifiable evidence requirements for each criterion, is the difference between 42% and 95% approval rates.

Start with your failure modes. What are the specific ways your agent's output goes wrong? Each failure mode becomes a rubric criterion. Each criterion needs a verification method, not just a label. And each one should be evaluated independently -- a document can nail the structure and completely botch the facts.

The rubric is the hardest prompt you'll write because it has to anticipate failure modes you haven't seen yet. But every failure you catch and encode makes the next run better.

If your team is building agent workflows and struggling with output quality, Kief Studio runs hands-on training sessions on rubric design and agentic prompt engineering. Connect with us on Discord or schedule a session.
