Building Retry Logic Into Agent Prompts: Self-Healing Workflows That Don't Need Human Intervention
Teach your agent to detect its own failures, diagnose the cause, and try a different approach
An agent that's 85% accurate per step will fail 4 out of 5 times on a 10-step task. That's not pessimism. That's Lusser's Law: a chain's reliability is the product of its per-step reliabilities, so with uniform accuracy, P(success) = accuracy^steps. Even at 99% per-step reliability, chaining 10 steps drops you to 90.4%. At 100 steps, you're looking at a 63% overall failure rate.
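A quick sanity check of the arithmetic:

```python
# Lusser's Law: chained reliability is the product of per-step reliabilities.
# With uniform per-step accuracy, that collapses to accuracy ** steps.
def chain_success(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

print(f"{chain_success(0.85, 10):.1%}")   # ~19.7% success: fails 4 out of 5 runs
print(f"{chain_success(0.99, 10):.1%}")   # ~90.4% success
print(f"{chain_success(0.99, 100):.1%}")  # ~36.6% success: a 63% failure rate
```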
The math makes one thing obvious. If you're building multi-step agent workflows, failure isn't a bug. It's the default outcome. The question isn't whether your agent will fail. It's whether it knows what to do when it does.
Most Retry Logic Solves the Wrong Problem
Here's the uncomfortable truth before we get into patterns: research on multi-agent LLM systems shows that nearly 79% of failures originate from specification ambiguity and coordination issues, not from individual step execution. Your agent didn't fail because the API timed out. It failed because it was doing the wrong thing in the first place.
Naive retry logic -- "if it fails, try again" -- makes this worse. A practitioner documented a production system where a task retried 40+ times because the validator was checking whether output existed, not whether it was correct. Any non-empty response passed validation. The agent kept producing garbage, the validator kept approving it, and the loop kept spinning.
Before you build retry logic, you need to answer: is this failure retriable? A timeout is retriable. A hallucinated plan is not. The first pattern handles exactly this distinction.
Pattern 1: Error Classification Before Retry
The most common mistake is treating all failures the same. Your prompt needs to teach the agent to classify what went wrong before deciding what to do about it.
The Prompt:
You are executing step {current_step} of {total_steps}: {step_description}
If this step fails or produces unexpected output, classify the failure before retrying:
TRANSIENT: External service timeout, rate limit, or temporary unavailability.
Action: Retry the same approach after a brief pause. Max 3 retries.
MALFORMED: Your output doesn't match the required schema or format.
Action: Re-read the output requirements and regenerate. Do NOT copy from your previous attempt.
LOGICAL: The approach itself is flawed -- wrong assumption, missing context, or contradictory requirements.
Action: Do NOT retry the same approach. State what assumption failed and propose an alternative approach. If no alternative exists, report the failure with your diagnosis.
TERMINAL: The task is impossible given current constraints (missing permissions, nonexistent resource, contradictory instructions).
Action: Stop immediately. Report what's blocking you and what would need to change.
Before any retry, state: "Failure type: [TYPE]. Reason: [one sentence]. Next action: [what you'll do differently]."
Why This Works: Forcing the agent to classify and verbalize the failure type before acting prevents the "retry the same broken thing" loop. The classification categories map to genuinely different recovery strategies. Transient failures need patience. Malformed output needs re-reading the spec. Logical failures need a new approach entirely. Terminal failures need a human.
Expected Output:
Failure type: MALFORMED. Reason: JSON output included markdown code fences that break parsing. Next action: Regenerating output with strict JSON-only formatting, no wrapper text.
Note the instruction "Do NOT copy from your previous attempt" on malformed errors. Without it, agents will often reproduce the same structural mistake with cosmetic changes.
Pattern 2: The Verification Checkpoint
PwC found that adding independent validation agents to their CrewAI architecture improved accuracy from 10% to 70%. You don't need a separate agent to get most of this benefit. You can build verification directly into your prompt chain.
The Prompt:
You just completed: {step_description}
Your output was: {step_output}
Before proceeding to the next step, verify your own output:
1. Does this output actually satisfy what was asked? Re-read the original request: "{original_request}"
2. Does this output contain any claims, data, or specifics you invented rather than derived from the input?
3. If the next step uses this output as input, will it have everything it needs? What's the next step expecting?
If ANY check fails:
- State which check failed and why
- Regenerate the output addressing the specific issue
- Do not proceed until all three checks pass
If all checks pass, state "Verified" and continue.
Why This Works: Most agent failures cascade. A slightly wrong output at step 3 becomes a completely wrong output at step 7. This checkpoint catches drift early. The third check is especially important -- it forces the agent to think about downstream dependencies rather than just whether the current output "looks right" in isolation.
Expected Output:
Check 1: Pass -- output provides the requested API endpoint mapping. Check 2: FAIL -- I stated the /users endpoint accepts PATCH requests, but the input specification only mentions GET and POST. This was invented. Regenerating with corrected HTTP methods...
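Wiring this checkpoint between steps can be as simple as the sketch below. It assumes the agent answers with "Verified" when all three checks pass and marks failures with "FAIL", per the prompt; `verify` and `regenerate` are placeholders for LLM calls.

```python
def passes_verification(report: str) -> bool:
    """The prompt asks the agent to say 'Verified' only when all checks pass."""
    return "Verified" in report and "FAIL" not in report

def checkpoint(step_output: str, verify, regenerate, max_regens: int = 2) -> str:
    """Gate a step's output: regenerate on failed checks, never proceed unverified."""
    report = verify(step_output)
    regens = 0
    while not passes_verification(report):
        if regens >= max_regens:
            raise RuntimeError("output still failing verification; halting chain")
        step_output = regenerate(step_output, report)  # fix the flagged issue
        regens += 1
        report = verify(step_output)
    return step_output
```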
Pattern 3: Retry With Modified Approach
This is where most retry implementations fall apart. Retrying the same prompt with the same context tends to reproduce the same failure, perhaps with cosmetic variation. You need to structurally change the approach on each retry.
The Prompt:
You are on attempt {attempt_number} of {max_attempts} for: {task_description}
{if attempt_number > 1}
Previous attempts and their failures:
{for each previous attempt}
- Attempt {n}: {approach_taken}. Failed because: {failure_reason}
{end for}
You MUST use a different approach than all previous attempts. Specifically:
- Do not reuse the same reasoning structure
- If a previous attempt failed on data interpretation, try a different data source or interpretation method
- If a previous attempt produced the wrong format, start from the format requirements and work backward
State your new approach in one sentence before executing it.
{end if}
If this is attempt {max_attempts} and it fails, do not retry. Instead, return:
- What you were trying to accomplish
- What approaches you tried and why each failed
- Your best partial result, clearly marked as incomplete
- What a human would need to resolve this
Why This Works: Feeding back previous failure reasons forces genuine variation between attempts. The "state your new approach in one sentence" requirement makes the agent commit to a different strategy rather than making cosmetic tweaks. And the graceful degradation at max retries ensures you always get useful diagnostic output instead of an empty failure.
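Rendering the attempt history into that template is straightforward. A minimal sketch, with illustrative field names and the history as a list of (approach, reason) pairs:

```python
def build_retry_prompt(task: str, attempt_number: int, max_attempts: int,
                       history: list[tuple[str, str]]) -> str:
    """Render the retry prompt, feeding back every prior approach and failure."""
    lines = [f"You are on attempt {attempt_number} of {max_attempts} for: {task}"]
    if attempt_number > 1:
        lines.append("Previous attempts and their failures:")
        for n, (approach, reason) in enumerate(history, start=1):
            lines.append(f"- Attempt {n}: {approach}. Failed because: {reason}")
        lines.append("You MUST use a different approach than all previous attempts.")
        lines.append("State your new approach in one sentence before executing it.")
    return "\n".join(lines)
```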
Pattern 4: Circuit Breaker for Multi-Step Chains
Borrowed from microservices architecture, the circuit breaker pattern prevents a failing step from burning through your entire token budget. This is the pattern that prevents runaway costs -- a real problem documented in the AppWorld benchmark, where the absence of retry limits caused an agent to burn tokens indefinitely on a fundamentally broken approach.
The Prompt:
You are executing a {total_steps}-step workflow. Track your reliability as you go.
After each step, record: step number, success/failure, and attempt count.
Rules:
- If any single step takes more than 3 attempts: STOP. Report the bottleneck.
- If 2 or more steps have failed at least once in the last 5 steps: STOP. The workflow is unstable.
- If your total attempt count across all steps exceeds {total_steps * 2}: STOP. You're spending more effort on recovery than on the actual work.
When you stop, provide:
1. Steps completed successfully (with outputs)
2. The failing step and your diagnosis
3. Whether the completed steps have standalone value or depend on the full chain
Do not apologize. Do not retry past the limits. The limits exist because continuing past them wastes resources without improving outcomes.
Why This Works: The three circuit breaker conditions catch different failure modes. The single-step limit catches hard blockers. The rolling failure rate catches cascading instability. The total attempt budget catches death-by-a-thousand-cuts. That last instruction -- "Do not apologize. Do not retry past the limits" -- is necessary because models will often try to be "helpful" by pushing past stated boundaries.
Putting It Together: A Complete Self-Healing Chain
In practice, you combine these patterns. Error classification feeds into the retry-with-modified-approach pattern. Verification checkpoints run between steps. The circuit breaker wraps the whole thing.
The Prompt:
You are an autonomous agent executing a multi-step task. Your workflow:
Task: {task_description}
Steps: {step_list}
For each step:
1. Execute the step
2. Verify your output (Does it satisfy the request? Is everything derived from input, not invented? Will the next step have what it needs?)
3. If verification fails, classify the failure (TRANSIENT/MALFORMED/LOGICAL/TERMINAL) and handle accordingly
4. Record: step number, success/failure, attempts used
Retry rules:
- TRANSIENT: Retry same approach, max 3 times
- MALFORMED: Regenerate from requirements, max 2 times
- LOGICAL: New approach required, explain what changed, max 2 times
- TERMINAL: Stop and report immediately
Circuit breakers:
- Any step exceeds its retry limit: Stop, report completed work and diagnosis
- 2+ failures in last 5 steps: Stop, the workflow is unstable
- Total attempts exceed {len(steps) * 2}: Stop, resource limit reached
On any stop condition, return completed outputs (clearly marked as partial if the chain is incomplete), a diagnosis of what failed, and what would need to change for success.
Why This Works: This is the full pattern -- classification, verification, escalating retry strategies, and hard circuit breakers. The agent has clear rules for every failure mode, explicit limits that prevent runaway execution, and a requirement to always produce useful output even when it can't complete the full task.
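As a rough sketch of the orchestrator side, the combined loop might look like this. `run_step` and `verify_output` are placeholders for LLM calls; the limits mirror the prompt, and every stop path returns diagnostics rather than an empty failure.

```python
def run_chain(steps, run_step, verify_output, per_step_limit: int = 3):
    """Execute a step chain with per-step retries, verification, and
    circuit-breaker limits. Always returns completed work plus a diagnosis."""
    max_total = len(steps) * 2  # total attempt budget from the prompt's rule
    completed, total_attempts = [], 0
    for i, step in enumerate(steps, start=1):
        attempts = 0
        while True:
            attempts += 1
            total_attempts += 1
            if total_attempts > max_total:
                return {"status": "stopped", "reason": "resource limit reached",
                        "failed_step": i, "completed": completed}
            output = run_step(step)
            if verify_output(step, output):
                completed.append((i, output))
                break
            if attempts >= per_step_limit:
                return {"status": "stopped",
                        "reason": f"step {i} exceeded {per_step_limit} attempts",
                        "failed_step": i, "completed": completed}
    return {"status": "done", "completed": completed}
```

Even on failure, the caller gets the completed outputs and the failing step, which is exactly the "fails usefully" property the prompt is designed to produce.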
The Honest Caveat
Retry logic makes your agents more resilient. It doesn't make them reliable. If your 10-step chain fails 80% of the time, retry logic might get you to 60% failure rate. That's a real improvement. But if you need 95%+ reliability, the answer isn't better retries. It's shorter chains, typed contracts between steps, and independent validation.
The patterns in this post are the difference between an agent that fails silently and one that fails usefully. That's a real distinction. A failed run that tells you why it failed and what it completed is worth ten failed runs that just return an error code.
Start with error classification. It's the smallest change with the biggest impact. Once your agent can tell the difference between "try again" and "try differently," everything else follows.
Want hands-on training on building resilient agent workflows for your team? Connect with Kief Studio on Discord or schedule a session.