MCP response bloat June 25, 2026 • 6 min read

Your Agent Reads the Transcript Twice: Stop Piping Tool Outputs Through Context and Filter Them in Code

The 'response bloat' tax quietly doubles your token bill. Here's the prompt pattern that keeps raw tool payloads out of the model's working memory.

Here's a workflow that looks harmless. Your agent pulls a sales-meeting transcript out of Google Drive, then writes the relevant parts into a Salesforce record. Two tool calls. Done.

Now count the tokens. The transcript comes back from Drive and lands in the model's context. Then the model turns around and sends that same transcript into the Salesforce call. The full payload passes through the model twice. Anthropic measured this exact case: the naive version burned around 150,000 tokens. Rewritten so the data gets handled in code instead of routed through the model, it dropped to about 2,000. That's a 98.7% cut on one task.
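The percentage is easy to check. A quick sketch of the arithmetic, using the two approximate token counts quoted above (the variable names are just for illustration):

```python
# Approximate figures quoted in the paragraph above.
naive_tokens = 150_000      # transcript routed through the model
filtered_tokens = 2_000     # transcript handled in code

saved = naive_tokens - filtered_tokens
reduction_pct = round(100 * saved / naive_tokens, 1)

print(f"{saved:,} tokens saved, {reduction_pct}% reduction")
# prints: 148,000 tokens saved, 98.7% reduction
```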

Most people who optimize agents go after the wrong cost first. Let me show you the one that actually matters, and the prompt pattern that fixes it.

Two taxes, not one

When your agent uses tools, you pay twice.

The first tax is schema bloat: the cost of loading every tool definition into context before any work starts. This one got all the attention. People learned that a single GitHub MCP server can register 93 tools and eat roughly 55,000 tokens before the user types anything. In one real deployment, three servers consumed 143,000 of a 200,000-token window just sitting there. So the ecosystem built lazy-loading, tool search, and definition compression. Good. That tax is real.

The second tax is response bloat: the cost of tool outputs flowing back through context on every hop. This one is quieter and usually bigger. A ten-step workflow doesn't just pay for the latest output. Every prior result rides along in context on each new call. The transcript that came back in step two is still there in step seven, being re-read, being re-billed.
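To see how fast the re-billing compounds, here is a rough model of a ten-step loop. It is a sketch under simplifying assumptions: a hypothetical 1,000-token base prompt, made-up step sizes, and no accounting for model output tokens or prompt caching. The only thing it illustrates is that every tool result stays in context for every later call.

```python
def billed_input_tokens(step_outputs, base_prompt=1_000):
    """Total input tokens billed across a loop in which every prior
    tool output stays in context for each later model call."""
    total = 0
    context = base_prompt
    for output_tokens in step_outputs:
        total += context          # the model re-reads everything so far
        context += output_tokens  # this step's result joins the context
    return total

# Hypothetical workflow: a 50,000-token transcript comes back at step two,
# every other step returns 500 tokens.
raw = [500, 50_000] + [500] * 8

# Same workflow, but the transcript is reduced to 500 tokens in code first.
filtered = [500] * 10

print(billed_input_tokens(raw))       # 428500
print(billed_input_tokens(filtered))  # 32500
```

In this toy model the transcript is paid for eight more times after it arrives, once for each remaining step.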

If your tool output is larger than your reasoning, you're prompting the loop wrong.

Why this is an accuracy problem, not just a billing one

It would be easy to frame this as "your bill doubles, spend less." That's not the real reason to care.

Stuffing context with raw payloads makes the model worse at its job. The "lost in the middle" study from Stanford and collaborators found that model performance can drop by more than 20% when the relevant detail sits buried in the middle of a long context. A 50,000-token transcript wedged between your instruction and the model's next decision is exactly that kind of burial.

So filtering outputs in code isn't a cost hack. It keeps the model's working memory small enough to actually reason over. Cheaper is a side effect. Sharper is the point.

The pattern: transform in the environment, return a handle

The fix has a clean shape. The raw payload should land in the execution environment, get reduced there, and only a small result, or a reference to where the big thing lives, should come back to the model.

Anthropic's own framing makes it concrete. Picture checking budget compliance across 20 employees. The naive approach does 20 round-trips and drags thousands of expense line items through context. The code approach runs one script that does all 20 lookups, filters inside the runtime, and returns only the employees who went over. The model reasons over a handful of lines instead of hundreds of kilobytes.
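A minimal sketch of what that one script could look like. Everything here is hypothetical: `fetch_expenses` stands in for whatever expense tool your agent has, and the fake data exists only so the example runs. It is not Anthropic's code, just the shape of the idea.

```python
from typing import Callable

def over_budget(
    employee_ids: list[str],
    fetch_expenses: Callable[[str], list[dict]],
    budgets: dict[str, int],
) -> list[dict]:
    """Run every lookup inside the runtime and return only the violators."""
    flagged = []
    for emp in employee_ids:
        # The line items stay in this loop; they never reach the model.
        spent = sum(item["amount"] for item in fetch_expenses(emp))
        if spent > budgets[emp]:
            flagged.append({"employee": emp, "spent": spent, "budget": budgets[emp]})
    return flagged

# Stand-in data: 20 employees with 50 line items each (1,000 items total).
ids = [f"emp{i:02d}" for i in range(1, 21)]

def fake_fetch(emp_id: str) -> list[dict]:
    n = int(emp_id[3:])
    return [{"amount": n, "category": "travel"} for _ in range(50)]

budgets = {emp: 900 for emp in ids}

result = over_budget(ids, fake_fetch, budgets)
print(result)
# [{'employee': 'emp19', 'spent': 950, 'budget': 900},
#  {'employee': 'emp20', 'spent': 1000, 'budget': 900}]
```

A thousand line items go in, two short records come out, and only those two records cost context tokens.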

This is now a shipped feature, not a trick. Programmatic Tool Calling reached general availability with Claude Sonnet 4.6, and the recommended code_execution_20260120 tool version no longer needs a beta header. On a 75-tool benchmark, Anthropic measured about 38% fewer billed input tokens with no drop in task accuracy. Across production traffic, typical savings landed in the 20 to 40% range.

But the feature only does its job if you tell the model what you actually want. That's a prompting decision.

The prompt that keeps payloads out of context

This is the instruction that turns the behavior on. Drop it into your system prompt for any agent that touches large tool outputs.

The Prompt:

You have access to a code execution environment and a set of tools you
can call from within it.

When a tool returns a large payload (a document, a query result set, a
file, an API response with many records), do NOT return that payload to
me directly. Instead:

1. Call the tool from inside the execution environment.
2. Transform and filter the result in code. Extract only what the task
   needs: specific fields, an aggregate, a count, the matching rows.
3. If a large result must survive for a later step, write it to a file
   and keep the path. Do not paste it back into our conversation.
4. Return to me ONLY one of:
   (a) a 3-line summary of what you found,
   (b) the specific fields or values I asked for, or
   (c) a file path / handle I can reference in a later instruction.

Never echo a raw payload into context to "show your work." If I need
to inspect the raw data, I will ask for the path and read it myself.

Why This Works: It moves the data-handling burden off the model and into the runtime, where filtering rows or pulling fields costs nothing in tokens. By forcing the return value to be a summary, a field set, or a handle, you cap how much any single tool result can pollute context, so the model's reasoning stays in the part of the window it reads most reliably.
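Steps 2 through 4 of the prompt boil down to something like the sketch below: persist the full payload, return only the requested fields plus a path. The function name, file name, and sample transcript are all made up for illustration; the code your model writes inside the sandbox will look different.

```python
import json
import tempfile
from pathlib import Path

def reduce_to_handle(payload: dict, wanted: list[str], workdir: Path) -> dict:
    """Write the full payload to disk; return only the wanted fields and a path."""
    workdir.mkdir(parents=True, exist_ok=True)
    path = workdir / "payload.json"
    path.write_text(json.dumps(payload))
    return {
        "fields": {key: payload[key] for key in wanted if key in payload},
        "path": str(path),
    }

# Hypothetical transcript with a large body the model never needs to read.
transcript = {
    "title": "Q3 sales call",
    "deal_value": 82000,
    "body": "word " * 40_000,
}

result = reduce_to_handle(
    transcript, ["title", "deal_value"], Path(tempfile.mkdtemp())
)
print(result["fields"])  # {'title': 'Q3 sales call', 'deal_value': 82000}
print(result["path"])    # a temp path ending in payload.json
```

The return value stays the same size no matter how long the transcript is, which is the cap the prompt is asking for.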

Expected Output:

I fetched the Q3 sales transcript from Drive (id: 1xKp..., ~48k tokens) into the execution environment and extracted the action items, the stated deal value, and the close date. I wrote the structured summary to ./work/q3-deal-summary.json and pushed those three fields into the Salesforce record 0061x....

Summary: 4 action items captured, deal value $82,000, close date pushed to Q4. The full transcript stays in Drive; I did not load it into our conversation. Want the parsed JSON path or the Salesforce record link?

Notice what didn't happen. The transcript never showed up in the reply. The model handled 48,000 tokens of text and handed you back five lines.

Wiring it up

On the request side, a tool opts into this flow by adding "allowed_callers": ["code_execution_20260120"] to its definition, and you include the code_execution tool in the request. After that, Claude calls the tool from inside the sandbox and decides in code what comes back, instead of round-tripping every result through context.
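As a rough illustration, the tool list might be built like this. The `allowed_callers` field and the version string come from the paragraph above; the `fetch_transcript` tool, its description, and its schema are hypothetical, and the shape of the code execution entry is an assumption to verify against the current API documentation before you rely on it.

```python
# Version string as quoted in the paragraph above.
CODE_EXECUTION_VERSION = "code_execution_20260120"

# Hypothetical large-output tool. Name, description, and schema are
# illustrative; only the allowed_callers field is the point.
fetch_transcript_tool = {
    "name": "fetch_transcript",
    "description": "Fetch a meeting transcript by document id.",
    "input_schema": {
        "type": "object",
        "properties": {"doc_id": {"type": "string"}},
        "required": ["doc_id"],
    },
    # Opt this tool into being called from inside the sandbox.
    "allowed_callers": [CODE_EXECUTION_VERSION],
}

# Assumed shape of the code execution tool entry; check the API docs.
code_execution_tool = {"type": CODE_EXECUTION_VERSION, "name": "code_execution"}

# Both entries go into the request's tool list.
tools = [code_execution_tool, fetch_transcript_tool]
```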

You don't need to convert every tool. Convert the ones that return big things: document fetchers, database queries, list endpoints, anything that can come back with hundreds of records.

When this is the wrong move

Be honest about where the pattern doesn't pay.

On benchmarks with just one or two sequential tool calls per turn, programmatic tool calling left accuracy flat and cost about 8% more. The overhead of spinning up code execution isn't free, and on a simple single-call flow there's nothing to filter. A small payload going through context once is fine. Leave it alone.

The pattern earns its keep on multi-hop chains, fan-out work (the 20-employee lookup), and any step that returns a large payload. That's the test: more than one hop, or a payload bigger than the answer.
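That test is simple enough to write down. A sketch, with a hypothetical function name and token counts you would have to estimate yourself:

```python
def belongs_in_code(hops: int, payload_tokens: int, answer_tokens: int) -> bool:
    """Return True when a step should be handled in the execution environment:
    more than one hop, or a payload bigger than the answer."""
    return hops > 1 or payload_tokens > answer_tokens

print(belongs_in_code(hops=1, payload_tokens=200, answer_tokens=500))     # False
print(belongs_in_code(hops=10, payload_tokens=200, answer_tokens=500))    # True
print(belongs_in_code(hops=1, payload_tokens=48_000, answer_tokens=150))  # True
```

Treat it as a first-pass filter, not a guarantee. The roughly 8% overhead on simple flows is the reason the first case comes back False.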

And if you don't have an engineering team to wire up sandboxes yet, the cheapest fix is discipline, not a feature. Stop adding "just one more" MCP server. Write the instruction above into your prompts so the model defaults to returning summaries and handles. Most of the win is the habit, not the infrastructure.

The one rule to keep

Watch the ratio. If a tool result coming back to your model is bigger than the reasoning the model does with it, that result belongs in code, not in context. Reduce it where it lands, return a handle, and let the model think in the small window it reads best.

If your team is wiring up agentic workflows and watching token bills climb faster than the work justifies, we run live, hands-on prompt engineering training on exactly this kind of pattern. Connect with Kief Studio on Discord or schedule a session.
