codeact prompting June 11, 2026 • 6 min read

Your Agent Should Write Python, Not Emit JSON: Prompting for CodeAct Now That It Ships in Microsoft Agent Framework

BUILD 2026 made code-as-action a first-class pattern. Here's how to rewrite your tool-use prompts from "pick a tool" to "write code that calls the tools" and cut tokens 30-50%.

For two years the default way to give an agent tools has been the same: list the tools, tell the model to think step by step, and have it emit one JSON payload per turn. Tool call, result, tool call, result. Every intermediate result flows back through the model's context before it decides the next move.

At BUILD 2026 on June 2, Microsoft shipped a different default in Agent Framework. It's called CodeAct, and Anthropic's Programmatic Tool Calling does the same thing. The model stops picking one tool per turn. Instead it writes a single sandboxed Python program that calls your tools directly, with loops, conditionals, and variables, runs it once, and hands back only the final answer.

Here's the part the vendor announcements bury: the difference is mostly in your prompt. Same model, same tools. You change the instruction from "select a tool and emit JSON" to "write code that calls the tools," and you get measurably cheaper, faster runs on the right kind of task. Let's look at the actual rewrite.

The before: JSON tool-calling

This is the prompt pattern you almost certainly have now.

The Prompt:

You are a helpful assistant with access to the following tools:
- search_contacts(query)
- get_deals(contact_id)
- get_invoices(deal_id)

Think step by step. On each turn, respond with exactly one tool
call as a JSON object: {"tool": "<name>", "args": {...}}.
Wait for the result, then decide your next tool call. When you
have enough information, give the final answer in plain text.

This works. It's also expensive when the task involves many calls or bulky data. Say you want total invoiced revenue for every contact at a company. The model calls search_contacts, gets 40 contacts back into its context, then loops one contact at a time: a get_deals turn, a get_invoices turn, for each. That's dozens of round trips, and every deal record and invoice line lands in context whether the model needs to reason about it or not.
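To put a number on "dozens of round trips," here is a rough turn count for the JSON pattern. It is a back-of-the-envelope sketch: the function name and the deals-per-contact fan-out are assumptions for illustration, and only the 40 contacts come from the example above.

```python
def json_turns(contacts: int, deals_per_contact: int) -> int:
    """Model turns when every tool call is its own turn, plus the final answer."""
    search = 1                                      # one search_contacts call
    deal_lookups = contacts                         # one get_deals call per contact
    invoice_lookups = contacts * deals_per_contact  # one get_invoices call per deal
    final_answer = 1                                # plain-text answer turn
    return search + deal_lookups + invoice_lookups + final_answer


# 40 contacts with one deal each: a get_deals turn and a get_invoices turn per contact.
print(json_turns(contacts=40, deals_per_contact=1))  # 82
# Hypothetical heavier fan-out: three deals per contact.
print(json_turns(contacts=40, deals_per_contact=3))  # 162
```

Every one of those turns also re-reads the growing conversation, so the token cost grows faster than the turn count.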

The original CodeAct research (Wang et al., ICML 2024) measured this. Writing actions as code instead of JSON used about 30% fewer steps on multi-tool tasks and achieved success rates up to 20 percentage points higher. Hugging Face replicated the 30% step reduction independently in smolagents. Fewer steps mean fewer LLM calls, which means lower cost.

The after: CodeAct

Same tools, different instruction.

The Prompt:

You have an execute_code tool. Inside it, these host functions
are callable directly:
- search_contacts(query) -> list[Contact]
- get_deals(contact_id) -> list[Deal]
- get_invoices(deal_id) -> list[Invoice]

Write ONE Python program that completes the task. Use loops,
conditionals, and variables to chain and filter results. Return
only the final consolidated answer. Do not pull raw intermediate
records into your reasoning. Process them in code and log only
what you need.

Why This Works: You collapse a multi-turn plan into one model turn. The model writes the loop once, the loop runs in the sandbox, and the 40 contacts and their invoices never touch the model's context. The agent sees only what the code explicitly returns.

Expected Output:

```python
total = 0
for contact in search_contacts("Acme Corp"):
    for deal in get_deals(contact.id):
        for inv in get_invoices(deal.id):
            total += inv.amount
print(f"Total invoiced revenue for Acme Corp: ${total:,.2f}")
```

Total invoiced revenue for Acme Corp: $284,500.00

One turn instead of dozens. Microsoft's framing for representative workloads: roughly 50% lower end-to-end latency and over 60% fewer tokens, because a five-step plan that used to be five model turns becomes one execute_code turn.
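To try the generated program locally, the three host functions need to exist somewhere. A minimal harness might look like the following, assuming in-memory stubs in place of a real CRM. The dataclass fields, the sample data, and the `total_invoiced` wrapper are all hypothetical, and this is not how Agent Framework wires tools into its sandbox.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    id: int


@dataclass
class Deal:
    id: int


@dataclass
class Invoice:
    amount: float


# Hypothetical sample data keyed by contact id and deal id.
_DEALS = {1: [Deal(10), Deal(11)], 2: [Deal(20)]}
_INVOICES = {10: [Invoice(1000.0), Invoice(250.5)], 11: [], 20: [Invoice(4000.0)]}


def search_contacts(query: str) -> list[Contact]:
    # Stub: ignores the query and returns two fixed contacts.
    return [Contact(1), Contact(2)]


def get_deals(contact_id: int) -> list[Deal]:
    return _DEALS.get(contact_id, [])


def get_invoices(deal_id: int) -> list[Invoice]:
    return _INVOICES.get(deal_id, [])


def total_invoiced(company: str) -> float:
    """Same nested loop the model is expected to write, wrapped for reuse."""
    total = 0.0
    for contact in search_contacts(company):
        for deal in get_deals(contact.id):
            for inv in get_invoices(deal.id):
                total += inv.amount
    return total


print(f"Total invoiced revenue for Acme Corp: ${total_invoiced('Acme Corp'):,.2f}")
# Total invoiced revenue for Acme Corp: $5,250.50
```

Only the final printed line would go back to the model. The contact, deal, and invoice objects stay inside the process that ran the loop.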

The bigger saving comes from avoiding the intermediate-data tax. Anthropic's example: copying a two-hour meeting transcript from Drive to Salesforce with traditional tool use routes the transcript through context twice, adding about 50,000 tokens. Keep it in the sandbox and that scenario drops from 150,000 tokens to 2,000. On their 75-tool benchmark, turning on Programmatic Tool Calling cut billed input tokens about 38% with no change in accuracy. Across production traffic with 10 to 49 tools, typical savings run 20 to 40%.

Treat the 98% numbers, like that 150,000-to-2,000 drop, as best-case single scenarios. Plan around 20 to 50% on genuinely multi-tool, data-heavy work.
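The arithmetic behind both figures is short enough to check. This is a small sketch: the helper names and the 100,000-token baseline are assumptions for illustration, not measurements.

```python
def pct_saved(before: int, after: int) -> float:
    """Percentage reduction going from `before` tokens to `after` tokens."""
    return 100 * (before - after) / before


def planning_range(baseline: int, low: float = 0.20, high: float = 0.50) -> tuple[int, int]:
    """Token budget after the best-case and worst-case planned savings."""
    return round(baseline * (1 - high)), round(baseline * (1 - low))


# The transcript scenario: 150,000 tokens down to 2,000.
print(round(pct_saved(150_000, 2_000), 1))  # 98.7

# Hypothetical workload of 100,000 tokens per run, planned at 20 to 50% savings.
print(planning_range(100_000))  # (50000, 80000)
```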

When you should NOT do this

This is the honest half, and it matters more than the hype half.

Code-as-action is not free. On simple workflows it costs more. Anthropic tested it on tasks where each turn makes one or two sequential calls and found scores unchanged and cost about 8% higher. Their own guidance: sequential single-call workflows do not benefit.

The break-even sits around 10-plus tools or large data payloads. If your agent makes three fixed API calls in a known order (look up a record, format it, send an email), stay on JSON tool-calling. Wrapping that in a code sandbox adds overhead and a Python traceback to debug, and buys you nothing. Don't migrate a workflow just because a framework now supports the fancier pattern.

Code-actions win on three shapes specifically: iteration (loop over an unknown number of items), branching (the next call depends on the last result), and aggregation (pull a lot, return a little). If your task isn't one of those, you're paying for a sandbox you don't need.
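Those criteria can be written down as a rough triage check. This is one possible reading, not vendor guidance: the `Workflow` fields, the function name, and the choice to require both a qualifying shape and enough scale are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    tool_count: int
    iterates: bool                # loops over an unknown number of items
    branches: bool                # next call depends on the last result
    aggregates: bool              # pulls a lot, returns a little
    large_payloads: bool = False  # bulky intermediate data per run


def prefer_code_actions(w: Workflow, tool_threshold: int = 10) -> bool:
    """True when the workflow has a qualifying shape and enough scale to pay off."""
    has_shape = w.iterates or w.branches or w.aggregates
    has_scale = w.tool_count >= tool_threshold or w.large_payloads
    return has_shape and has_scale


# Three fixed calls in a known order: stay on JSON tool-calling.
print(prefer_code_actions(Workflow(3, False, False, False)))  # False
# Twelve tools, looping and aggregating: a candidate for code actions.
print(prefer_code_actions(Workflow(12, True, False, True)))   # True
```

Treat the result as a prompt to measure, not a verdict. The threshold is a parameter because the break-even will differ between workloads.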

The guardrail prompts that actually matter

When the model writes code instead of picking from a fixed menu, you've handed it more freedom. Two guardrails carry most of the weight.

First, constrain what the code can touch.

The Prompt:

Use ONLY the host functions listed above. Do not import network
or filesystem modules. Do not invent functions. If the task can't
be done with the listed functions, say so instead of writing code.
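A prompt is a request, not an enforcement mechanism, so it can help to check the generated code on the host before running it. The sketch below uses Python's standard `ast` module. The allowlists and the `check_generated_code` name are assumptions, and this is not a description of what any framework does internally.

```python
import ast

ALLOWED_HOST_FUNCTIONS = {"search_contacts", "get_deals", "get_invoices"}
# Assumed set of harmless builtins; adjust to taste.
ALLOWED_BUILTINS = {"print", "len", "sum", "sorted", "range", "round", "min", "max"}


def check_generated_code(source: str) -> list[str]:
    """Return policy violations found in the source; an empty list means it passed."""
    allowed = ALLOWED_HOST_FUNCTIONS | ALLOWED_BUILTINS
    problems = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            problems.append(f"line {node.lineno}: imports are not allowed")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id not in allowed:
                problems.append(f"line {node.lineno}: unknown function {node.func.id!r}")
    return problems


good = (
    "total = 0\n"
    "for c in search_contacts('Acme Corp'):\n"
    "    total += len(get_deals(c.id))\n"
    "print(total)\n"
)
bad = "import os\nsend_payment(1)\n"

print(check_generated_code(good))  # []
for problem in check_generated_code(bad):
    print(problem)
# line 1: imports are not allowed
# line 2: unknown function 'send_payment'
```

This only inspects plain-name calls and import statements. It does not examine method calls, and it would flag helper functions the model defines for itself, so it complements sandbox isolation rather than replacing it.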

Second, and this is the one practitioners keep repeating: the sandbox protects the host, not your business. A micro-VM stops generated code from wrecking the machine. It does nothing to stop a legitimate-looking call from sending the wrong email or issuing a refund.

Microsoft is explicit that CodeAct approvals apply to the execute_code call as a whole, not to individual operations inside it, and that tools reached through the code path still execute in the host process. So keep destructive or irreversible actions as direct agent tools with their own approval gate. Do not make them callable from inside the code sandbox.

Practically: get_invoices is safe to expose to code. send_payment is not. Keep send_payment as a direct tool so it triggers its own human-in-the-loop check every time.
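One way to make that split hard to get wrong is to keep two registries and build the sandbox namespace from only one of them. The following is a minimal sketch with stub tools. The registry names, `call_gated`, and the approval callback are all hypothetical and not part of any framework's API.

```python
from typing import Callable


def get_invoices(deal_id: int) -> list[dict]:
    return [{"deal_id": deal_id, "amount": 100.0}]  # stub data


def send_payment(account: str, amount: float) -> str:
    return f"paid {amount:.2f} to {account}"        # stub for an irreversible action


# Read-only tools: these are the only names generated code can see.
SANDBOX_TOOLS: dict[str, Callable] = {"get_invoices": get_invoices}

# Irreversible tools: reachable only as direct tool calls, behind approval.
GATED_TOOLS: dict[str, Callable] = {"send_payment": send_payment}


def sandbox_namespace() -> dict[str, Callable]:
    """Names handed to generated code; gated tools are deliberately absent."""
    return dict(SANDBOX_TOOLS)


def call_gated(name: str, approve: Callable[[str, dict], bool], **kwargs):
    """Run a gated tool only when the approver (a human, in practice) agrees."""
    if name not in GATED_TOOLS:
        raise KeyError(f"{name} is not a gated tool")
    if not approve(name, kwargs):
        raise PermissionError(f"{name} was not approved")
    return GATED_TOOLS[name](**kwargs)


print("send_payment" in sandbox_namespace())  # False
print(call_gated("send_payment", lambda n, a: True, account="acme", amount=5.0))
# paid 5.00 to acme
```

Because approval happens per call in `call_gated`, each payment gets its own check instead of riding along inside one approved program.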

One more thing the migration forces on you. Because the model now writes code against your function signatures, sloppy tool definitions break it. Microsoft notes that names, parameter metadata, and return shapes matter more here than when the model just picks one tool at a time. The upside: moving to CodeAct makes you clean up your docstrings and type hints, and that cleanup helps the JSON path too.
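A quick way to find sloppy definitions before the model does is to lint your tool functions for missing metadata. The sketch below uses the standard `inspect` module. The `tool_metadata_gaps` helper and the two example tools are hypothetical.

```python
import inspect


def tool_metadata_gaps(fn) -> list[str]:
    """List what a code-writing model would be missing about this tool."""
    gaps = []
    if not inspect.getdoc(fn):
        gaps.append("missing docstring")
    sig = inspect.signature(fn)
    for name, param in sig.parameters.items():
        if param.annotation is inspect.Parameter.empty:
            gaps.append(f"parameter {name!r} has no type hint")
    if sig.return_annotation is inspect.Signature.empty:
        gaps.append("missing return type")
    return gaps


def get_deals(contact_id):  # sloppy: no docstring, no hints
    return []


def get_invoices(deal_id: int) -> list:
    """Return every invoice issued against one deal."""
    return []


print(tool_metadata_gaps(get_deals))
# ['missing docstring', "parameter 'contact_id' has no type hint", 'missing return type']
print(tool_metadata_gaps(get_invoices))  # []
```

Running a check like this over every exposed tool gives you a concrete cleanup list before the first CodeAct run.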

Where this leaves you

The pattern went from a 2024 research paper to a shipped, vendor-supported default across Microsoft, Anthropic, Cloudflare, and Hugging Face in about two years. Microsoft's HyperlightCodeActProvider now injects the runtime guidance and runs each program in a fresh micro-VM, so the system-prompt rewrite mostly ships for you. Your job is deciding which workflows earn it.

Audit your agents. Count the tools and look at the data volume per run. Anything iterating, branching, or aggregating over 10-plus tools is a candidate. Anything making a few fixed calls stays on JSON. Rewrite the prompt, keep the dangerous tools out of the sandbox, and measure your token bill before and after.

If your team is rewriting tool-use prompts and wants to get the CodeAct-versus-JSON call right the first time, we run live, hands-on prompt engineering training. Connect with Kief Studio on Discord or schedule a session.
