speculative planning May 28, 2026 • 8 min read

Speculative Planning Prompts: How to Make Your Agents Pre-Plan During Tool Call Idle Time

The IdleSpec technique turns dead wait time into planning cycles that cut agent task completion time by roughly 40%

Your agent spends most of its time doing nothing. A five-step workflow with 2-second tool calls burns 10+ seconds of wall-clock time even if the LLM responds instantly. Model inference accounts for roughly 40% of total latency in production agent systems. The rest is dead air -- waiting on file reads, API responses, and network round-trips.

What if your agent planned its next move while it waited?

That's the core idea behind speculative planning. Three independent research efforts published between 2025 and May 2026 converged on the same insight: agents exhibit predictable tool-call sequences, and you can exploit that predictability to overlap thinking with waiting. The prompt patterns that make this work are writable today across Claude, GPT, and Gemini agent loops.

How Speculative Planning Works

The concept borrows directly from CPU branch prediction. Your processor doesn't wait for a conditional to resolve before starting the next instruction. It guesses the most likely branch, starts executing, and throws away the work if the guess was wrong. The penalty for a miss is small. The reward for a hit is that execution never stalls.

Agent workflows have the same property. File reads almost always succeed. Standard API calls usually return 200. After a search, the agent almost always reads the top result. These patterns recur so predictably that a fast, cheap model can guess the next action with 60-80% accuracy while the current tool call is still in flight.
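To make the predictability claim concrete, here is a minimal sketch that swaps the "fast, cheap model" for something even simpler: a frequency table of which tool followed which in past runs. This is not the method used in the research cited here, only an illustration of how regular the sequences can be. The class name, the tool names, and the three sample traces are all hypothetical.

```python
from collections import Counter, defaultdict


class NextToolPredictor:
    """Counts which tool tends to follow which, then predicts the most common successor."""

    def __init__(self):
        # Maps a tool name to a Counter of the tools observed right after it.
        self.successors = defaultdict(Counter)

    def observe(self, trace):
        # Record every adjacent (current, next) pair from one completed run.
        for current, nxt in zip(trace, trace[1:]):
            self.successors[current][nxt] += 1

    def predict(self, current):
        # Return (tool, empirical probability), or None for a tool never seen before.
        counts = self.successors.get(current)
        if not counts:
            return None
        tool, hits = counts.most_common(1)[0]
        return tool, hits / sum(counts.values())


# Hypothetical traces from three earlier runs of the same agent.
predictor = NextToolPredictor()
predictor.observe(["search_files", "read_file", "edit_file"])
predictor.observe(["search_files", "read_file", "read_file"])
predictor.observe(["search_files", "list_dir", "read_file"])

print(predictor.predict("search_files"))
```

If a table this crude already predicts your agent's next tool most of the time, the workflow is probably a good candidate for speculation.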

Microsoft Research's PASTE framework (Pattern-Aware Speculative Tool Execution, March 2026) demonstrated a 48.5% reduction in average task completion time using this approach. It requires only 1-3 idle CPU cores and 250MB of additional memory. The overhead for pattern prediction and scheduling runs under 100ms.

A separate line of work from ICLR 2025 -- Interactive Speculative Planning -- achieved up to 42.3% latency reduction using a dual-agent system. A fast, cheap "approximation agent" generates candidate next steps while the expensive "target agent" verifies. When they agree (which happens most of the time), execution continues at the fast agent's speed. When they disagree, you fall back to normal speed. No correctness penalty. No training required. It works with any model combination.

The Dual-Agent Prompt Pattern

This is the most practical pattern you can implement today. You run two models in parallel: a small model that speculates, and your main model that verifies.

Here's the system prompt for the speculation agent:

The Prompt (Speculation Agent System Prompt):

You are a planning-only agent. You do NOT execute actions. Your job is to predict the most likely next step in an agent workflow based on the current state and the tool call that is currently in progress.

Given:
- The agent's goal
- The steps completed so far
- The tool call currently executing

Output ONLY:
1. Your predicted next action (tool name + arguments)
2. Your confidence level (high/medium/low)
3. A one-line rationale

Rules:
- Predict the single most likely next step, not multiple options
- If the current tool call could fail, predict the success-path action (failures are handled by the main agent)
- Base predictions on common agent patterns: read-after-search, parse-after-fetch, write-after-read
- Never predict actions that would modify external state (writes, posts, deletes) unless the prior steps make it nearly certain

Current goal: {goal}
Steps completed: {steps_completed}
Tool in progress: {current_tool_call}

Why This Works: The prompt constrains the speculation agent to single-step prediction with explicit confidence scoring. The "never predict state-modifying actions" rule prevents speculative side effects. By focusing on the success path, it avoids wasting cycles on error-handling branches that rarely fire.
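The prompt rule alone does not stop a bad prediction from running, so it may be worth enforcing the same boundary in the orchestrator. The sketch below is one way to do that, under assumptions: the tool names in the allow-list and the "medium" default threshold are hypothetical and should be replaced with your own.

```python
# Hypothetical allow-list: only tools that never modify external state.
READ_ONLY_TOOLS = frozenset({"read_file", "search_files", "list_dir", "http_get"})

# Ordering for the three confidence labels the speculation prompt asks for.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def may_pre_execute(tool_name, confidence, minimum="medium"):
    """Return True only for read-only tools predicted with enough confidence."""
    # Unknown confidence labels rank below "low" and are rejected.
    rank = CONFIDENCE_RANK.get(confidence, -1)
    return tool_name in READ_ONLY_TOOLS and rank >= CONFIDENCE_RANK[minimum]


print(may_pre_execute("read_file", "high"))
print(may_pre_execute("write_file", "high"))
```

An allow-list fails closed: a tool you forget to classify is never executed speculatively.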

Expected Output:

Predicted next action: read_file(path="/src/config/database.ts", lines="1-50")
Confidence: high
Rationale: After a grep search that returned a file path, agents read the matched file 87% of the time.
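Your orchestrator needs that output as data before it can compare it with anything. A minimal parsing sketch follows, assuming the model sticks to the three labeled fields and writes arguments as key="value" pairs. The `Prediction` class and `parse_prediction` function are hypothetical names, and the regex accepts the fields on one line or on separate lines.

```python
import re
from dataclasses import dataclass


@dataclass
class Prediction:
    tool: str
    args: dict
    confidence: str
    rationale: str


# Matches the three labeled fields whether they share a line or not.
_PATTERN = re.compile(
    r"Predicted next action:\s*(?P<tool>\w+)\((?P<args>.*?)\)\s*"
    r"Confidence:\s*(?P<confidence>high|medium|low)\s*"
    r"Rationale:\s*(?P<rationale>.+)",
    re.DOTALL,
)
# Assumes arguments are written as key="value" pairs.
_ARG = re.compile(r'(\w+)="([^"]*)"')


def parse_prediction(text):
    """Return a Prediction, or None if the output does not follow the format."""
    m = _PATTERN.search(text)
    if m is None:
        return None
    return Prediction(
        tool=m["tool"],
        args=dict(_ARG.findall(m["args"])),
        confidence=m["confidence"],
        rationale=m["rationale"].strip(),
    )


sample = (
    'Predicted next action: read_file(path="/src/config/database.ts", lines="1-50") '
    "Confidence: high "
    "Rationale: After a grep search that returned a file path, agents read the matched file."
)
print(parse_prediction(sample))
```

Returning None on malformed output lets the orchestrator treat a badly formatted prediction as a miss and carry on.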

The orchestrator logic is straightforward. While your main agent's tool call executes, you fire the speculation prompt to a cheap model (Claude Haiku, GPT-4o-mini) and, if the predicted action is read-only, start executing it right away. If the prediction matches what the main agent actually requests next, you've already got the result ready. If it doesn't match, you discard the speculative result and proceed normally.
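The overlap can be sketched with asyncio. This is an illustration under assumptions, not a reference implementation: `run_tool`, `speculate`, and `decide_next` are hypothetical stand-ins for your tool layer, the cheap model, and the main model, and the sleeps simulate latency. A real version should only prefetch read-only calls.

```python
import asyncio


async def run_tool(name, args):
    # Hypothetical stand-in for a real tool call; the sleep simulates I/O latency.
    await asyncio.sleep(0.05)
    return f"{name}:{sorted(args.items())}"


async def speculate(current_call):
    # Hypothetical stand-in for the cheap model's prediction of the next call.
    await asyncio.sleep(0.01)
    return ("read_file", {"path": "a.txt"})


async def step_with_speculation(current_call, decide_next):
    """Run one tool call and overlap it with a speculative prefetch.

    Returns (current_result, next_call, next_result, was_hit).
    """
    # Start the real tool call without waiting for it.
    tool_task = asyncio.create_task(run_tool(*current_call))
    # While it is in flight, get a prediction and start executing it.
    guess = await speculate(current_call)
    prefetch = asyncio.create_task(run_tool(*guess))
    result = await tool_task
    # The main model still makes the real decision from the real result.
    actual = await decide_next(result)
    if actual == guess:
        return result, actual, await prefetch, True
    # Miss: throw away the speculative work and execute normally.
    prefetch.cancel()
    return result, actual, await run_tool(*actual), False


async def main_model_agrees(result):
    # Hypothetical main-model decision that happens to match the guess.
    return ("read_file", {"path": "a.txt"})


print(asyncio.run(step_with_speculation(("search_files", {"query": "pool"}), main_model_agrees)))
```

On a hit, the second call has been running since shortly after the first one started, so most of its latency is hidden. On a miss, the only loss is the discarded prefetch.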

The Inline Speculative Planning Pattern

You don't always need two models. You can prompt a single agent to include speculative plans in its own output, so the planning happens during the same inference call that triggers the tool use.

The Prompt (Single-Agent System Addition):

After each tool call you issue, append a <speculative_plan> block in this format:

<speculative_plan>
  <if_success>
    Next action: [tool_name(args)] 
    Rationale: [why this is the most likely next step]
  </if_success>
  <if_failure>
    Fallback: [brief recovery strategy]
  </if_failure>
</speculative_plan>

Your orchestrator may pre-execute the success-path action while waiting for the current tool result. This does not affect your decision-making -- you still evaluate the actual result and choose your real next step. The speculative plan is a hint, not a commitment.

Why This Works: By asking the agent to plan in the same generation that produces the tool call, you get the speculation "for free" in terms of latency. The structured XML format makes it easy for your orchestrator to parse and act on. The explicit note that "the speculative plan is a hint, not a commitment" prevents the agent from anchoring on its own speculation when the actual result differs.

Expected Output:

search_files(query="database connection pool", path="/src/")

<speculative_plan>
  <if_success>
    Next action: read_file(path="[first match from search results]")
    Rationale: Standard pattern -- search then read the most relevant match
  </if_success>
  <if_failure>
    Fallback: Broaden search to include "db" and "pool" as separate terms
  </if_failure>
</speculative_plan>
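Pulling the plan out of the model's output can be done with the standard library. The sketch below assumes the block is well-formed XML and that the model keeps the "Next action", "Rationale", and "Fallback" labels. `extract_speculative_plan` is a hypothetical helper name. Arguments that contain characters such as < or & will fail to parse, and the function treats that as "no plan".

```python
import re
import xml.etree.ElementTree as ET

_OPEN, _CLOSE = "<speculative_plan>", "</speculative_plan>"
_SUCCESS = re.compile(r"Next action:\s*(.+?)\s*Rationale:\s*(.+)", re.DOTALL)
_FAILURE = re.compile(r"Fallback:\s*(.+)", re.DOTALL)


def extract_speculative_plan(model_output):
    """Return the plan as a dict, or None if no usable block is present."""
    start = model_output.find(_OPEN)
    end = model_output.find(_CLOSE)
    if start == -1 or end == -1:
        return None
    try:
        root = ET.fromstring(model_output[start:end + len(_CLOSE)])
    except ET.ParseError:
        # Malformed XML is treated the same as a missing plan.
        return None
    success = _SUCCESS.search((root.findtext("if_success") or "").strip())
    if success is None:
        return None
    failure = _FAILURE.search((root.findtext("if_failure") or "").strip())
    return {
        "next_action": success.group(1),
        "rationale": success.group(2),
        "fallback": failure.group(1) if failure else None,
    }


sample = """search_files(query="database connection pool", path="/src/")
<speculative_plan>
  <if_success>
    Next action: read_file(path="[first match from search results]")
    Rationale: Standard pattern -- search then read the most relevant match
  </if_success>
  <if_failure>
    Fallback: Broaden search to include "db" and "pool" as separate terms
  </if_failure>
</speculative_plan>"""
print(extract_speculative_plan(sample))
```

Note that the predicted argument here is a placeholder ("[first match from search results]"), so the orchestrator can only pre-execute once the search result arrives and the placeholder is resolved.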

When Speculation Costs More Than It Saves

Here's the part most write-ups skip: speculative planning can increase your bill.

Agents already burn 5-50x more tokens than chat interactions. A single agentic coding task can consume 1-3.5 million tokens including retries. Adding speculative inference on top of that is only cost-effective when two conditions hold:

First, your speculation model must be cheap. Running Opus as your speculation agent defeats the purpose. The math works when speculation runs on a Haiku-class model (Claude 3 Haiku launched at $0.25 per million input tokens) or GPT-4o-mini while your main agent runs on a reasoning model. A speculation call that costs 1/20th of your main model's inference is a rounding error. A speculation call at the same price tier is a 30-60% cost increase with no accuracy improvement.

Second, your hit rate must stay above 50%. Research shows 60-80% accuracy on predictable workflows (file operations, standard API patterns). But novel tool combinations, complex branching logic, or workflows that depend heavily on intermediate results drop the hit rate below the break-even point. If you're speculating on unpredictable steps, you're paying for inference you'll throw away.
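A back-of-the-envelope model helps when checking both conditions against your own numbers. The sketch below is a simplification under stated assumptions: each hit hides at most one tool call's latency, the first step cannot be prefetched, and the main-model price of 5.0 is a hypothetical figure chosen only to reproduce the 1/20 price ratio. It does not derive the break-even hit rate. That depends on how you value tokens against seconds.

```python
def relative_cost_increase(spec_price, main_price, spec_tokens, main_tokens):
    """Extra spend from speculation as a fraction of the main model's spend."""
    return (spec_price * spec_tokens) / (main_price * main_tokens)


def expected_seconds_saved(hit_rate, tool_latency_s, steps):
    """Upper-bound estimate: each hit hides at most one tool call's latency.

    The first step has nothing before it to overlap with, so only
    steps - 1 calls can be prefetched.
    """
    return hit_rate * tool_latency_s * max(steps - 1, 0)


# Hypothetical prices per million tokens: 0.25 for the fast model, 5.0 for the main one.
print(relative_cost_increase(0.25, 5.0, spec_tokens=1000, main_tokens=1000))

# Five steps, 2-second tool calls, 70% hit rate.
print(expected_seconds_saved(0.7, 2.0, steps=5))
```

Plugging in a same-tier speculation model or a low hit rate shows quickly how the overhead grows while the time saved shrinks.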

The honest recommendation: most teams should fix two things before they touch speculation.

Reduce unnecessary tool calls. An agent making twelve API calls to answer a question that needed two is a more common and more fixable problem. Then parallelize independent tool calls. If your agent reads three files sequentially when it could read them concurrently, you're leaving up to a 3x speedup on those reads on the table, at zero additional cost.
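The parallelization fix is usually a few lines. Here is a minimal asyncio sketch, assuming the reads are independent and I/O-bound. `read_file` is a hypothetical stand-in that sleeps instead of touching the disk, and the file paths are made up.

```python
import asyncio


async def read_file(path, latency_s=0.05):
    # Hypothetical stand-in for an I/O-bound tool call.
    await asyncio.sleep(latency_s)
    return f"contents of {path}"


async def read_sequential(paths):
    # Each read waits for the previous one: total time is the sum of latencies.
    return [await read_file(p) for p in paths]


async def read_concurrent(paths):
    # All reads are in flight at once: total time is roughly the slowest read.
    return await asyncio.gather(*(read_file(p) for p in paths))


paths = ["a.py", "b.py", "c.py"]
print(asyncio.run(read_concurrent(paths)))
```

`asyncio.gather` returns results in the order the awaitables were passed, so the agent sees the same list either way.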

Speculation is the third optimization, not the first.

A Complete Orchestrator Prompt With Speculation

Here's a full system prompt for an agent orchestrator that includes speculative planning, model routing, and the guardrails that keep costs under control.

The Prompt:

You are an orchestrator managing a tool-calling agent. Your objective: minimize wall-clock time for task completion while maintaining correctness.

## Execution Rules

1. After dispatching any tool call, immediately generate a speculative_plan for the most likely next action.
2. Route speculative planning to the fast model (model_id: "fast"). Route verification and complex reasoning to the primary model (model_id: "primary").
3. Only speculate on read-path actions (searches, file reads, data fetches). Never speculatively execute write-path actions (file writes, API posts, database mutations).
4. Track your speculation hit rate. If it drops below 50% over the last 10 predictions, disable speculation for the remainder of this task and log the reason.
5. When a tool call returns, compare the next action the primary model actually requests against your speculative plan. If the speculation was correct, use the pre-fetched result. If not, discard it and proceed normally.

## Speculation Format

After each tool call, output:
SPEC: {tool_name}({args}) | confidence: {high|medium|low} | reason: {one line}

## Cost Guardrails

- Maximum speculative calls per task: 20
- Skip speculation if the current step involves: user input, external API with rate limits, or any action requiring authentication refresh
- If total token usage exceeds {budget_limit}, disable speculation and continue in sequential mode

Why This Works: The prompt establishes a clear decision boundary between read-path speculation (safe) and write-path speculation (forbidden). The automatic hit-rate monitoring prevents runaway costs on unpredictable tasks. The cost guardrails cap worst-case spending. And routing speculation to the fast model keeps per-call costs negligible.
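Hit-rate tracking and spending caps are easier to trust when they live in code rather than in the prompt. The sketch below mirrors the numbers in the prompt above (window of 10, 50% threshold, 20 calls). `SpeculationGuard` is a hypothetical class, and waiting for the window to fill before judging the hit rate is a design assumption, not something the prompt specifies.

```python
from collections import deque


class SpeculationGuard:
    """Disables speculation for the rest of a task once a guardrail trips."""

    def __init__(self, window=10, min_hit_rate=0.5, max_calls=20, token_budget=None):
        self.recent = deque(maxlen=window)  # rolling record of hits and misses
        self.min_hit_rate = min_hit_rate
        self.max_calls = max_calls
        self.token_budget = token_budget
        self.calls = 0
        self.disabled_reason = None

    def allowed(self):
        return self.disabled_reason is None

    def record(self, hit, total_tokens=0):
        # Once disabled, stay disabled and keep the original reason.
        if not self.allowed():
            return
        self.calls += 1
        self.recent.append(bool(hit))
        # Assumption: only judge the hit rate once the window is full.
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and sum(self.recent) / len(self.recent) < self.min_hit_rate:
            self.disabled_reason = "hit rate below threshold"
        elif self.calls >= self.max_calls:
            self.disabled_reason = "speculative call cap reached"
        elif self.token_budget is not None and total_tokens > self.token_budget:
            self.disabled_reason = "token budget exceeded"


guard = SpeculationGuard()
for hit in [True, True, True, True, False, False, False, False, False, False]:
    guard.record(hit)
print(guard.allowed(), guard.disabled_reason)
```

Check `guard.allowed()` before each speculative call and log `guard.disabled_reason` when it flips, which covers the "log the reason" requirement in rule 4.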

What This Looks Like in Practice

May 2026 research on Speculative Interaction Agents showed 1.6-2.2x speedups even on small edge-scale models (3B parameters), which suggests the technique is not limited to large models. A coding agent that reads, searches, and edits files has highly predictable sequences. A customer support agent that looks up orders, checks shipping status, and drafts responses follows repeatable patterns. A research agent that searches, reads results, and synthesizes findings does the same three-step loop dozens of times.

The patterns are already in your workflows. You just need to tell your agent to predict them.

Start with one workflow. Identify the two or three most common tool-call sequences. Add the inline speculation prompt to your system message. Route the speculation through the cheapest model your provider offers. Measure your hit rate over 50 runs. If it's above 60%, you'll see meaningful latency reduction. If it's below 50%, focus on reducing unnecessary tool calls first.

The technique is model-agnostic. It works with Claude, GPT, Gemini, and open-source models running locally. The only requirement is an orchestrator layer that can dispatch tool calls and compare the agent's actual next action against the prediction.

Your agents are already thinking. Make them think ahead.

If your team wants to build faster agent workflows with speculative planning and other advanced prompting techniques, connect with Kief Studio on Discord or schedule a session.
