Thinking Budgets • April 20, 2026 • 6 min read

Tuning Thinking Budgets: The Prompt Technique That Replaced Chain-of-Thought in Claude 4.7 and GPT-5.4

Stop writing 'think step by step'. Start setting token budgets that match the task.

Every "think step by step" still sitting in your prompts now costs you twice: once in input tokens to send it, and once in degraded output, because reasoning-enabled models treat that phrase as an instruction to externalize a scratchpad on top of the hidden one they already run.

The replacement is not a better magic phrase. It is a number. Claude 4.7 takes a budget_tokens value. GPT-5.4 takes a reasoning.effort level. Setting that number correctly for the task is the prompt technique that matters in 2026.

Why CoT Phrasing Now Hurts

Chain-of-thought prompting was a 2022 hack for models that had no internal reasoning mode. You asked the model to reason out loud because that was the only way to reason at all.

Reasoning models already do this internally. Anthropic's extended-thinking docs tell you to give high-level guidance, not procedural steps. OpenAI's GPT-5 prompting guide says CoT phrasing is "redundant and can degrade output quality." Both vendors are saying the same thing: the model runs its own scratchpad, and telling it to run a second one in the visible answer produces longer, chattier, worse responses.

So the first fix is subtractive. Delete every instance of "think step by step," "let's reason through this," "walk through your logic," and "show your work" from your production prompts. That edit alone reduces output tokens and improves quality on most reasoning tasks.

Now we can talk about what to set instead.

The Four Budget Tiers

Real workloads fall into four tiers. Memorize these numbers.

  • 0 tokens (off). Classification, routing, JSON extraction, tagging, short rewrites.
  • 1k tokens. Short-form drafting, single-step tool calls, per-turn agent reasoning.
  • 8k tokens. Code review, medium refactors, structured analysis, multi-source summarization.
  • 32k tokens. Hard math, proofs, large architectural design, SWE-bench class tasks.

Above 32k, Anthropic's own eval curves flatten into noise. You are paying for reasoning you will not measure.
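If you want the tier table in code rather than prose, it reduces to a lookup. A minimal sketch; the task-category names and the helper function are illustrative, not part of either vendor's API:

```python
# Illustrative mapping of the four tiers to concrete budget values.
# Category names are this article's groupings, not an official taxonomy.
BUDGET_TIERS = {
    "classification": 0,       # routing, JSON extraction, tagging, short rewrites
    "agent_turn": 1_024,       # short-form drafting, single-step tool calls
    "code_review": 8_192,      # medium refactors, structured analysis
    "hard_reasoning": 32_000,  # proofs, large architectural design
}

def budget_for(task_type: str) -> int:
    """Return the thinking budget for a task type, defaulting to the lowest tier."""
    return BUDGET_TIERS.get(task_type, 0)
```

Defaulting unknown task types to zero matches the advice at the end of this article: start at the lowest tier and escalate only with eval evidence.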

Let's walk through each tier with a real prompt.

Tier 0: Classification With Reasoning Off

Classification is where budget discipline pays the most. The correct budget here is zero. On GPT-5.4 that means reasoning.effort: "minimal". On Claude 4.7 that means omitting the thinking block entirely.

The Prompt:

Classify this support ticket into one of: billing, bug, feature_request, auth, other.

Ticket:
"""
I was charged twice on April 3rd for the Pro plan. Can you refund one?
"""

Output JSON only: {"category": "..."}

Why This Works: The task is closed-set, short, and has an obvious answer. Reasoning adds latency without lift, and on some models it makes the output chattier, breaking the JSON contract. Clear schema plus no budget is faster and more reliable.

Expected Output:

{"category": "billing"}

If you run this same prompt with budget_tokens: 8000, you will pay for a few hundred reasoning tokens, wait longer, and sometimes get the model second-guessing itself into "other". That is not hypothetical. It is the failure mode engineers keep reporting on router-style calls.
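In code, "budget zero" on Claude means the request simply has no thinking block. A sketch of the payload builder, assuming the Anthropic Messages API shape and the model name used in this article; you would pass the dict to `client.messages.create(**payload)`:

```python
import json

def classification_request(ticket: str) -> dict:
    """Build a reasoning-off classification request (no `thinking` key at all)."""
    prompt = (
        "Classify this support ticket into one of: "
        "billing, bug, feature_request, auth, other.\n\n"
        f'Ticket:\n"""\n{ticket}\n"""\n\n'
        'Output JSON only: {"category": "..."}'
    )
    return {
        "model": "claude-4.7",  # model name as used in this article
        "max_tokens": 50,       # a short JSON object needs very little room
        "messages": [{"role": "user", "content": prompt}],
        # Note: no "thinking" key. Omitting it is how the budget becomes zero.
    }

def parse_category(raw: str) -> str:
    """Enforce the JSON contract; raise on chatty output instead of accepting it."""
    return json.loads(raw)["category"]
```

The strict parser is the other half of the discipline: if a nonzero budget ever sneaks in and the model gets chatty, the call fails loudly instead of silently routing tickets to the wrong queue.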

Tier 1: Small Budget for Agent Turns

Agentic loops are where people overspend the hardest. The old instinct is one giant plan-then-execute call with a huge budget. The current best practice is interleaved thinking with a small per-turn budget.

The Prompt (per tool-use turn):

You are a deployment agent. You have access to: read_file, run_tests, git_diff, post_comment.

Current step: decide the next single action to verify the PR in #4821 is safe to merge.

Constraints:
- One tool call per turn.
- Do not plan beyond the next action.
- If you have enough information to post a final review, do that instead.

Paired with thinking: {type: "enabled", budget_tokens: 1024} per turn.

Why This Works: Small, repeated reasoning budgets track what you actually need: one decision at a time. Anthropic's own tool-use guide reports roughly forty percent fewer total reasoning tokens versus a single high-budget planning call, with equal or better task completion.

Expected Output:

The agent issues a git_diff call, sees the change touches auth middleware, then on the next turn requests read_file for the auth test suite before making a merge recommendation. Each turn uses a few hundred of its 1k budget.
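The per-turn pattern looks like this as a loop. A sketch with the model call and tool execution stubbed out; the thinking dict matches the parameters quoted above, while `run_model` and `execute_tool` are placeholders for your client and tool layer:

```python
PER_TURN_THINKING = {"type": "enabled", "budget_tokens": 1024}

def agent_loop(run_model, execute_tool, task: str, max_turns: int = 10):
    """Interleaved thinking: one small budget per turn, not one giant plan."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # Every turn gets the same small budget; unused tokens are never billed.
        reply = run_model(messages=messages, thinking=PER_TURN_THINKING)
        if reply.get("tool_call") is None:
            return reply["content"]  # final review posted, loop ends
        result = execute_tool(reply["tool_call"])
        messages.append({"role": "assistant", "content": reply["content"]})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    raise RuntimeError("agent did not finish within max_turns")
```

The cap on turns is worth keeping even in production: a per-turn budget bounds spend per decision, and `max_turns` bounds spend per task.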

Tier 2: 8k for Code Review

This is the sweet spot developers keep landing on empirically. Enough budget to reason about multi-file context. Not so much that the model starts proposing rewrites you did not ask for.

The Prompt:

Review the following diff for correctness, concurrency safety, and test coverage.

Scope rules:
- Only flag issues in the changed lines or their direct call sites.
- Do not propose refactors outside the diff.
- If tests are missing for a new public function, say so and stop there.

Diff:
<<<
[paste unified diff here]
>>>

Output format:
- Summary (2 sentences)
- Blocking issues (numbered list, or "none")
- Non-blocking suggestions (numbered list, or "none")

Paired with budget_tokens: 8192 on Claude 4.7, or reasoning.effort: "medium" on GPT-5.4.

Why This Works: The scope rules are the real technique. Reasoning models at 16k+ tend to wander into architectural opinions. Capping the budget at 8k and bounding the scope in the prompt keeps the review tight. The structured output format gives the model a finish line, which also helps it stop thinking.

Expected Output:

Summary: The change adds retry logic to the webhook handler and new HMAC verification. Test coverage is partial.

Blocking issues:
1. verify_signature() compares HMAC values with == instead of hmac.compare_digest(), enabling a timing attack on line 47.
2. The retry loop does not bound total wall time, so a slow upstream can hold the worker indefinitely.

Non-blocking suggestions:
1. No test covers the 5xx retry path for process_webhook().
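The output format can also be enforced mechanically. A sketch that pairs the 8k budget with a check that the reply actually contains the three required sections in order; the checker is my own addition, not something either API provides:

```python
REVIEW_THINKING = {"type": "enabled", "budget_tokens": 8192}

REQUIRED_SECTIONS = ("Summary", "Blocking issues", "Non-blocking suggestions")

def review_request(diff: str, prompt_template: str) -> dict:
    """Attach the Tier 2 budget to a code-review call."""
    return {
        "model": "claude-4.7",  # model name as used in this article
        "max_tokens": 2048,
        "thinking": REVIEW_THINKING,
        "messages": [{
            "role": "user",
            "content": prompt_template.replace("[paste unified diff here]", diff),
        }],
    }

def follows_format(reply: str) -> bool:
    """True when all three section headers appear, in the prescribed order."""
    positions = [reply.find(section) for section in REQUIRED_SECTIONS]
    return all(p >= 0 for p in positions) and positions == sorted(positions)
```

A reply that fails `follows_format` is usually a sign the model wandered, which in practice correlates with budgets set too high for the task.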

Tier 3: 32k for Hard Reasoning

Use this tier sparingly. Math olympiad problems. Formal proofs. Large greenfield architecture where the model genuinely needs to explore alternatives. If you are using this budget for a quote-drafting bot, you have misconfigured something.

The Prompt:

Design a rate limiter for a multi-tenant API with these constraints:
- 50k tenants, skewed traffic (p99 tenant is 1000x the median).
- Per-tenant quotas are mutable at runtime.
- Single-region deployment, Redis available.
- Latency budget for the limiter: p99 under 2ms.

Produce: chosen algorithm, data structures in Redis, failure modes, and the one tradeoff you made that you are least sure about.

Paired with budget_tokens: 32000.

Why This Works: Open-ended design with real constraints is where long reasoning actually pays back. The "least sure tradeoff" line is load-bearing. It forces the model to surface uncertainty instead of presenting the first plausible answer as confident.

Expected Output:

A sliding-window counter using a Redis sorted set per tenant, with a Lua script for atomic check-and-increment. Covers the hot-tenant skew with a two-tier approach: in-process token bucket for the top 100 tenants, Redis for the rest. Flags the two-tier split as the weakest part of the design because the promotion/demotion logic adds operational complexity.
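GPT-5.4's reasoning.effort takes a level rather than a token count, so running both vendors requires a translation. A sketch of one plausible mapping from this article's tiers onto effort levels; the thresholds are my own guess, not a vendor-documented equivalence:

```python
def effort_for_budget(budget_tokens: int) -> str:
    """Map a Claude-style token budget onto an OpenAI-style effort level."""
    if budget_tokens == 0:
        return "minimal"
    if budget_tokens <= 1_024:
        return "low"
    if budget_tokens <= 8_192:
        return "medium"
    return "high"
```

Keeping the mapping in one function means a tier change in your config propagates to both providers instead of drifting apart per call site.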

Budgets Are Ceilings, Not Targets

One last thing that trips teams up: the budget is a cap on worst-case spend and latency, not a target. Claude routinely finishes well under it. Setting budget_tokens: 32000 does not mean every call uses 32k; it means no call exceeds 32k.

This is why "just crank it up" is the wrong instinct. The budget's real job is bounding the tail, not pushing the median.

Start at the lowest tier that plausibly fits the task. Move up only when you have an eval showing the higher tier wins on outcome, not on trace length.

Want to Get This Right Across Your Team?

If your engineers are still pasting "think step by step" into every prompt, you are overpaying and getting worse output for it. Kief Studio runs hands-on prompt engineering training covering budget tuning, agentic workflows, and eval harnesses for reasoning models. Connect with us on Discord or schedule a session.
