reasoning tokens efficiency June 1, 2026 • 6 min read

Strip the 'Wait' Tokens: Two Papers Prove Your Reasoning Model Is Wasting Half Its Budget Thinking Out Loud

The NoWait technique cuts chain-of-thought length 27-51% with zero accuracy loss -- here's how to apply it in Claude, GPT, and Gemini

Ask DeepSeek R1 what 2+3 equals. It will burn through 10,000 tokens before answering "5."

That's not a joke. It's a documented finding from the ThoughtTerminator benchmark (arXiv 2504.13367), and it illustrates a problem that two independent research papers took on in June 2025. Reasoning models overthink. They waste tokens on self-doubt loops, redundant re-derivations, and hesitation words that add nothing. And you're paying for every one of those tokens.

Two new papers prove it's worse than we thought -- and show how to fix it without retraining anything.

The NoWait discovery

"Wait, We Don't Need to 'Wait'!" (arXiv 2506.08343) introduces a dead-simple technique: suppress self-reflection tokens like "Wait," "Hmm," and "Alternatively" by pushing their logits to large negative values during decoding, so the sampler can never pick them. No fine-tuning. No model changes. Just make those words impossible to emit.

The results across five R1-style model series on ten benchmarks: chain-of-thought length dropped 27-51%. Accuracy didn't move.

Read that again. Between a quarter and half of the reasoning tokens were doing nothing. The model was re-proving algebra it already solved, second-guessing conclusions it already reached, and generating filler words that triggered redundant deliberation loops. The paper describes standard reasoning as a process that "pauses to highlight every minor thought, making the logic scattered and less efficient." NoWait produces the same correct reasoning paths, just without the scenic detour.
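A minimal sketch of the suppression step, not the paper's implementation. The toy vocabulary, the logit values, and the names `BANNED_WORDS`, `suppress`, and `greedy_pick` are all assumptions made up for illustration. Note that suppression here means driving a logit toward negative infinity; a logit of 0 would still leave the token with nonzero probability after softmax.

```python
import math

# Toy vocabulary and next-token logits (made-up values for illustration).
VOCAB = ["Wait", "Hmm", "Alternatively", "So", "Therefore", "5"]
BANNED_WORDS = {"Wait", "Hmm", "Alternatively"}


def suppress(logits, vocab, banned):
    """Return a copy of logits with every banned token set to -inf."""
    return [
        -math.inf if token in banned else logit
        for token, logit in zip(vocab, logits)
    ]


def softmax(logits):
    """Convert logits to probabilities; exp(-inf) evaluates to 0.0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits, vocab):
    """Pick the highest-scoring token."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


raw_logits = [3.2, 1.1, 0.7, 2.9, 2.5, 1.8]  # "Wait" wins if left alone
masked = suppress(raw_logits, VOCAB, BANNED_WORDS)

print(greedy_pick(raw_logits, VOCAB))  # Wait
print(greedy_pick(masked, VOCAB))      # So
print(softmax(masked)[:3])             # [0.0, 0.0, 0.0]
```

In a real decoder the banned set would likely need every tokenizer variant of each word (leading space, lowercase, and so on), and the mask would be applied at every decoding step.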

When thinking tokens trap the model

The second paper, "Do Thinking Tokens Help or Trap?" (arXiv 2506.23840), asks a sharper question. On simple tasks, do thinking tokens actually make answers worse?

Yes. The authors identify what they call the "thinking trap": on straightforward problems, tokens like "wait" and "however" trigger high-level reasoning behaviors (reflection, backtracking, hypothesis revision) that aren't just wasteful but actively derail the model. A model that would have answered correctly in 200 tokens instead spends 2,000 tokens talking itself out of the right answer.

This lines up with a separate line of work. "When More Thinking Hurts" (arXiv 2604.10739) found that extended reasoning past roughly 7,000 tokens enters an "overthinking zone" where the model is more likely to abandon a correct answer than discover a new one. Easier problems cross that threshold sooner. Your simple classification task doesn't just cost more with extended thinking -- it gets less accurate.

The real cost of invisible reasoning

Here's where this gets expensive. A query to OpenAI's o3 that produces 500 visible output tokens may consume 2,000 to 5,000 reasoning tokens behind the scenes, all billed at the output rate. One team using Claude Opus with extended thinking for 40-character text snippets saw costs 30x higher than expected: 30 input tokens, 18 visible output tokens, but 1,200 thinking tokens per call (TokenMix).
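To make the arithmetic concrete, here is a rough cost sketch using the token counts from the example above. The per-million-token prices are hypothetical placeholders, so the multiplier it prints depends on those rates and will not necessarily match the 30x figure the team reported; `call_cost` is an illustrative helper, not part of any SDK.

```python
# Hypothetical prices in USD per million tokens; substitute your provider's rates.
INPUT_PRICE_PER_M = 15.00
OUTPUT_PRICE_PER_M = 75.00


def call_cost(input_tokens, visible_output_tokens, thinking_tokens=0):
    """Cost of one call, with thinking tokens billed at the output rate."""
    billed_output = visible_output_tokens + thinking_tokens
    return (
        input_tokens * INPUT_PRICE_PER_M
        + billed_output * OUTPUT_PRICE_PER_M
    ) / 1_000_000


expected = call_cost(30, 18)        # what the visible tokens suggest
actual = call_cost(30, 18, 1_200)   # what actually gets billed
multiplier = actual / expected

print(f"expected ${expected:.6f}, actual ${actual:.6f}, {multiplier:.0f}x")
```

With these placeholder rates the call costs about 51 times what the visible tokens alone would suggest. The gap grows as the visible output shrinks relative to the thinking tokens.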

Gemini has its own version of this problem. Send a request to Gemini 2.5 Pro with max_tokens: 10, and you might get back an empty response. All ten tokens got consumed by internal reasoning before a single output character was produced. The model thought its entire budget away.
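The failure mode is plain subtraction, assuming (as the paragraph above describes) that thinking tokens and visible tokens draw from one shared cap. The helper names below are hypothetical, and the 500-token thinking allowance is an arbitrary illustrative number.

```python
def visible_tokens_left(max_tokens, thinking_tokens_used):
    """Tokens remaining for visible output once thinking has drawn from the cap."""
    return max(0, max_tokens - thinking_tokens_used)


def output_cap(expected_visible_tokens, thinking_allowance):
    """Size the cap so it covers both the thinking and the visible answer."""
    return expected_visible_tokens + thinking_allowance


print(visible_tokens_left(10, 10))                    # 0
print(visible_tokens_left(output_cap(10, 500), 500))  # 10
```

The practical takeaway is to size the output cap for thinking plus answer, or to reduce the thinking budget when a short cap is a hard requirement.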

"Think step by step" is now counterproductive

This is the part that catches people off guard. "Think step by step" was the single most reliable prompting technique from 2023 through 2024. If you're using a reasoning model in 2025, it's actively working against you.

Reasoning models (o1, o3, o4-mini, Claude with extended thinking, Gemini with thinking enabled) already think internally. Adding "think step by step" triggers redundant deliberation on top of the built-in reasoning. You're asking a calculator to show its work, except the "shown work" can introduce errors the internal reasoning wouldn't have made.

On non-reasoning models (Claude Sonnet without thinking, GPT-4o, Gemini Flash without thinking), "think step by step" can still boost accuracy by 10-40%. The technique isn't dead. It just needs to match the model tier.

What you can do right now

If you have API access, every major provider now offers reasoning budget controls:

  • Claude: thinking.type: 'adaptive' with the effort parameter (Opus 4.7+), or budget_tokens on older models
  • OpenAI o-series: reasoning.effort set to low, medium, or high
  • Gemini: thinking_config.thinking_budget (set to 0 to disable thinking on models that allow it, such as 2.5 Flash; 2.5 Pro cannot turn thinking off)
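The three controls in the list above can be sketched as request fragments. This builds plain dictionaries rather than calling any SDK. The field names follow the bullets, but the exact nesting, the `reasoning_params` helper, and the token budgets are assumptions to check against each provider's current API reference.

```python
def reasoning_params(provider, level):
    """Build the reasoning-control fragment of a request body.

    Field names follow the list above; nesting and budget values are
    assumptions to verify against each provider's API reference.
    """
    if level not in ("low", "medium", "high"):
        raise ValueError(f"unknown level: {level}")

    # Illustrative token budgets, not provider recommendations.
    budgets = {"low": 1024, "medium": 4096, "high": 16384}

    if provider == "claude":
        # Explicit-budget style; newer models may take an effort setting instead.
        return {"thinking": {"type": "enabled", "budget_tokens": budgets[level]}}
    if provider == "openai":
        return {"reasoning": {"effort": level}}
    if provider == "gemini":
        return {"thinking_config": {"thinking_budget": budgets[level]}}
    raise ValueError(f"unknown provider: {provider}")


# "<model-name>" is a placeholder, not a real model identifier.
request = {"model": "<model-name>", **reasoning_params("openai", "low")}
print(request)
```

Keeping the mapping in one function means the rest of a codebase can talk in terms of low, medium, and high without caring which provider is behind the call.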

If you're working in ChatGPT, Claude.ai, or the Gemini web interface, you don't have those knobs. But you can control reasoning verbosity at the prompt level.

The Prompt (for simple tasks on reasoning models):

Answer directly without deliberation. Do not second-guess or reconsider.

[Your actual question here]

Why This Works: It explicitly suppresses the self-reflection behavior that triggers waste tokens. The model skips the "Wait, let me reconsider..." loops and moves straight to the answer. This mirrors what NoWait does at the logit level, but through instruction-following instead.

Expected Output:

The model produces a direct answer in 50-200 tokens instead of 1,000-5,000. Accuracy stays the same or improves on simple tasks, because the model doesn't talk itself out of the correct first answer.

For harder problems where you want some reasoning but not runaway deliberation:

The Prompt (for complex tasks):

Think through this carefully but be concise in your reasoning. State your conclusion once you reach it -- do not re-derive or second-guess.

[Your complex question here]

Why This Works: "Carefully but concise" keeps useful reasoning steps intact while discouraging the redundant re-derivation loops. "State your conclusion once you reach it" prevents the model from entering the overthinking zone where it abandons correct answers.

Expected Output:

A focused chain of reasoning that reaches its conclusion in 500-1,500 tokens instead of 5,000-10,000. The key logical steps are present, but the "Hmm, let me reconsider... actually wait..." filler is gone.
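If you send these prompts programmatically, the two templates above can be wrapped in a small helper. This is only a convenience sketch: `build_prompt` and the `difficulty` labels are hypothetical names, and the prefix strings are copied from the prompts in this article.

```python
SIMPLE_PREFIX = (
    "Answer directly without deliberation. Do not second-guess or reconsider."
)
COMPLEX_PREFIX = (
    "Think through this carefully but be concise in your reasoning. "
    "State your conclusion once you reach it -- do not re-derive or second-guess."
)


def build_prompt(question, difficulty="simple"):
    """Prepend the matching anti-overthinking instruction to a question."""
    prefixes = {"simple": SIMPLE_PREFIX, "complex": COMPLEX_PREFIX}
    if difficulty not in prefixes:
        raise ValueError(f"unknown difficulty: {difficulty}")
    # Blank line between instruction and question, as in the templates.
    return f"{prefixes[difficulty]}\n\n{question.strip()}"


print(build_prompt("What is the capital of France?"))
print()
print(build_prompt("Why does this recursive function overflow?", "complex"))
```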

And the batch prompting trick is surprisingly effective:

The Prompt (batch approach):

Answer all three questions below. Be direct and concise for each.

1. [Question A]
2. [Question B]  
3. [Question C]

Why This Works: Research on batch prompting (arXiv 2511.04108) shows that grouping multiple questions in one call reduces reasoning tokens by up to 76% with no accuracy loss. The model can't enter a self-doubt spiral on each individual question when it has three to handle.

Expected Output:

Three focused answers, each 100-300 tokens, totaling less than what a single question would have generated with full reasoning overhead.
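The batch template can be generated the same way. A minimal sketch, assuming a numbered plain-text layout like the prompt above; `build_batch_prompt` is a hypothetical helper, and it writes the question count as a digit rather than a word.

```python
def build_batch_prompt(questions):
    """Combine several questions into one numbered, concise-answer prompt."""
    if len(questions) < 2:
        raise ValueError("a batch needs at least two questions")
    header = (
        f"Answer all {len(questions)} questions below. "
        "Be direct and concise for each."
    )
    numbered = "\n".join(
        f"{index}. {question.strip()}"
        for index, question in enumerate(questions, start=1)
    )
    return f"{header}\n\n{numbered}"


print(build_batch_prompt([
    "Is 91 prime?",
    "What is the plural of 'index'?",
    "Which HTTP status code means 'Not Found'?",
]))
```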

The decision framework

Match the model to the task difficulty:

  • Simple extraction, classification, formatting: Use a non-reasoning model (Haiku, GPT-4o-mini, Gemini Flash). Or use a reasoning model with thinking disabled / effort set to low. Don't pay for cognition you don't need.
  • Multi-step analysis, coding, math: Use a reasoning model with medium effort. Add "be concise in your reasoning" to your prompt.
  • Novel research, complex debugging, hard math proofs: Use full reasoning. This is where thinking tokens earn their cost.
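The framework above can be expressed as a lookup table. This is a sketch under stated assumptions: the task labels, the `route` function, and the mapping of "full reasoning" to a "high" effort setting are illustrative choices, not something any provider prescribes.

```python
# Routing table for the three tiers in the list above.
ROUTES = {
    "simple": {
        "model_tier": "non-reasoning",
        "effort": None,
        "prompt_hint": None,
    },
    "multi_step": {
        "model_tier": "reasoning",
        "effort": "medium",
        "prompt_hint": "Be concise in your reasoning.",
    },
    "hard": {
        "model_tier": "reasoning",
        "effort": "high",  # assumption: "full reasoning" maps to high effort
        "prompt_hint": None,
    },
}

# Hypothetical task labels mapped onto the three tiers.
TASK_CATEGORIES = {
    "extraction": "simple",
    "classification": "simple",
    "formatting": "simple",
    "analysis": "multi_step",
    "coding": "multi_step",
    "math": "multi_step",
    "research": "hard",
    "debugging": "hard",
    "proof": "hard",
}


def route(task_type):
    """Return the model tier, effort level, and prompt hint for a task type."""
    category = TASK_CATEGORIES.get(task_type)
    if category is None:
        raise ValueError(f"unknown task type: {task_type}")
    return ROUTES[category]


print(route("classification"))
print(route("coding"))
```

A table like this is easy to audit and adjust once you have measured token usage per task type on your own workload.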

The research is clear: for most production workloads, you're spending 3-10x more than necessary on reasoning tokens. The fix is knowing when your task actually needs that reasoning and when the model is just talking to itself.

Where the research is headed

The NoWait paper handles inference-time suppression. "Do Thinking Tokens Help or Trap?" proposes DuP-PO, a training-time method that teaches models when to think and when not to. Other approaches are stacking up: ThinkLess (arXiv 2505.15684) terminates reasoning early, ALP (arXiv 2506.05256) applies adaptive length penalties during training, and a full survey paper in TMLR tracks dozens more. This is now a research subfield with over 20 papers in 12 months.

The direction is clear. Future reasoning models will calibrate their thinking depth automatically. Until then, it's on you to set the budget.

If your team wants hands-on training for prompt engineering techniques like reasoning budget control, connect with Kief Studio on Discord or schedule a session. We run workshops that turn research like this into production prompting patterns your team can use the same day.
