diffusion LLM prompting June 15, 2026 • 6 min read

Chain-of-Thought Assumes Left-to-Right. Mercury 2 Doesn't: How to Prompt a Diffusion LLM by Handing It a Skeleton, Not a Trail

Diffusion models like Inception's new Mercury 2 refine the whole answer in parallel, so scaffold the output shape instead of nudging a linear reasoning chain.

"Think step by step" works because the model writes one token at a time. Each token sees everything that came before it. By the time the model reaches its answer, it has a written trail of reasoning to lean on. That trail is the whole trick.

Mercury 2 throws that assumption out.

Inception Labs shipped Mercury 2, the first reasoning diffusion LLM, in late February 2026. It doesn't decode left to right. It starts with a fully masked draft of the entire answer and refines every position in parallel over a handful of adaptive denoising steps. Every token is conditioned on a bidirectional view of the draft, not just the tokens before it.

So there is no "early reasoning then late reasoning" chain. The model can't condition its conclusion on a reasoning trail it wrote first, because it's writing the conclusion and the reasoning in the same parallel pass. This is why freeform chain-of-thought partially breaks here, and why a different technique works better.

Why the trail stops working

When you prompt GPT-5.5 or Claude 4.8 with "reason through this carefully before answering," you're exploiting autoregression. The reasoning tokens become context for the answer tokens. The chain is causal.

A diffusion model denoises the reasoning and the answer together. There's no guarantee the reasoning gets committed before the answer forms. Ask it to "think out loud first," and you often get reasoning text that doesn't actually constrain the final output, because the final output wasn't downstream of it.
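To make the ordering point concrete, here is a toy sketch of when each position gets committed under the two decoding styles. It is an illustration only, not Mercury 2's actual schedule: the function names are hypothetical, and positions are unmasked at random here, whereas real diffusion models typically pick positions by model confidence.

```python
import random

def autoregressive_commit_order(n_positions):
    """Left-to-right decoding commits position i only after 0..i-1."""
    return list(range(n_positions))

def toy_diffusion_commit_order(n_positions, n_steps, seed=0):
    """Toy masked-diffusion schedule (an assumption for illustration).

    Each denoising step unmasks a batch of positions chosen without
    regard to left-to-right order. Returns one sorted batch per step,
    using at most n_steps steps.
    """
    rng = random.Random(seed)
    remaining = list(range(n_positions))
    rng.shuffle(remaining)
    per_step = -(-n_positions // n_steps)  # ceiling division
    return [sorted(remaining[i:i + per_step])
            for i in range(0, n_positions, per_step)]

def step_committed(batches, position):
    """Index of the denoising step at which a position was unmasked."""
    for step, batch in enumerate(batches):
        if position in batch:
            return step
    raise ValueError(f"position {position} was never unmasked")

if __name__ == "__main__":
    # Positions 0-5 stand in for "reasoning" tokens, 6-7 for "answer" tokens.
    batches = toy_diffusion_commit_order(n_positions=8, n_steps=4)
    for step, batch in enumerate(batches):
        print(f"step {step}: unmask positions {batch}")
    print("answer token 7 committed at step", step_committed(batches, 7))
```

In the autoregressive order, position 7 can only be written after positions 0 through 6 exist. In the toy diffusion schedule nothing enforces that, so an "answer" position may be committed in the same step as, or before, the "reasoning" positions that were supposed to justify it.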

There's hard data behind this. A 2026 study, "The Bitter Lesson of Diffusion Language Models for Agentic Workflows," found diffusion models decoding at more than 150 tok/s yet succeeding on fewer than 2% of embodied tasks, while comparable autoregressive models handled tool-calling far better. The authors pin it on the architecture: parallel decoding weakens causal dependency and produces fuzzy intermediate states, so the model struggles to commit to partial plans. The same property that kills left-to-right CoT also undermines step-by-step agentic commitment.

That sounds like bad news. It isn't, if you stop fighting the architecture.

Prompt the shape, not the path

Diffusion models are good at one thing autoregressive models do awkwardly: filling masked slots inside a predefined structure. The model sees the whole template at once, so it fills every slot with the rest of the structure in view. Give it a skeleton with labeled, empty slots and it denoises content into them.

Researchers formalized this as Template Infilling: generate the structural template, then fill the masked segments. On math and code benchmarks it beat the baseline by 17 points. The practical version for you is simpler. Don't ask for a reasoning journey. Hand the model the exact output structure you want and let it populate it.

Here's a prompt that underperforms on Mercury 2:

The Prompt (the one that struggles):

A customer wrote in angry that their order shipped late and arrived
damaged. Think step by step about the situation, work through what
went wrong, reason about the best response, and then write a reply.

Why This Falls Short: It asks for a linear reasoning trail. The diffusion model has no mechanism to make the reply depend on reasoning it wrote "earlier," because nothing is earlier. You get vague reasoning and a reply only loosely tied to it.

Now the same task, scaffolded:

The Prompt (the one that works):

A customer wrote in angry that their order shipped late and arrived
damaged. Fill in this exact structure. Keep every header. Replace the
bracketed placeholders with your content.

ISSUE_SUMMARY: [one sentence stating the two problems]
ROOT_CAUSE: [most likely cause of each problem, one line each]
WHAT_WE_OWE: [concrete remedies: refund, reship, or both]
REPLY_TO_CUSTOMER:
[3-4 sentence reply. Apologize once. State the remedy. No corporate filler.]
TONE_CHECK: [confirm the reply is not defensive, in 5 words or less]

Why This Works: The model sees every slot at once and fills them so they cohere. ROOT_CAUSE and WHAT_WE_OWE constrain REPLY_TO_CUSTOMER through the shared structure rather than through a causal chain, which is exactly what diffusion is built to do. You're using bidirectional conditioning on purpose.

Expected Output:

ISSUE_SUMMARY: The order arrived both late and physically damaged.
ROOT_CAUSE: Late shipment likely a carrier handoff delay; damage points to inadequate packaging for fragile contents.
WHAT_WE_OWE: Full refund of shipping plus a free replacement shipped expedited.
REPLY_TO_CUSTOMER:
I'm sorry. Your order showed up late and damaged, and that's on us. We're shipping a replacement today by expedited freight and refunding your shipping charge in full. You'll have a tracking number within the hour.
TONE_CHECK: Direct, accountable, not defensive.

Same model, same task. The skeleton version produces tighter, more usable output because it plays to how the model actually generates.
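If you build skeleton prompts often, it may help to generate them from a slot list and split the filled response back into fields. The sketch below is one way to do that with the standard library. The slot names come from the scaffolded prompt above; `build_skeleton_prompt` and `parse_slots` are hypothetical helpers, and the parser assumes each header starts its own line.

```python
import re

# Slot names and instructions copied from the scaffolded prompt above.
SLOTS = [
    ("ISSUE_SUMMARY", "one sentence stating the two problems"),
    ("ROOT_CAUSE", "most likely cause of each problem, one line each"),
    ("WHAT_WE_OWE", "concrete remedies: refund, reship, or both"),
    ("REPLY_TO_CUSTOMER", "3-4 sentence reply. Apologize once. "
                          "State the remedy. No corporate filler."),
    ("TONE_CHECK", "confirm the reply is not defensive, in 5 words or less"),
]

def build_skeleton_prompt(task, slots):
    """Render the task description followed by one labeled slot per line."""
    lines = [
        task,
        "Fill in this exact structure. Keep every header. "
        "Replace the bracketed placeholders with your content.",
        "",
    ]
    lines += [f"{name}: [{instruction}]" for name, instruction in slots]
    return "\n".join(lines)

def parse_slots(response, slot_names):
    """Split a filled skeleton into {header: content}.

    Assumes every header begins a line and is followed by a colon.
    """
    pattern = r"^(" + "|".join(re.escape(n) for n in slot_names) + r"):"
    # With one capture group, re.split yields
    # [prefix, name1, body1, name2, body2, ...].
    parts = re.split(pattern, response, flags=re.MULTILINE)
    return {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}

if __name__ == "__main__":
    task = ("A customer wrote in angry that their order shipped late "
            "and arrived damaged.")
    print(build_skeleton_prompt(task, SLOTS))
```

Parsing by header keeps downstream code independent of how long each slot's content turns out to be.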

Make the structure machine-readable when you can

The skeleton idea scales up. Mercury 2 supports schema-aligned JSON output natively, and diffusion models are well suited to infilling masked fields in JSON, YAML, or XML. If your downstream code expects a fixed shape, give the model that shape with empty fields and let it fill them. You skip the brittle parsing you'd need with a freeform autoregressive answer.
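As a rough sketch of the same idea in JSON, the snippet below sends a template with empty fields and checks that whatever comes back has the same shape. It uses only Python's `json` module and does not call Mercury 2's schema feature, whose exact API is not described here. The field names and the helpers `build_json_prompt` and `same_shape` are assumptions for illustration.

```python
import json

# Hypothetical template: empty values mark the fields the model should fill.
TEMPLATE = {
    "issue_summary": "",
    "root_cause": {"late_shipment": "", "damage": ""},
    "what_we_owe": [],
    "reply_to_customer": "",
}

def build_json_prompt(task, template):
    """Render the task followed by the empty JSON template."""
    return (
        f"{task}\n"
        "Return JSON with exactly these fields, filling every empty value:\n"
        f"{json.dumps(template, indent=2)}"
    )

def same_shape(template, filled):
    """True if `filled` has the template's keys and value types, recursively."""
    if isinstance(template, dict):
        return (
            isinstance(filled, dict)
            and template.keys() == filled.keys()
            and all(same_shape(template[k], filled[k]) for k in template)
        )
    return isinstance(filled, type(template))

if __name__ == "__main__":
    response = json.loads(
        '{"issue_summary": "Late and damaged.",'
        ' "root_cause": {"late_shipment": "Carrier delay.",'
        ' "damage": "Weak packaging."},'
        ' "what_we_owe": ["refund shipping", "reship"],'
        ' "reply_to_customer": "I am sorry."}'
    )
    print(same_shape(TEMPLATE, response))  # True
```

A shape check like this is deliberately shallow. It confirms keys and types, not that the content is sensible.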

One caveat worth a quick test before you build a pipeline on it. Some diffusion models won't infill a prompt-supplied template out of the box, because their fine-tuning only masked the response, not the prompt. Send your skeleton, confirm the model honors it and doesn't rewrite your headers, then commit.
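That quick test can be automated. The sketch below is one possible check, with a hypothetical helper name: it looks for each header exactly once, in the original order, and flags any bracketed placeholder left unfilled. The placeholder check is a heuristic and will also flag legitimate square brackets in the content.

```python
import re

def skeleton_violations(response, headers):
    """Return a list of problems; an empty list means the skeleton was honored."""
    problems = []
    positions = []
    for header in headers:
        # Each header must start a line and be followed by a colon.
        hits = [m.start() for m in
                re.finditer(rf"^{re.escape(header)}:", response, re.MULTILINE)]
        if len(hits) != 1:
            problems.append(
                f"{header}: expected 1 occurrence, found {len(hits)}")
        else:
            positions.append(hits[0])
    if positions != sorted(positions):
        problems.append("headers appear out of order")
    if re.search(r"\[[^\]\n]+\]", response):
        problems.append("an unfilled [placeholder] remains")
    return problems

if __name__ == "__main__":
    headers = ["ISSUE_SUMMARY", "ROOT_CAUSE", "TONE_CHECK"]
    rewritten = ("Summary: Late and damaged.\n"
                 "ROOT_CAUSE: [most likely cause]\n"
                 "TONE_CHECK: Not defensive.")
    for problem in skeleton_violations(rewritten, headers):
        print(problem)
```

Running a check like this against a handful of real responses before wiring up a pipeline shows quickly whether a given model treats the skeleton as structure to fill or as text to paraphrase.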

When to still reach for the trail

Don't retire chain-of-thought. The honest tradeoff: Mercury 2 runs at around 1,000 tok/s (independently clocked at up to 1,196), roughly 5 to 10 times faster than speed-optimized models like Claude 4.5 Haiku or GPT-5.2 Mini. But its intelligence sits in that same Haiku-class tier, not at the frontier. It scored 33 on the Artificial Analysis Intelligence Index, 22nd of 132 models. The speed is the headline. The reasoning depth is not.

So the split is about the job, not which architecture won.

Reach for autoregressive CoT (GPT-5.5, Claude 4.8) when the task needs genuine multi-step reasoning, long agentic tool chains, or a conclusion that must depend on intermediate work. That causal trail is a feature you're paying for.

Reach for diffusion scaffolding when latency is the constraint and the task fits a known shape: drafting, rewrites, summarization, structured extraction, classification, voice-loop replies. Natural voice turn-taking wants responses under 500ms, and a three-call autoregressive step at 200 tok/s can eat 7 to 8 seconds. Diffusion's fixed-step denoising gives you predictable p95 latency, so reasoning-grade output fits a real-time budget.
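The 7 to 8 second figure is easy to sanity-check. The sketch below assumes roughly 500 output tokens per call, a number back-solved from the figure above rather than stated in it, and counts generation time only: no network overhead, no prompt processing, no time-to-first-token.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Time to emit a response at a steady decode rate."""
    return output_tokens / tokens_per_second

def pipeline_seconds(calls, tokens_per_call, tokens_per_second):
    """Total generation time for sequential calls of equal length."""
    return calls * generation_seconds(tokens_per_call, tokens_per_second)

if __name__ == "__main__":
    TOKENS_PER_CALL = 500  # assumption, not from the article

    # Three sequential calls at the 200 tok/s autoregressive rate.
    print(f"{pipeline_seconds(3, TOKENS_PER_CALL, 200):.1f} s")   # 7.5 s

    # The same three calls at roughly 1,000 tok/s.
    print(f"{pipeline_seconds(3, TOKENS_PER_CALL, 1000):.1f} s")  # 1.5 s
```

Under those assumptions a single 500-token call at 1,000 tok/s takes about half a second, so staying inside a voice budget still depends on keeping each response short and the number of sequential calls low.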

Two more things to keep you honest. Time-to-first-token is actually slightly worse on diffusion, because the first step processes the whole sequence. It wins on total completion time, not first byte, so it shines when you show the complete response rather than streaming it. And the speed edge is largest when you own the GPU. On a hosted pay-per-token endpoint you keep the quality deficit but give back much of the practical speed gain.

The pattern that's winning in production is hybrid. Frontier autoregressive model on the planner where reasoning depth matters, diffusion on the narrow, fast, latency-bound sub-tasks where shape matters more than depth. Prompt each one the way it actually generates: a trail for the autoregressive planner, a skeleton for the diffusion worker.
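A minimal sketch of that split might look like the router below. Everything in it is hypothetical: the task labels, the `Task` fields, and the backend names stand in for whatever your stack uses. It only encodes the rule of thumb from this section, that work needing a causal chain goes to the autoregressive model and latency-bound, known-shape work goes to the diffusion model.

```python
from dataclasses import dataclass

# Task kinds listed above as a good fit for diffusion scaffolding.
SHAPED_KINDS = {
    "drafting", "rewrite", "summarization",
    "structured_extraction", "classification", "voice_reply",
}

@dataclass
class Task:
    kind: str                        # hypothetical label, e.g. "classification"
    latency_bound: bool = False      # is response time the main constraint?
    needs_causal_chain: bool = False # must the answer depend on earlier steps?

def route(task):
    """Return (backend, prompt_style) for a task."""
    if task.needs_causal_chain:
        return ("autoregressive", "trail")
    if task.latency_bound and task.kind in SHAPED_KINDS:
        return ("diffusion", "skeleton")
    # Default to the deeper model when the task fits neither case cleanly.
    return ("autoregressive", "trail")

if __name__ == "__main__":
    print(route(Task("voice_reply", latency_bound=True)))
    print(route(Task("agentic_tool_chain", needs_causal_chain=True)))
```

Defaulting to the autoregressive model is a conservative choice here: a slow, correct answer is usually easier to recover from than a fast, shallow one.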

If your team is moving real workloads onto diffusion models and wants to get the prompting patterns right the first time, Kief Studio runs hands-on prompt engineering training. Connect with us on Discord or schedule a session.
