semi-formal reasoning prompt June 29, 2026 • 6 min read

Stop Trusting Free-Form Chain-of-Thought. Make Your Model Sign a Logical Certificate Instead

Meta's semi-formal reasoning template forces premises, traced paths, and a derived conclusion, so you can see exactly where the model is bluffing.

The most dangerous output an LLM gives you isn't the obviously wrong one. It's the confident, well-formatted, wrong one. The answer that reads like it was written by someone who knows what they're talking about, lands cleanly, and survives a skim review because nothing about it looks off.

"Think step by step" makes this worse, not better. You ask for chain-of-thought, the model narrates a tidy sequence of reasoning, and you walk away trusting the conclusion because the steps looked like steps. But narration is cheap. A model can produce a fluent five-paragraph rationale for a verdict it pulled out of thin air, and the rationale will read exactly as well as one backed by real evidence.

There's a better pattern, and Meta just published hard numbers on it.

The receipt, not the story

In March 2026, Shubham Ugare and Satish Chandra at Meta released a paper on what they call semi-formal reasoning (arXiv 2603.01896). The setup: ask an agent to verify whether two versions of a code patch are equivalent, without running the code.

Standard agentic reasoning got 78% on their curated examples. The semi-formal version hit 88%, and 93% on real agent-generated patches. On a separate code Q&A benchmark, Claude Opus 4.5 jumped to 87% accuracy, a gain of nearly 11 points over a single-shot answer. On fault localization, up to 12 points.

The difference wasn't a smarter model or a bigger context window. It was a form. Instead of letting the model freestyle, they made it fill out a logical certificate before reaching a verdict.

A certificate has four required fields:

  • Definitions and Scope. What do these functions, inputs, and tests actually mean? Pin down the terms before reasoning about them.
  • Premises. The specific assumptions the model is allowed to use. Nothing else. If a fact isn't a premise, it can't appear in the conclusion.
  • Traced Paths. Step-by-step traces of how control and data flow actually move, per specific case. This is the load-bearing field. It forces the model to follow what the code does instead of guessing from what a function is named.
  • Conclusion. A crisp verdict. And when the verdict is "not equivalent," a concrete counterexample that proves it.

The design goal is completeness. The agent "cannot skip cases or make unsupported claims." And that's the whole trick: a bluffing model hand-waves right where the Traced Paths box sits empty. The blank is where the bluff becomes visible.
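To make the "empty box" idea concrete, here is a minimal sketch of the four fields as a data structure with a completeness check. The class names (`Certificate`, `TracedPath`) and the `gaps` method are hypothetical, invented for illustration; they are not taken from the paper, which describes a prompt template rather than a code API.

```python
from dataclasses import dataclass, field


@dataclass
class TracedPath:
    """One concrete input traced through both functions (hypothetical type)."""
    input_repr: str
    steps_a: list[str]
    steps_b: list[str]
    outputs_match: bool


@dataclass
class Certificate:
    """The four fields of the certificate, as plain data (hypothetical type)."""
    definitions: str
    premises: list[str]
    traced_paths: list[TracedPath] = field(default_factory=list)
    verdict: str = ""          # "EQUIVALENT" or "NOT EQUIVALENT"
    counterexample: str = ""   # expected when the verdict is NOT EQUIVALENT

    def gaps(self) -> list[str]:
        """Return the places a reviewer should treat as unsupported."""
        problems = []
        if not self.definitions.strip():
            problems.append("definitions missing")
        if not self.premises:
            problems.append("no premises listed")
        if not self.traced_paths:
            problems.append("no traced paths")
        for path in self.traced_paths:
            if not path.steps_a or not path.steps_b:
                problems.append(f"trace for {path.input_repr} is incomplete")
        if self.verdict not in ("EQUIVALENT", "NOT EQUIVALENT"):
            problems.append("verdict missing or malformed")
        if self.verdict == "NOT EQUIVALENT" and not self.counterexample.strip():
            problems.append("NOT EQUIVALENT without a counterexample")
        if self.verdict == "EQUIVALENT" and any(
            not p.outputs_match for p in self.traced_paths
        ):
            problems.append("verdict contradicts a mismatching trace")
        return problems


# A confident verdict with nothing traced: the gap is reported, not hidden.
bluff = Certificate(
    definitions="Both take a list and return a filtered list.",
    premises=["Input is a list of integers."],
    verdict="EQUIVALENT",
)
print(bluff.gaps())  # ['no traced paths']
```

The check only detects missing or self-contradictory fields. It says nothing about whether a filled-in trace is accurate, which still needs a reader or a re-run.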

Same task, two prompts

Here's what the lazy version looks like.

The naive prompt:

Here are two versions of a function. Are they equivalent?
Think step by step.

[function A]
[function B]

You'll get something like: "Let's think step by step. Both functions take a list and return a filtered result. Version A uses a loop, Version B uses a comprehension. They appear to perform the same filtering logic. Therefore they are equivalent." Confident. Readable. And it never actually traced a single input through both functions. If B silently drops the last element, or mishandles an edge case like an empty list, this answer sails right past it.

Now the certificate version.

The prompt:

Determine whether function A and function B are equivalent.
Do NOT guess from function names or structure. Complete every
field below. If you cannot fill a field with concrete evidence,
write "INSUFFICIENT" rather than asserting.

DEFINITIONS & SCOPE
- What each function takes, returns, and mutates.
- What "equivalent" means for this pair (same output? same
  side effects? same behavior on errors?).

PREMISES
- List only the assumptions you are allowed to use.
- Mark any assumption you cannot verify from the code as
  UNVERIFIED.

TRACED PATHS
- Pick at least 3 concrete inputs, including edge cases
  (empty, null, boundary, duplicate).
- For each input, trace the actual execution of BOTH functions
  line by line and record the value at each step.

CONCLUSION
- Verdict: EQUIVALENT or NOT EQUIVALENT.
- If NOT EQUIVALENT, give one concrete input where the outputs
  differ, with both outputs shown.

Why This Works: The Traced Paths field makes narration impossible. The model can't write "they appear to perform the same logic" because the form demands actual values at each step for specific inputs, and the Conclusion is only allowed to draw on what those traces produced. You've turned "convince me" into "show me the receipt."
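If you want to reuse the certificate prompt across many function pairs, one option is to build it from a template. The sketch below assumes you have both function sources as strings; `build_certificate_prompt` and its `min_inputs` parameter are hypothetical names, and the template text is simply the prompt from this article.

```python
CERTIFICATE_TEMPLATE = """\
Determine whether function A and function B are equivalent.
Do NOT guess from function names or structure. Complete every
field below. If you cannot fill a field with concrete evidence,
write "INSUFFICIENT" rather than asserting.

DEFINITIONS & SCOPE
- What each function takes, returns, and mutates.
- What "equivalent" means for this pair (same output? same
  side effects? same behavior on errors?).

PREMISES
- List only the assumptions you are allowed to use.
- Mark any assumption you cannot verify from the code as
  UNVERIFIED.

TRACED PATHS
- Pick at least {min_inputs} concrete inputs, including edge cases
  (empty, null, boundary, duplicate).
- For each input, trace the actual execution of BOTH functions
  line by line and record the value at each step.

CONCLUSION
- Verdict: EQUIVALENT or NOT EQUIVALENT.
- If NOT EQUIVALENT, give one concrete input where the outputs
  differ, with both outputs shown.

FUNCTION A
{source_a}

FUNCTION B
{source_b}
"""


def build_certificate_prompt(source_a: str, source_b: str, min_inputs: int = 3) -> str:
    """Fill the certificate template with two function sources (hypothetical helper)."""
    if min_inputs < 1:
        raise ValueError("min_inputs must be at least 1")
    # Braces inside the sources are safe: only the template is parsed for fields.
    return CERTIFICATE_TEMPLATE.format(
        min_inputs=min_inputs,
        source_a=source_a.strip(),
        source_b=source_b.strip(),
    )


prompt = build_certificate_prompt(
    "def a(xs):\n    return [x for x in xs if x > 1]",
    "def b(xs):\n    return [x for x in xs if x >= 1]",
)
print(prompt)
```

Keeping the template in one place means every verification request carries the same required fields, so responses can be compared and audited the same way.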

Expected Output:

Definitions & Scope: Both take a list of integers and return a list. A mutates nothing; B mutates nothing. "Equivalent" here means identical return value for identical input.

Premises: Input is always a list (may be empty). Elements are integers. No concurrency.

Traced Paths:
- Input []: A returns [] (loop body never runs). B returns [] (comprehension over empty). Match.
- Input [2, 4, 5]: A appends 2, 4, skips 5 -> [2, 4]. B yields 2, 4 -> [2, 4]. Match.
- Input [1, 1, 2]: A returns [2] (A uses > not >=). B returns [1, 1, 2] (B uses >=). Differ.

Conclusion: NOT EQUIVALENT. Counterexample: [1, 1, 2] returns [2] from A and [1, 1, 2] from B.

The naive prompt would have called these equivalent. The certificate caught the > versus >= bug because the form refused to let the model skip the duplicate-value case, whose repeated 1s sit exactly on the boundary.
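A certificate with concrete inputs is also easy to audit by hand or by machine. The article does not show the two functions, so the pair below is a hypothetical reconstruction: it assumes both keep values below 5 and differ only in the lower bound (`> 1` versus `>= 1`), which is one pair consistent with all three traces. Note that the paper's setup has the model reason without running code; re-executing the claimed inputs, as here, is a check a reviewer might add afterwards.

```python
def filter_a(values):
    """Hypothetical version A: explicit loop, strict lower bound."""
    kept = []
    for v in values:
        if v > 1 and v < 5:
            kept.append(v)
    return kept


def filter_b(values):
    """Hypothetical version B: comprehension, inclusive lower bound."""
    return [v for v in values if v >= 1 and v < 5]


def audit(inputs):
    """Run both versions on each traced input and record whether they agree."""
    rows = []
    for case in inputs:
        # Pass copies so neither function can affect the other's input.
        out_a, out_b = filter_a(list(case)), filter_b(list(case))
        rows.append((case, out_a, out_b, out_a == out_b))
    return rows


# The three inputs from the certificate's Traced Paths field.
for case, out_a, out_b, match in audit([[], [2, 4, 5], [1, 1, 2]]):
    print(case, out_a, out_b, "Match" if match else "Differ")
# [] [] [] Match
# [2, 4, 5] [2, 4] [2, 4] Match
# [1, 1, 2] [2] [1, 1, 2] Differ
```

If a re-run disagrees with a line in the certificate, you know exactly which trace to distrust, which is the kind of auditability the template is meant to provide.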

The honest catch

Two things you need to hear before you treat this as magic.

First, it only helped the stronger model. Opus 4.5 made big gains. Sonnet 4.5 essentially flatlined with or without the template. A certificate is a forcing function, not a fixer. It makes a capable model do the work it was skipping. It can't manufacture reasoning ability that isn't there. If your base model can't trace execution, no form will save you.

Second, it costs. About 2.8 times more reasoning steps. You're buying accuracy with inference-time compute, and that means latency and dollars. The paper didn't measure the token bill, but you will feel it. This is a tool for high-stakes calls, not for every prompt you fire off.

And the deepest caveat, stated plainly: the certificate doesn't make the answer true. It makes a wrong answer auditable. When the model is wrong now, it's wrong in a specific premise or a specific trace you can point at and check. That's the actual win. You've moved from "trust the fluent paragraph" to "verify the line that's supposed to be load-bearing."

This isn't just for code

The pattern generalizes to any analysis where a polished wrong answer costs you. A contract-risk read. A competitive pricing comparison. A compliance check. A "should we take this client" memo. Anywhere you'd otherwise get a confident summary with no way to audit it.

Here's the consumer-grade certificate for non-coders. Drop it into any high-stakes prompt:

Answer in this exact structure:
MAIN FINDING: one sentence.
SUPPORTING EVIDENCE: each claim as a direct quote or specific
  reference from the source material. No claim without a source.
CONFIDENCE: 0 to 10, and why.
WEAKEST POINTS: the 3 places this conclusion is most likely
  wrong, and what would prove it wrong.

Same mechanism. The evidence field blocks vague generality. The confidence score forces the model to commit. The weakest-points field is self-refutation, which is the cheapest red-team you'll ever run. Models optimize for plausibility, and when you remove the constraints, plausibility collapses into polite generalities. A certificate template is the constraint.
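A structured answer can be checked mechanically before a human reads it. The sketch below assumes the model follows the four-field layout above, with each field name at the start of a line followed by a colon and each weakest point on its own line; `split_fields` and `check_response` are hypothetical helpers, and the sample response is invented purely for illustration.

```python
import re

REQUIRED_FIELDS = ["MAIN FINDING", "SUPPORTING EVIDENCE", "CONFIDENCE", "WEAKEST POINTS"]


def split_fields(response: str) -> dict[str, str]:
    """Split a response into {field name: body} using the field labels."""
    pattern = re.compile(r"^(%s):" % "|".join(REQUIRED_FIELDS), re.MULTILINE)
    matches = list(pattern.finditer(response))
    fields = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        fields[m.group(1)] = response[m.end():end].strip()
    return fields


def check_response(response: str) -> list[str]:
    """Return structural problems; an empty list means the form was filled in."""
    fields = split_fields(response)
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not fields.get(name)]
    confidence = fields.get("CONFIDENCE", "")
    score = re.match(r"(\d+)", confidence)
    if confidence and (score is None or not 0 <= int(score.group(1)) <= 10):
        problems.append("confidence is not a number from 0 to 10")
    weakest = fields.get("WEAKEST POINTS", "")
    points = [line for line in weakest.splitlines() if line.strip()]
    if weakest and len(points) < 3:
        problems.append("fewer than 3 weakest points")
    return problems


# Invented sample: well-formed except that only one weak point is given.
sample = """MAIN FINDING: The contract caps liability at fees paid.
SUPPORTING EVIDENCE: Section 9.2: "liability shall not exceed fees paid".
CONFIDENCE: 7, the cap is explicit but carve-outs were not reviewed.
WEAKEST POINTS: The indemnity clause may override the cap.
"""
print(check_response(sample))  # ['fewer than 3 weakest points']
```

This only verifies that the form was completed, not that the quoted evidence exists in the source material. Checking the quotes against the source is the next step.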

None of this replaces a human reading the trace. It just means that when you read it, the bluff has nowhere to hide.

If your team keeps shipping decisions on the back of confident, unverifiable AI output, this is the habit worth drilling.
