
Delete Your Temperature Parameter: Why Self-Consistency Sampling Is Dead on Reasoning Models Like Gemini 3.5 Flash
Google's migration guide quietly killed the majority-vote trick you've leaned on for years. Here's what replaces it.
If you ported a project to gemini-3.5-flash and left temperature in your config, you have dead code. The parameter still parses. It does nothing.
When Gemini 3.5 Flash went generally available on May 19, 2026, Google's migration guide gave a blunt instruction: remove temperature, top_p, and top_k from your config. They aren't deprecated with a warning. They're silently ignored. Google's reasoning is short: "Gemini 3's reasoning capabilities are optimized for the default settings."
That one change breaks a technique a lot of us have shipped for years. Self-consistency sampling depends on a knob that no longer turns.
What self-consistency actually needed
Self-consistency, from Wang et al. in 2022, is simple and it worked. You ask the model to reason through a problem step by step. Then you run that same prompt many times at a non-zero temperature, collect the different answers, and take the majority vote.
The whole thing hinges on diversity. At temperature 0, the model is near-deterministic and you get the same trace every time. Voting on twenty identical answers tells you nothing. So you crank temperature to 0.5 or 1.0 to force the model down different reasoning paths, then let the consensus answer win.
The gains were real. On GSM8K math problems, greedy chain-of-thought scored 51.7%. Sample 30 paths at temperature 1.0 and majority-vote, and accuracy climbed to roughly 68%. For a long time, that was one of the best free upgrades in prompting.
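The sample-then-vote loop described above can be sketched in a few lines. This is a minimal illustration rather than the paper's code: `sample` is a hypothetical callable standing in for one model call at non-zero temperature, and the "ANSWER:" final-line convention is an assumption borrowed from the prompts later in this article.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Assumed convention: the final answer sits on a line starting with "ANSWER:".
ANSWER_RE = re.compile(r"^ANSWER:[ \t]*(.+?)[ \t]*$", re.MULTILINE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'ANSWER:' line, or None if there is none."""
    matches = ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


def self_consistency(
    sample: Callable[[str], str], prompt: str, n: int = 20
) -> Optional[str]:
    """Call `sample` n times and return the most common final answer.

    `sample` is a hypothetical stand-in for a model call made at a
    non-zero temperature; completions with no parseable answer are skipped.
    """
    votes = Counter()
    for _ in range(n):
        answer = extract_answer(sample(prompt))
        if answer is not None:
            votes[answer] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```

If every call returns the same trace, the Counter holds a single key and the vote adds nothing, which is why the technique depends on sampling diversity.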
Now read that again with the migration note in mind. The technique requires you to set temperature. On Gemini 3.5 Flash, setting it has no effect. The diversity knob is gone, so the mechanism has nothing to turn.
Why Google removed the knob
This isn't Google being difficult. Reasoning models already do internally what self-consistency did externally.
Gemini 3's Deep Think mode is described as parallel reasoning: the model fans out across multiple hypotheses at once instead of plodding down a single thought path, then converges on an answer. The fan-out-and-vote loop you used to orchestrate by hand now happens inside the model. Bolting your own sampling layer on top is redundant at best and, with temperature ignored, simply inert.
There's supporting evidence beyond Google's word. A 2026 study on automated scoring found that increasing ensemble size from 1 sample to 7 produced no significant accuracy gains, while raising reasoning effort showed a clear positive trend. More votes stopped paying. More thinking kept paying.
The replacement: thinking_level
The control you actually have now is thinking_level. It replaces the old integer thinking_budget token count with a string enum: minimal, low, medium, and high. On Gemini 3.5 Flash the default is medium.
That default matters more than it looks. During the preview window, gemini-3-flash-preview defaulted to a higher reasoning level. If you lift-and-shift to gemini-3.5-flash and don't set thinking_level explicitly, your app reasons less than it did before. Nobody changed the prompt, but the outputs got subtly worse. If you run client workloads, that's a quality regression waiting to land in someone's inbox.
One more trap: you can't send both thinking_budget and thinking_level in the same request. The API rejects it outright. Pick the new one.
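A migration helper along these lines can enforce those rules before a request is ever built. It is a sketch over a plain dictionary: the flat key layout and the `migrate_config` name are assumptions for illustration, and a real SDK may nest these fields inside a config object. It deliberately does not map a numeric thinking_budget to a level, since no such mapping is given here.

```python
# Keys the migration guide says to remove, per the article.
SAMPLING_KEYS = ("temperature", "top_p", "top_k")
# Allowed thinking_level values, per the article.
THINKING_LEVELS = ("minimal", "low", "medium", "high")


def migrate_config(old: dict, thinking_level: str) -> dict:
    """Return a migrated copy of `old` (hypothetical flat config layout).

    Drops the sampling keys and thinking_budget, then sets thinking_level
    explicitly so the result never relies on the model default.
    """
    if thinking_level not in THINKING_LEVELS:
        raise ValueError(f"unknown thinking_level: {thinking_level!r}")
    if "thinking_budget" in old and "thinking_level" in old:
        # Mirrors the rule that the two fields cannot be sent together.
        raise ValueError("config sets both thinking_budget and thinking_level")
    new = {
        key: value
        for key, value in old.items()
        if key not in SAMPLING_KEYS and key != "thinking_budget"
    }
    new["thinking_level"] = thinking_level
    return new
```

Requiring the caller to pass the level makes the choice visible in code review instead of leaving it to whatever the default happens to be.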
Here's the before and after, with a verifiable-answer task as the example.
The Prompt (old self-consistency approach):
# Config: temperature=1.0, top_p=0.95, sampled 20 times, majority vote in app code
You are solving a math competition problem. Think step by step,
show all work, then state your final answer on a line beginning
with "ANSWER:".
Problem: A train leaves station A at 60 mph. A second train leaves
station B, 280 miles away, 30 minutes later at 80 mph, heading
toward A. How many miles from A do they meet?
The Prompt (migrated for Gemini 3.5 Flash):
# Config: ThinkingConfig(thinking_level="high"), single call, no temperature/top_p/top_k
Solve this math competition problem. State your final answer on a
line beginning with "ANSWER:".
Problem: A train leaves station A at 60 mph. A second train leaves
station B, 280 miles away, 30 minutes later at 80 mph, heading
toward A. How many miles from A do they meet?
Why This Works:
You drop the manual "think step by step" scaffolding because the reasoning model does that internally, and you replace twenty sampled calls plus app-side vote-tallying with a single call at thinking_level="high". One request instead of twenty, no voting logic to maintain, and the model's native parallel reasoning carries the load the temperature trick used to.
Expected Output:
[Internal reasoning at high thinking level: the first train travels 30 minutes before the second departs, covering 30 miles and leaving 250 miles between them. Combined closing speed is 140 mph, so they meet 250/140 hours later. Distance from A: 30 + 60 × (250/140) = 960/7 miles.]
ANSWER: 137.14 miles from A (exactly 137 1/7).
Notice what's gone from the code, not just the prompt. No sampling loop. No temperature. No parser collecting twenty answers into a Counter. The orchestration you maintained got deleted along with the parameter.
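Because the train problem has a verifiable answer, the meeting point can be checked with exact arithmetic instead of trusting any single model output. The sketch below assumes the first train heads from A toward B, which the problem statement implies but does not spell out; the function name is illustrative.

```python
from fractions import Fraction


def meeting_distance_from_a(gap_miles, speed_a, speed_b, delay_hours):
    """Distance from A at which two trains approaching each other meet.

    The train from A departs first; the train from B departs
    `delay_hours` later. Speeds are in mph, distances in miles.
    """
    head_start = speed_a * delay_hours          # miles covered before B departs
    remaining = gap_miles - head_start          # gap left when B departs
    time_to_meet = Fraction(remaining) / (speed_a + speed_b)
    return head_start + speed_a * time_to_meet


d = meeting_distance_from_a(280, 60, 80, Fraction(1, 2))
print(d, round(float(d), 2))  # 960/7 137.14
```

A check like this is the kind of verifier that makes a task "verifiable" in the sense used throughout this article.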
When fan-out voting still earns its keep
"Self-consistency is dead" is too strong if you say it about everything. It's dead on reasoning models, as a temperature trick. It's alive elsewhere, and knowing the line is the actual skill.
Use this rule.
If you're on a reasoning model (Gemini 3.x, the o-series) and the answer is verifiable, drop temperature and voting. Set thinking_level and let the model reason. Move it from medium to high for hard problems.
If you're stuck on a cheap or non-reasoning model you genuinely can't replace, and the answer is verifiable, self-consistency still helps. But don't run fixed-N voting. Use an adaptive variant. Adaptive-Consistency cut sample counts by about 7.9x with under 0.1% accuracy loss by stopping early once the vote converged. Fixed 20-sample loops are wasteful even where voting works.
If the output has no single correct answer, voting was never the right tool. Majority vote needs comparable answers to tally: a number, a class label, a diagnosis. Vote among twenty poems and you don't find the best one, you find the most common one. For creative or open-ended work, use a judge model with a rubric or a human, regardless of which model generated it.
One caution worth keeping honest: cheap models were always the weakest case for self-consistency. The original paper noted gains are lower for smaller models, because capabilities like reliable arithmetic emerge at scale. The folk wisdom of "small model, so vote more" is shakier than it sounds. The clean remaining win for fan-out is non-reasoning models you can't swap out, on verifiable answers, with an adaptive stopping rule.
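To make the idea of an adaptive stopping rule concrete, here is a deliberately simple sketch. It is not the Adaptive-Consistency criterion itself; it swaps in a plainer deterministic rule: stop as soon as the leading answer is so far ahead that the remaining samples could not change the winner. `sample_answer` is a hypothetical callable that returns one parsed answer per model call.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def vote_with_early_stop(
    sample_answer: Callable[[], Optional[str]], max_samples: int = 20
) -> Tuple[Optional[str], int]:
    """Majority vote that stops once the leader cannot be overtaken.

    Returns (winning answer or None, number of samples actually drawn).
    `sample_answer` is a hypothetical stand-in for one model call plus
    answer parsing; a None result counts as a draw with no vote.
    """
    votes = Counter()
    used = 0
    for used in range(1, max_samples + 1):
        answer = sample_answer()
        if answer is not None:
            votes[answer] += 1
        ranked = votes.most_common(2)
        if not ranked:
            continue
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        lead = ranked[0][1] - runner_up
        # Even if every remaining sample went to the runner-up,
        # it could no longer pass the leader.
        if lead > max_samples - used:
            break
    winner = votes.most_common(1)[0][0] if votes else None
    return winner, used
```

With a 20-sample cap and a model that always agrees with itself, this stops after 11 calls. Statistical criteria can stop much earlier by accepting a small, quantified risk of disagreeing with the full vote.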
What to do this week
Grep your configs for temperature, top_p, and top_k on any Gemini 3.x call and delete them. Replace thinking_budget with thinking_level, and set it explicitly so you don't inherit the silent drop to medium. If you had self-consistency orchestration around a reasoning model, tear it out and raise thinking_level instead. If you had it around a cheap model you're keeping, swap fixed-N for an adaptive variant.
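If plain grep is awkward in your setup, the first step of that checklist can be approximated with a small scanner. This is a rough sketch: the function name is illustrative, and it flags every occurrence of the parameter names in a piece of text, so a person still has to decide which hits belong to Gemini 3.x calls.

```python
import re
from typing import List, Tuple

# Parameter names the article says to remove or replace.
LEGACY_PARAMS = re.compile(r"\b(temperature|top_p|top_k|thinking_budget)\b")


def find_legacy_params(text: str) -> List[Tuple[int, str]]:
    """Return (line number, parameter name) for each legacy parameter found."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_PARAMS.finditer(line):
            hits.append((lineno, match.group(1)))
    return hits
```

Run it over the contents of each config or source file and review the reported lines by hand before deleting anything.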
The parameter is gone, but the instinct behind it was right: spend more compute on hard problems. The spend just moved from your sampling loop into the model's reasoning budget.
If your team is sorting out which prompting techniques survive the move to reasoning models and which are now dead weight, we run live, hands-on prompt engineering training. Connect with Kief Studio on Discord or schedule a session.
