Tool-Use Prompt Design That Prevents Hallucinated Function Calls in Production
Why your agent invents tools that don't exist and the three-layer fix that stops it
Your agent just called a function that doesn't exist. The schema was valid JSON. The parameter types were correct. The function name was completely made up.
This is tool-call hallucination, and it's the failure mode that kills agentic deployments faster than anything else. Schema validation won't catch it. Your tests won't catch it either, because the output looks right. It parses. It just references a tool your system never registered.
Why Models Invent Tools
LLMs don't have a registry of your tools. They have a context window with tool descriptions in it, and they predict the next token. When the model needs to call a function, it's pattern-matching against those descriptions. Give it twenty similar tools and it starts blending them. Give it a task where no tool fits and it invents one that sounds like it should.
Research from the "LLM-based Agents Suffer from Hallucinations" survey (arXiv 2509.18970, September 2025) found that tool selection errors increase with tool count, especially among similar tools. The counterintuitive part: training on more successful tool-call examples made hallucination rates worse, not better. Models overfit on happy-path trajectories where the right tool always exists. They never learn to say "I don't have a tool for that."
The other common pattern is parameter fabrication. The model needs an email address it doesn't have, so it generates one. It needs a user ID, so it picks a plausible-looking number. The function call is structurally perfect and semantically garbage. Giskard's security testing documented this pattern across production chatbots: book_hotel(guests=15) when the tool's docstring says maximum 10 guests. The model treats natural-language constraints as suggestions.
The Three Layers That Actually Fix This
The industry is converging on a three-layer defense. Each layer is a few lines of code. Each catches a different class of failure. You need all three.
Layer 1: Schema Enforcement (Structural)
This is the table stakes layer. OpenAI's strict: true mode uses a context-free grammar engine to mask invalid tokens before generation. The model literally cannot produce a response that violates your JSON Schema. Claude, Gemini, and other providers have equivalent mechanisms.
Schema enforcement eliminates structural hallucinations: wrong types, missing required keys, invalid enum values. GPT-5.2's complex JSON reliability jumped from roughly 82% to over 92% with strict mode enabled.
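If your provider doesn't expose a strict mode, you can approximate the structural check yourself. A minimal sketch (the tool registry, field names, and values here are illustrative, not from any real system):

```python
# Structural validation of a proposed tool call: known name, required keys,
# correct types. A hand-rolled stand-in for provider-side strict mode.
TOOL_SCHEMAS = {
    "send_email": {
        "required": {"recipient_email": str, "subject": str, "body": str},
    },
}

def structurally_valid(call: dict) -> bool:
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    if schema is None:
        return False  # unknown tool name fails immediately
    for key, expected_type in schema["required"].items():
        if key not in call.get("args", {}):
            return False  # missing required key
        if not isinstance(call["args"][key], expected_type):
            return False  # wrong type
    return True

# A structurally perfect call passes even if the address is fabricated:
call = {"tool": "send_email",
        "args": {"recipient_email": "jo@example.com",
                 "subject": "Hi", "body": "..."}}
print(structurally_valid(call))  # True -- structure says nothing about grounding
```

Note what this check cannot see: whether `jo@example.com` was ever mentioned by the user.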
Here's the problem. Most teams stop here. They ship strict: true and assume they're covered. They're not.
A schema-valid call like {"tool": "transfer_money", "amount": 50000, "account": "1234567890"} passes every structural check. The account number is hallucinated. The amount might exceed policy limits. The schema doesn't know and doesn't care.
Layer 2: Semantic Validation (Content)
This layer checks that parameter values are grounded in conversation context. It's where you enforce business rules, preconditions, and referential integrity.
The Prompt (system-level tool definition):
You have access to the following tools. You MUST only use tools listed here.
If no tool matches the user's request, respond in plain text explaining what
you cannot do. NEVER invent tool names or parameters.
## send_email
Sends an email to a verified contact.
- recipient_email: MUST be an email address explicitly provided by the user
in this conversation. NEVER generate or guess email addresses.
- subject: string, max 200 characters
- body: string, max 5000 characters
PRECONDITIONS: Do not call send_email unless the user has provided a
recipient address in a previous message. If they say "email John" without
an address, ask for John's email first.
Why This Works: The precondition block converts implicit assumptions into explicit rules the model can follow. "MUST be an email address explicitly provided by the user" gives the model a clear decision boundary. It doesn't need to guess whether fabricating an address is acceptable. The instruction to respond in plain text when no tool matches gives the model an explicit escape hatch instead of forcing it to pick the closest tool.
Expected Output (when user says "email John about the meeting"):
I'd be happy to send that email. What's John's email address?
Without the precondition, the model would generate something like send_email(recipient_email="[email protected]", ...) and your system would either bounce the email or, worse, send it to a real stranger.
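The same precondition can also be enforced mechanically before execution, not just stated in the prompt. A sketch, assuming a simple conversation-history list (the regex and function name are illustrative):

```python
import re

# Loose email pattern for extracting addresses from prior messages.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def recipient_is_grounded(recipient: str, conversation: list[str]) -> bool:
    """True only if the exact address appears in a prior message."""
    return any(recipient in EMAIL_RE.findall(msg) for msg in conversation)

history = ["Email John about the meeting."]
print(recipient_is_grounded("john.smith@company.com", history))  # False: fabricated

history.append("John's address is john@acme.io")
print(recipient_is_grounded("john@acme.io", history))  # True: explicitly provided
```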
Layer 3: Runtime Guardrails (Post-Call)
This layer validates after the model produces a tool call but before your system executes it. Think of it as an assertion layer.
The Prompt (validation wrapper):
You are a tool-call validator. Given a proposed function call and the
conversation history, check three things:
1. TOOL EXISTS: Is the function name in the registered tool list? If not,
reject with "UNKNOWN_TOOL".
2. VALUES GROUNDED: Is every parameter value either (a) explicitly stated in
the conversation, (b) a reasonable default documented in the tool schema,
or (c) computed from grounded values? If any value appears fabricated,
reject with "UNGROUNDED_PARAMETER" and name the field.
3. PRECONDITIONS MET: Are all documented preconditions for this tool
satisfied by the conversation state? If not, reject with
"PRECONDITION_FAILED" and state which one.
Respond with either APPROVED or REJECTED: {reason}.
Why This Works: A second LLM pass (or a deterministic validator if your tool set is simple enough) catches what schema enforcement misses. The three-check structure gives the validator a concrete rubric instead of an open-ended "is this ok?" question. Cleanlab's work on the Tau-Bench benchmark showed that even simple trust scoring with fallback strategies cut agent failure rates by up to 50% with no model changes.
Expected Output (for a fabricated account number):
REJECTED: UNGROUNDED_PARAMETER -- "account" value "1234567890" does not appear in conversation history and is not a documented default.
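When the tool set is small, the same three checks can be a deterministic function instead of a second LLM pass. A sketch, with an assumed registry shape and a hypothetical policy limit:

```python
# Deterministic version of the three-check validator. The registry shape
# and the $10,000 policy limit are assumptions for illustration.
REGISTRY = {
    "transfer_money": {
        "fields": ["amount", "account"],
        "preconditions": [lambda args: args["amount"] <= 10_000],
    },
}

def validate_call(name: str, args: dict, history: str) -> str:
    if name not in REGISTRY:                       # check 1: tool exists
        return "REJECTED: UNKNOWN_TOOL"
    for field in REGISTRY[name]["fields"]:         # check 2: values grounded
        if str(args[field]) not in history:
            return f'REJECTED: UNGROUNDED_PARAMETER -- "{field}"'
    for check in REGISTRY[name]["preconditions"]:  # check 3: preconditions
        if not check(args):
            return "REJECTED: PRECONDITION_FAILED"
    return "APPROVED"

history = "Please move 500 to my savings account 4402."
print(validate_call("transfer_money", {"amount": 500, "account": "4402"}, history))
# APPROVED
print(validate_call("transfer_money", {"amount": 500, "account": "1234567890"}, history))
# REJECTED: UNGROUNDED_PARAMETER -- "account"
```

Substring matching on history is crude; production systems would check against extracted entities. The structure of the three checks is the point.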
The Deny-List Pattern for Tool Selection
When your agent has access to many tools, you need to constrain which ones it considers for a given task. Explicit deny-lists outperform vague instructions.
The Prompt:
For this task, you may ONLY use these tools: search_contacts, send_email.
Do NOT use: delete_contact, modify_account, export_data.
If the user's request requires a tool not in the allowed list, explain that
you cannot perform that action in this context.
Why This Works: Naming forbidden tools is more effective than just listing allowed ones. The model's attention mechanism weighs explicitly mentioned items heavily. When you say "do NOT use delete_contact," the model is less likely to confuse it with a similarly-named allowed tool. This is the same principle behind prompt injection defenses: explicit denial outperforms implicit exclusion.
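The allow/deny split is also worth enforcing in code, so a denied tool never reaches the model's context at all. A sketch using the tool names from the prompt above:

```python
# Filter the registry before building the model's tool list. Denied tools
# are excluded from the context entirely, not merely discouraged.
ALL_TOOLS = {"search_contacts", "send_email", "delete_contact",
             "modify_account", "export_data"}

def tools_for_task(allowed: set[str], denied: set[str]) -> set[str]:
    """Return only explicitly allowed, registered tools; fail loudly
    on a tool that appears in both lists."""
    conflict = allowed & denied
    if conflict:
        raise ValueError(f"tool in both lists: {conflict}")
    return (allowed & ALL_TOOLS) - denied

print(sorted(tools_for_task({"search_contacts", "send_email"},
                            {"delete_contact", "modify_account", "export_data"})))
# ['search_contacts', 'send_email']
```

Belt and suspenders: the prompt-level deny-list handles the model's attention, the code-level filter guarantees a denied tool cannot be called even if the model ignores the prompt.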
MCP Tool Descriptions: Keep Them Tight
If you're building with the Model Context Protocol, tool descriptions deserve special attention. Palo Alto's Unit 42 and Microsoft both documented attacks where tool descriptions contain hidden prompt payloads. A malicious MCP server provides a tool description that injects instructions into your agent's context.
The defense is simple: treat tool descriptions as untrusted input. Keep your own descriptions short, declarative, and free of instructions that could be confused with system prompts. One sentence describing what the tool does. Parameter descriptions that state constraints, not suggestions. No narrative, no examples in the description itself. Put examples in your system prompt where you control the context.
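As a concrete illustration, a tight description looks like this (expressed here as a Python dict following the MCP tool-listing shape of name, description, and inputSchema; the tool itself is hypothetical):

```python
# One declarative sentence; constraints live in the schema, not in prose.
# No narrative, no examples, nothing that reads like an instruction.
tool = {
    "name": "lookup_invoice",
    "description": "Returns the invoice with the given ID.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "invoice_id": {"type": "string", "pattern": "^INV-[0-9]{6}$"},
        },
        "required": ["invoice_id"],
    },
}
```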
What This Looks Like Together
Your production tool-use setup needs three things:
- Schema enforcement on every tool call. Use strict: true, Pydantic with the Instructor library, or Zod on the TypeScript side. This is one line of configuration.
- Precondition blocks in every tool definition. State what must be true before the tool runs. State where parameter values must come from. This is two to three lines per tool.
- A validation pass between model output and tool execution. Deterministic checks for tool existence and type conformance. LLM-based or rule-based checks for value grounding. This is a function call wrapper.
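Wired together, the execution path is a thin wrapper that runs the layers in order, cheapest first. A sketch with trivial stand-in helpers (the tool, its registry, and the grounding check are all illustrative):

```python
# End-to-end wrapper showing layer ordering. Real systems would plug in
# provider strict mode, a grounding validator, and actual tool functions.
TOOLS = {"get_weather": lambda args: f"72F in {args['city']}"}

def structurally_valid(call: dict) -> bool:
    return call.get("tool") in TOOLS and isinstance(call.get("args"), dict)

def grounded(call: dict, history: str) -> bool:
    return all(str(v) in history for v in call["args"].values())

def run_tool_call(call: dict, history: str) -> str:
    if not structurally_valid(call):      # layer 1: structure
        return "refused: unknown tool or bad shape"
    if not grounded(call, history):       # layers 2-3: grounding + preconditions
        return "refused: ungrounded parameter"
    return TOOLS[call["tool"]](call["args"])  # execute only after every check

history = "What's the weather in Boston?"
print(run_tool_call({"tool": "get_weather", "args": {"city": "Boston"}}, history))
# 72F in Boston
print(run_tool_call({"tool": "get_forecast", "args": {"city": "Boston"}}, history))
# refused: unknown tool or bad shape
```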
The Instructor library for Python (over 3 million monthly downloads) handles layers 1 and 3 together: Pydantic validation with automatic retry when the model produces invalid output. It's the most adopted production pattern for this problem.
None of this is complicated. Each layer is a few lines of code. But skipping any one of them leaves a class of hallucination completely unchecked. Schema enforcement without semantic validation is a false sense of security. Semantic validation without runtime guardrails is a hope-based architecture.
Your agent will try to invent tools. Make sure your system says no before anything runs.
Want hands-on training on building reliable agentic systems for your team? Connect with Kief Studio on Discord or schedule a session.