The Supervisor-Worker Agent Pattern: One Smart Model Plans, Cheap Models Execute

How to architect multi-agent systems that cost 90% less than running Opus on everything

Most multi-agent systems fail. Not sometimes. Not edge cases. A study of 1,642 execution traces across seven open-source frameworks found failure rates between 41% and 87%. Google DeepMind tested 180 configurations and found that unstructured multi-agent networks amplify errors up to 17x compared to a single agent working alone.

So why is this article about building a multi-agent system?

Because one specific pattern actually works. The supervisor-worker topology -- one smart model plans, cheap models execute -- is the exception to the multi-agent failure rate. It's also the pattern that turned a $0.93/task cost into $0.05/task while improving success rates.

The economics are simple math

Here's what you're paying per million tokens right now:

Model               Input    Output
Claude Opus 4.6     $5.00    $25.00
Claude Sonnet 4.6   $3.00    $15.00
Claude Haiku 4.5    $1.00    $5.00

Opus to Haiku is a 5x price gap on input, 5x on output. If 80% of your workload is execution (formatting, extraction, classification, simple generation) and 20% is planning (decomposition, routing, quality checks), you're paying Opus prices for Haiku-level work on four out of five calls.
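To make that split concrete, here's a back-of-envelope sketch. The 2,000-input / 500-output token call size is an assumption -- plug in your own numbers:

```python
# Per-million-token (input, output) prices from the table above.
PRICES = {"opus": (5.00, 25.00), "sonnet": (3.00, 15.00), "haiku": (1.00, 5.00)}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the listed per-million-token rates."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# Assumed workload shape: 2,000 input / 500 output tokens per call,
# 80% execution calls (Haiku) and 20% planning calls (Sonnet).
all_opus = call_cost("opus", 2000, 500)
split = 0.8 * call_cost("haiku", 2000, 500) + 0.2 * call_cost("sonnet", 2000, 500)

print(f"all-Opus per call: ${all_opus:.4f}")          # $0.0225
print(f"80/20 split:       ${split:.4f}")             # $0.0063
print(f"savings:           {1 - split / all_opus:.0%}")  # 72%
```

Even this conservative shape (planning on Sonnet, not Haiku) cuts roughly 70% before any caching tricks.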

BudgetMLAgent, published at AIMLSystems 2024 by TCS Research, demonstrated a 94.2% cost reduction by applying this exact split. Their system used cheap models for worker execution, cascaded to expensive models only on failure, and reserved "expert lifelines" for when the planner got stuck. Success rates went from 22.7% to 33% -- cheaper and better.

Why the hierarchy matters

The DeepMind study is the key piece most people miss. When you throw multiple agents at a problem without structure -- what the researchers call a "bag of agents" -- performance degrades past four agents. Agents leave 85% of extra compute budget untouched. They don't coordinate. They duplicate work. They contradict each other.

The supervisor-worker pattern prevents this because information flows one direction: down from the planner, up from the workers. The supervisor decomposes the task, assigns specific subtasks with clear acceptance criteria, and evaluates results. Workers never talk to each other. There's no negotiation, no consensus-building, no circular dependency.

Anthropic's own multi-agent research system uses this topology. Their orchestrator decomposes queries and spawns parallel subagents. It achieves 90.2% performance improvement over single-agent on their internal evals. Their finding worth memorizing: upgrading the orchestrator model produces larger gains than doubling the token budget. Put your money in the brain, not the hands.

Building it: the supervisor prompt

The supervisor's job is decomposition and routing. Here's how to prompt it.

The Prompt:

You are a task supervisor. You receive a user request and decompose it
into independent subtasks that can be executed by worker agents.

For each subtask, output a JSON object with:
- "task_id": sequential integer
- "instruction": a complete, self-contained instruction for a worker
  agent. The worker has NO context beyond this instruction.
- "model": "haiku" for straightforward execution tasks (extraction,
  formatting, classification, simple generation). "sonnet" for tasks
  requiring reasoning, judgment, or synthesis. Never assign "opus".
- "depends_on": list of task_ids that must complete before this task
  starts. Empty list if independent.
- "acceptance_criteria": how to verify the output is correct

Rules:
- Each instruction must be fully self-contained. Include all necessary
  context, data, and constraints IN the instruction.
- Prefer independent tasks over sequential chains.
- Never create more than 6 subtasks. If the request is simple enough
  for 1-2 subtasks, use 1-2.
- If the entire request can be handled in a single call, return a
  single subtask.

Output only valid JSON array. No commentary.

User request: {user_request}

Why This Works: The prompt forces the supervisor to make three decisions per subtask: what to do, which model tier handles it, and what "done" looks like. The self-containment rule is critical -- workers with partial context produce garbage. The cap at six subtasks prevents the coordination overhead that kills larger agent networks.

Expected Output:

[
  {
    "task_id": 1,
    "instruction": "Extract all company names, founding dates, and headquarters locations from the following text. Return as a JSON array of objects with keys 'company', 'founded', 'hq'. Text: [full text included here]",
    "model": "haiku",
    "depends_on": [],
    "acceptance_criteria": "Valid JSON array. Each object has all three keys. Dates in YYYY format."
  },
  {
    "task_id": 2,
    "instruction": "Write a 200-word competitive analysis comparing these companies based on the following structured data: [data from task 1]. Focus on market positioning and growth trajectory.",
    "model": "sonnet",
    "depends_on": [1],
    "acceptance_criteria": "Under 250 words. References specific data points. No generic filler."
  }
]

Building it: the orchestration loop

The orchestrator sits between the supervisor and workers. It handles dependency resolution, parallel dispatch, and failure cascading. Here's a minimal implementation using the Anthropic SDK.

The Prompt (system prompt for each worker):

You are a worker agent. Execute the following task exactly as
described. Do not add commentary, preamble, or explanation beyond
what the task requests.

If you cannot complete the task with the information provided, respond
with: {"status": "failed", "reason": "specific reason here"}

Otherwise, respond with: {"status": "complete", "result": your output}

Why This Works: Workers get a structured contract: succeed with a result, or fail with a reason. The failure signal flows back to the orchestrator, which can retry with a stronger model or report the failure to the supervisor for re-planning. No ambiguity, no partial results that look like success.
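The contract only holds if the orchestrator can actually parse it. A defensive wrapper -- a sketch, not part of the listing below -- treats any malformed reply as an explicit failure rather than letting it masquerade as success:

```python
import json

def parse_worker_output(raw: str) -> dict:
    """Parse a worker reply into the status contract.

    Anything that isn't valid JSON with a recognized 'status' key is
    downgraded to a failure, so garbled output never looks like a result.
    """
    try:
        output = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "failed", "reason": f"non-JSON response: {raw[:80]}"}
    if not isinstance(output, dict) or output.get("status") not in ("complete", "failed"):
        return {"status": "failed", "reason": "missing or invalid 'status' field"}
    return output
```

Dropping this in where the orchestration code calls json.loads on worker output also lets the cascade logic retry on parse failures, not just declared ones.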

Here's the Python orchestration logic:

import anthropic
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

client = anthropic.Anthropic()

MODEL_MAP = {
    "haiku": "claude-haiku-4-5-20251001",
    "sonnet": "claude-sonnet-4-6-20260320",
    "opus": "claude-opus-4-6-20260320",
}

WORKER_SYSTEM = """You are a worker agent. Execute the following task
exactly as described. No commentary beyond what the task requests.
If you cannot complete the task, respond with:
{"status": "failed", "reason": "specific reason"}
Otherwise respond with: {"status": "complete", "result": your output}"""


def plan(user_request: str) -> list[dict]:
    """Supervisor decomposes the request into subtasks."""
    response = client.messages.create(
        model=MODEL_MAP["sonnet"],
        max_tokens=2048,
        system=SUPERVISOR_SYSTEM_PROMPT,  # the prompt from above
        messages=[{"role": "user", "content": user_request}],
    )
    return json.loads(response.content[0].text)


def execute_task(task: dict, results: dict) -> dict:
    """Run a single worker task, injecting dependency results."""
    instruction = task["instruction"]
    for dep_id in task["depends_on"]:
        dep = results[dep_id]
        if dep["status"] != "complete":
            # A prerequisite failed even after cascading; propagate
            # the failure upstream instead of crashing on a missing key.
            return {"status": "failed", "task_id": task["task_id"],
                    "reason": f"dependency {dep_id} failed"}
        instruction += f"\n\nContext from prior task: {dep['result']}"

    response = client.messages.create(
        model=MODEL_MAP[task["model"]],
        max_tokens=4096,
        system=WORKER_SYSTEM,
        messages=[{"role": "user", "content": instruction}],
    )
    output = json.loads(response.content[0].text)

    # Cascade: retry with stronger model on failure
    if output["status"] == "failed" and task["model"] == "haiku":
        task["model"] = "sonnet"
        return execute_task(task, results)

    return {**output, "task_id": task["task_id"]}


def run(user_request: str) -> dict:
    """Full supervisor-worker execution loop."""
    tasks = plan(user_request)
    results = {}
    pending = {t["task_id"]: t for t in tasks}

    while pending:
        # Find tasks whose dependencies are all satisfied
        ready = [
            t for t in pending.values()
            if all(d in results for d in t["depends_on"])
        ]
        if not ready:
            # Nothing can run: the plan has a dependency cycle. Bail out
            # instead of looping forever (ThreadPoolExecutor also rejects
            # max_workers=0).
            raise RuntimeError(f"Unrunnable tasks remain: {sorted(pending)}")
        with ThreadPoolExecutor(max_workers=len(ready)) as pool:
            futures = {
                pool.submit(execute_task, t, results): t
                for t in ready
            }
            for future in as_completed(futures):
                result = future.result()
                results[result["task_id"]] = result
                del pending[result["task_id"]]

    return results

The key design decisions: parallel dispatch for independent tasks, serial execution for dependencies, and automatic model cascading on failure. The supervisor runs on Sonnet (not Opus -- Sonnet is good enough for decomposition in most cases). Workers default to Haiku. Failed Haiku tasks retry on Sonnet before reporting failure upstream.
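The dependency resolution above is just a repeated ready-set computation. Stripped of the API calls, here's how a plan resolves into parallel waves (the three-task plan is a hypothetical example):

```python
def execution_waves(tasks: list[dict]) -> list[list[int]]:
    """Group tasks into waves; every task within a wave can run in parallel."""
    done: set[int] = set()
    pending = {t["task_id"]: t for t in tasks}
    waves = []
    while pending:
        # A task is ready once all of its dependencies have completed.
        ready = [tid for tid, t in pending.items()
                 if all(d in done for d in t["depends_on"])]
        if not ready:
            raise RuntimeError("circular dependency in plan")
        waves.append(ready)
        done.update(ready)
        for tid in ready:
            del pending[tid]
    return waves

plan = [
    {"task_id": 1, "depends_on": []},
    {"task_id": 2, "depends_on": []},
    {"task_id": 3, "depends_on": [1, 2]},
]
print(execution_waves(plan))  # [[1, 2], [3]]
```

Tasks 1 and 2 dispatch in parallel on Haiku; task 3 waits for both, which is exactly the behavior the ThreadPoolExecutor loop produces.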

The plan caching trick

If your workload is repetitive -- and most production workloads are -- you can cut costs another 46% with plan caching. The idea: store successful task decompositions as templates. On new requests, check if a cached plan matches. Cache hit? Skip the supervisor entirely and route straight to workers.

The Prompt:

Compare the following user request against these cached plan
templates. If one matches with >90% structural similarity, return
its template_id. If none match, return "no_match".

Consider a match when: the request type is the same, the number and
nature of subtasks would be identical, and only the specific data
values differ.

Cached templates:
{templates_json}

New request: {user_request}

Why This Works: This turns the matching decision into a cheap classification task -- Haiku can handle it. On a match, you skip the Sonnet planning call entirely. Research on this approach shows 96.67% accuracy retention while cutting 46.6% of costs. For agencies running similar analyses across clients, or pipelines processing the same document types daily, most requests hit the cache after the first week.
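One way to wire the cache so exact-shape repeats don't even need the Haiku call: normalize the request into a structural signature and reserve the classifier for signature misses. The normalization below is a toy assumption (masking numbers and quoted data); anything subtler should fall through to the matching prompt above:

```python
import re

plan_cache: dict[str, list[dict]] = {}

def signature(request: str) -> str:
    """Toy structural signature: lowercase, mask numbers and quoted data.

    Catches exact-shape repeats for free; fuzzier matching belongs to
    the Haiku classifier.
    """
    sig = request.lower()
    sig = re.sub(r"\d+", "<num>", sig)
    sig = re.sub(r'"[^"]*"', "<data>", sig)
    return sig.strip()

def plan_with_cache(request: str, plan_fn) -> list[dict]:
    """Return a cached decomposition on a signature hit, else plan and store."""
    sig = signature(request)
    if sig in plan_cache:
        return plan_cache[sig]    # cache hit: skip the supervisor entirely
    tasks = plan_fn(request)      # cache miss: pay for planning once
    plan_cache[sig] = tasks
    return tasks
```

Here plan_fn would be the plan() function from the orchestration code; on a hit, the Sonnet planning call never happens.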

Where this breaks down

Don't use this pattern for everything. Single-turn questions don't need decomposition -- you're adding latency and cost for no benefit. Tasks requiring tight back-and-forth iteration (like debugging a specific error) work better with one capable model that holds full context.

The sweet spot, based on Anthropic's tested configurations: 2-5 worker agents handling 5-6 tasks each. More than five agents and coordination overhead eats your gains. Fewer than three tasks per agent and the spawning cost isn't justified.

Also watch your token budget. Anthropic's multi-agent research system uses roughly 15x more tokens than standard single-agent chat. The per-token cost is lower, but total token volume is higher. Run the math for your specific workload before committing.
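A neutral way to run that math: compare total dollars, not per-token rates. The 15x multiplier and the 80/20 tier split below come from this article; the baseline token counts are assumptions:

```python
# Per-million-token (input, output) prices from the pricing table.
PRICES = {"opus": (5.00, 25.00), "sonnet": (3.00, 15.00), "haiku": (1.00, 5.00)}

def blended_cost(tokens_in: int, tokens_out: int, mix: dict[str, float]) -> float:
    """Total cost when token volume is split across model tiers by `mix`."""
    total = 0.0
    for model, share in mix.items():
        inp, out = PRICES[model]
        total += share * (tokens_in / 1e6 * inp + tokens_out / 1e6 * out)
    return total

# Assumed baseline: a single-agent task using 10k input / 5k output tokens.
single = blended_cost(10_000, 5_000, {"opus": 1.0})
# Multi-agent: ~15x the tokens, 80% on Haiku workers, 20% on Sonnet planning.
multi = blended_cost(15 * 10_000, 15 * 5_000, {"haiku": 0.8, "sonnet": 0.2})

print(f"single-agent Opus: ${single:.3f}")  # $0.175
print(f"supervisor-worker: ${multi:.3f}")   # $0.735
```

With these assumed numbers the 15x volume swamps the cheaper rates -- which is exactly why this pattern pays off on decomposable batch workloads, not on tasks a single call would have handled.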

The real win

The supervisor-worker pattern isn't just a cost play. It's the topology that prevents the coordination failures that kill most multi-agent systems. One decision-maker, clear task boundaries, structured information flow, automatic fallback. The 94% cost reduction is the hook. The reliability is the reason you actually adopt it.

If your team wants to build multi-agent systems that work in production -- not just in demos -- Kief Studio runs hands-on training on agent architecture, prompt engineering for orchestration, and cost optimization patterns. Connect on Discord or book a session.