long-horizon agent prompting July 2, 2026 • 6 min read

Your Agent's Planner Is Fine. Its Aggregator Is What's Broken

ROMA hit state-of-the-art on SealQA by compressing every subtask tree before it bubbles up. Here are the four node prompts you can copy without adopting the framework.

Your agent works great for the first dozen steps. Then it starts repeating itself, forgetting what it already found, and confidently answering a question nobody asked. You blame the plan. You rewrite the planner prompt for the fifth time. It doesn't help.

The plan was never the problem. The problem is that every step dumps its raw output back into one running context, and by step 15 the model is drowning in its own transcript. That's context explosion, and no planner prompt fixes it.

A team at Sentient (with researchers from Virginia Tech, Berkeley, UC San Diego, and Maryland) built a framework called ROMA that treats this as the core failure mode instead of an afterthought. The paper formalizing the results landed in early 2026 (arXiv:2602.01848); the framework itself shipped in October 2025. The headline number is the one worth staring at.

Same model, triple the score

ROMA's base model, GLM-4.6, scores 14.5% on Seal-0, a benchmark that makes agents reason over conflicting web evidence. Wrap that exact same model in ROMA's structure and it hits 45.9%. That's roughly 3.1x, and the model never changed. No fine-tuning, no bigger window, no better base. Just structure.

For context on how good 45.9% is: Kimi-Researcher, the strongest prior open research agent, gets 36.0%. Perplexity Deep Research, the best closed system tested, gets 31.5%. Gemini 2.5 Pro lands at 19.8%. A scaffold turned a weak open model into the best result on the board.

If your planner were the bottleneck, better planning would close that 14.5% to 45.9% gap. It doesn't. Something downstream of the plan does. That something is the Aggregator.

The trick: verify and compress before you return

ROMA is a recursive loop of four node types: Atomizer, Planner, Executor, and Aggregator. A task comes in. The Atomizer acts as a gate and decides whether the task is small enough to just do, or big enough to break down. If it breaks down, the Planner splits it, each piece runs (possibly breaking down further), and then, before any result travels back up to the parent, an Aggregator node compresses and re-answers.

That last part is the whole game. The Aggregator does not hand the parent a pile of raw child outputs. It produces the answer to the parent's original question and throws the rest away. The parent's context stays bounded because it only ever receives a finished, verified summary, never the mess that produced it.

Compare that to what most teams do: let the full transcript grow, then summarize the whole thing when it gets too long. That naive compaction is lossy in ways you can't predict. A 2026 safety benchmark found that summarization-based compaction raised rule-violation rates from 0% (rule in full context) to 30% on average, and up to 59% for some models. Summarizing a global transcript silently deletes constraints. ROMA's per-subtree verification sidesteps this because each Aggregator checks its result against a local goal it can actually hold in view.

There's a counterintuitive point buried here. More retained context is worse, not better. Independent 2026 research measured accuracy dropping by 13.9% to 85% as context grows, even when every relevant fact stays present. A full 1M-token window of noise loses to a curated 100K window. The goal is minimum sufficient context, not maximum. "Just buy a bigger window" is the wrong instinct.

The four nodes you can copy today

ROMA ships these as typed DSPy signatures, not verbose system prompts. That's good news, because it means the structure is small enough to transcribe into any agent loop by hand. Here are the four contracts as prompts you can paste.

1. The Atomizer decides whether to decompose or just execute. Run it on a cheap model at temperature 0 so the classification is as repeatable as the model allows.

You are a task classifier. Given a goal and its context, decide
whether the goal is ATOMIC (can be completed in a single execution
step with the tools available) or requires PLANNING (must be broken
into subtasks first).

Goal: {goal}
Context: {context}

Return JSON:
{
  "is_atomic": true | false,
  "node_type": "EXECUTE" | "PLAN",
  "reason": "one sentence"
}

Bias toward EXECUTE. Only choose PLAN if the goal genuinely needs
multiple independent steps or tools.

Why This Works: Deciding "should I break this down" before spending tokens is what stops runaway decomposition. Cheap model, temperature 0, binary output. It's a gate, not a thinker.

Expected Output:

{"is_atomic": false, "node_type": "PLAN", "reason": "Comparing three vendors requires separate retrieval per vendor before synthesis."}
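
A gate only works if your loop reads its answer strictly. Here's a minimal sketch of parsing the Atomizer's reply in Python. The function name `parse_atomizer_reply` and the fallback rule are assumptions for illustration, not part of ROMA: on malformed or self-contradictory output it defaults to EXECUTE, mirroring the prompt's stated bias.

```python
import json

VALID_NODE_TYPES = {"EXECUTE", "PLAN"}


def parse_atomizer_reply(raw: str) -> str:
    """Return "EXECUTE" or "PLAN" from the Atomizer's JSON reply.

    Assumption: any malformed or inconsistent reply falls back to
    "EXECUTE", so a bad classification never triggers decomposition.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "EXECUTE"
    if not isinstance(data, dict):
        return "EXECUTE"

    node_type = data.get("node_type")
    is_atomic = data.get("is_atomic")
    if node_type not in VALID_NODE_TYPES or not isinstance(is_atomic, bool):
        return "EXECUTE"

    # The two fields must agree: atomic goals execute, the rest plan.
    if is_atomic != (node_type == "EXECUTE"):
        return "EXECUTE"
    return node_type
```

Falling back to EXECUTE is a design choice, not a rule: if your tasks are usually large, you may prefer to retry the classifier instead.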

2. The Planner turns a non-atomic goal into a MECE subtask graph. MECE means mutually exclusive, collectively exhaustive: no two subtasks overlap, and together they fully cover the goal. That discipline is what lets independent branches run in parallel instead of stepping on each other.

Break the goal into subtasks that are Mutually Exclusive (no overlap)
and Collectively Exhaustive (together they fully cover the goal).

Goal: {goal}
Context: {context}

For each subtask assign a task_type from exactly this set:
RETRIEVE, WRITE, THINK, CODE, IMAGE.

Return JSON:
{
  "subtasks": [
    {"id": "s1", "goal": "...", "task_type": "RETRIEVE"},
    {"id": "s2", "goal": "...", "task_type": "THINK"}
  ],
  "dependencies": {"s2": ["s1"]}
}

The dependencies map lists which subtasks must finish before each one
starts. Subtasks that do not depend on each other, directly or
indirectly, will run in parallel, so make sure they do not need each
other's output.

Why This Works: The dependency map is what enables sibling parallelism. Forcing MECE up front means the branches are genuinely independent, so you can fan them out without one branch needing a result another branch is still computing.

Expected Output:

{"subtasks": [{"id":"s1","goal":"Retrieve pricing for vendor A","task_type":"RETRIEVE"}, {"id":"s2","goal":"Retrieve pricing for vendor B","task_type":"RETRIEVE"}, {"id":"s3","goal":"Compare and recommend","task_type":"THINK"}], "dependencies": {"s3":["s1","s2"]}}
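
To see how the dependency map turns into parallel work, here's a rough sketch that groups the Planner's subtasks into "waves" you could fan out together. It uses Python's standard-library `graphlib`; the function name `parallel_waves` is hypothetical, and ROMA's own scheduler may work differently.

```python
from graphlib import TopologicalSorter


def parallel_waves(plan: dict) -> list[list[str]]:
    """Group subtask ids into waves; ids in one wave can run in parallel.

    `plan` is the Planner's JSON, already parsed into a dict.
    Raises ValueError on unknown ids and graphlib.CycleError on cycles.
    """
    ids = [subtask["id"] for subtask in plan["subtasks"]]
    deps = plan.get("dependencies", {})

    # Every id mentioned in the dependency map must be a real subtask.
    mentioned = set(deps) | {d for targets in deps.values() for d in targets}
    unknown = mentioned - set(ids)
    if unknown:
        raise ValueError(f"dependencies reference unknown ids: {sorted(unknown)}")

    # TopologicalSorter expects {node: predecessors}.
    sorter = TopologicalSorter({i: deps.get(i, []) for i in ids})
    sorter.prepare()

    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything unblocked right now
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Run on the expected output above, this yields two waves: `s1` and `s2` together, then `s3` once both have finished.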

3. The Executor does one atomic subtask. It can be an LLM call, an API hit, or another agent. Keep its budget tight.

Complete this single task. Use tools as needed. Do not attempt work
outside the stated goal.

Goal: {goal}
Context: {context}

Return JSON:
{
  "output": "the result",
  "sources": ["url or tool call reference", ...]
}

Why This Works: One task, explicit source tracking, a hard scope boundary. In ROMA's software-engineering profile this node runs ReAct for up to 15 tool iterations under a 64K-token budget and a 300-second timeout. It's the heaviest node, so bounding it matters.
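
If you're rolling your own Executor, the bounding can be as simple as a loop with two stop conditions. The sketch below is an assumption-heavy illustration, not ROMA's implementation: `run_bounded`, the `step` callback, and the extra `stopped` field are all hypothetical names. It enforces the iteration cap and the wall-clock timeout from the numbers above; the token budget is left out, and the timeout is only checked between steps, so it won't interrupt a step that hangs.

```python
import time
from typing import Callable, Optional


def run_bounded(
    step: Callable[[int], Optional[dict]],
    max_iterations: int = 15,
    timeout_s: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call `step(i)` until it returns a result or a budget runs out.

    `step` is a hypothetical callback: it performs one tool iteration
    and returns {"output": ..., "sources": [...]} when finished, or
    None when it needs another iteration.
    """
    deadline = clock() + timeout_s
    for i in range(max_iterations):
        # Checked between steps only; a single slow step can overrun.
        if clock() >= deadline:
            return {"output": None, "sources": [], "stopped": "timeout"}
        result = step(i)
        if result is not None:
            return {**result, "stopped": "done"}
    return {"output": None, "sources": [], "stopped": "max_iterations"}
```

Returning an explicit stop reason, rather than raising, lets the Aggregator report the missing piece as a gap instead of crashing the whole tree.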

4. The Aggregator is the node that saves you. It compresses child results into an answer to the parent's goal, not a transcript of what the children did.

Several subtasks have completed. Produce the answer to the ORIGINAL
parent goal. Do not list the subtask outputs. Synthesize them into a
single result that directly answers the parent goal, and drop any
detail the parent does not need.

Original parent goal: {original_goal}
Subtask results: {subtasks_results}
Context: {context}

Then verify your own answer:
- Does it fully address the original goal? If not, say what is missing.
- Is any claim unsupported by the subtask results? If so, remove it.

Return JSON:
{
  "synthesized_result": "the compressed answer to the parent goal",
  "verified": true | false,
  "gaps": "what is still missing, or empty"
}

Why This Works: This is the compression step that keeps the parent's context from exploding. By re-answering the original goal instead of concatenating children, the parent receives a bounded result no matter how much work happened below. The self-verification block is ROMA's optional Verifier folded in: it checks the aggregate against the local goal before anything travels upward, which is exactly the check that whole-transcript summarization skips.
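
The other half of the trick lives in your own loop: only the synthesized result should reach the parent. A minimal sketch, assuming the JSON shape from the prompt above; the `Aggregate` dataclass and both function names are made up for illustration. It also treats any reply that lists gaps as unverified, which is a stricter rule than the prompt itself states.

```python
import json
from dataclasses import dataclass


@dataclass
class Aggregate:
    result: str
    verified: bool
    gaps: str


def parse_aggregate(raw: str) -> Aggregate:
    """Parse the Aggregator's JSON reply into a typed record."""
    data = json.loads(raw)
    result = data["synthesized_result"]
    verified = data["verified"]
    gaps = str(data.get("gaps") or "").strip()
    if not isinstance(result, str) or not isinstance(verified, bool):
        raise ValueError("aggregator reply has the wrong field types")
    # Assumption: a reply that lists gaps cannot count as verified.
    if gaps:
        verified = False
    return Aggregate(result=result, verified=verified, gaps=gaps)


def to_parent_context(agg: Aggregate) -> str:
    """Build the only text the parent sees: the answer, plus known gaps."""
    if agg.verified:
        return agg.result
    return f"{agg.result}\n[UNVERIFIED] Missing: {agg.gaps or 'unspecified'}"
```

Note what never appears in `to_parent_context`: the raw subtask results. Those stay in the child's scope and are dropped once the aggregate is built.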

What actually does the work

ROMA tunes the exact wording of these prompts automatically with a genetic optimizer, so don't treat the phrasing above as sacred. The structure is what matters, and you now have it: a gate that decides before spending, a planner that emits parallel-safe branches, executors on tight budgets, and an Aggregator that compresses and verifies before returning.

Wire those four into a recursive loop and your context stops growing linearly with every step. The parent only ever sees finished answers. That's the difference between an agent that rots at step 15 and one that holds together at step 50.
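
For a sense of how little wiring that takes, here's a bare-bones sketch of the recursion. Everything in it is an assumption for illustration: the four nodes are passed in as plain callables (in practice, wrappers around the prompts above), subtasks run one after another rather than in parallel, dependencies are ignored, and `max_depth` is a safety cap I added, not a documented ROMA setting.

```python
from typing import Callable, List


def solve(
    goal: str,
    context: str,
    atomize: Callable[[str, str], str],        # returns "EXECUTE" or "PLAN"
    plan: Callable[[str, str], List[str]],     # returns ordered sub-goals
    execute: Callable[[str, str], str],        # returns one task's output
    aggregate: Callable[[str, List[str], str], str],
    depth: int = 0,
    max_depth: int = 3,
) -> str:
    """Recursively solve `goal`, returning only a compressed answer."""
    # Gate first: execute directly if atomic or if the depth cap is hit.
    if depth >= max_depth or atomize(goal, context) == "EXECUTE":
        return execute(goal, context)

    child_results = []
    for sub_goal in plan(goal, context):
        # Each child returns a finished answer, never its transcript.
        child_results.append(
            solve(sub_goal, context, atomize, plan, execute, aggregate,
                  depth + 1, max_depth)
        )

    # Raw child results go out of scope here; only the aggregate travels up.
    return aggregate(goal, child_results, context)
```

The point of the shape is the return type: every level hands back a single string, so what a parent holds is bounded by the number of children it has, not by how much work happened beneath them.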

Start with the Aggregator. It's the one node most homegrown agent loops are missing entirely, and it's the one carrying that 3.1x.

If your team is building long-horizon agents and watching them fall apart past the first few steps, we run live prompt engineering training that covers exactly this kind of recursive decomposition and context control. Connect with Kief Studio on Discord or schedule a session.
