
3 April 2026 · 12 min read · By Mark Laursen

Why Your Multi-Agent AI System Keeps Failing (And What the Research Actually Says)

In December 2025, Google DeepMind published the largest empirical study of multi-agent LLM systems to date. 180 configurations. Four benchmarks. Five architectures. Three model families. The results confirmed what practitioners had been suspecting: most multi-agent systems perform worse than a single agent doing the same work.

Not slightly worse. Measurably, reproducibly worse.

This was not an isolated finding. Two months earlier, the MAST study (NeurIPS 2025 Spotlight) had analyzed 1,642 execution traces across seven major multi-agent frameworks and found failure rates between 41% and 86.7%. The largest single failure category was coordination breakdowns, at 36.9%.

The failures are architectural, not intelligence-level. The same model that fails in a multi-agent setup often succeeds when running alone. That is worth sitting with for a moment. Adding more agents to a problem can make a capable model perform worse.

I spent the last year studying this literature, compiling over 700 sources across computer science, library science, safety engineering, and knowledge theory. This post covers what I found, why it matters, and what I built as a result.

WHERE MULTI-AGENT FAILURES COME FROM (MAST, NeurIPS 2025)

Failure Category | Share | Source
Coordination Breakdowns | 36.9% | MAST, NeurIPS 2025
Specification + Other Issues | 42.1% | MAST, NeurIPS 2025
Model Capability | 21.0% | MAST, NeurIPS 2025

The Three Scaling Laws from DeepMind

The DeepMind study produced three empirical scaling laws that should inform every multi-agent design decision.

1. The Tool-Coordination Tradeoff

Tasks involving 16 or more tools suffer disproportionately from multi-agent overhead. This was the strongest predictor in the entire model, with a 57% larger effect than the next strongest variable. When agents need many tools, adding more agents makes the problem worse, not better. The tools themselves become a coordination challenge.

If your domain involves CRM, email, enrichment APIs, analytics, compliance databases, and a dialer, you are already past the 16-tool threshold. Splitting that across multiple agents compounds the confusion rather than distributing it.

2. Capability Saturation at 45%

Multi-agent coordination yields diminishing or negative returns once a single agent can handle the task class at roughly 45% success rate. Below that threshold, coordination helps. Above it, the overhead eats the gains.

The practical implication: before splitting work across agents, measure how well one agent handles it. If it is already decent, adding more agents will not help. Multi-agent coordination is a solution for tasks that are genuinely too hard for one agent, not a general-purpose performance boost.
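The saturation rule reduces to a one-line decision gate. A minimal sketch, assuming you have already measured single-agent success on the task class; the function name and constant are mine, the 45% threshold comes from the DeepMind finding above:

```python
# Hedged sketch of the 45% saturation rule; names are illustrative.
SATURATION_THRESHOLD = 0.45

def should_split(single_agent_success_rate, parallelizable):
    """Split across agents only when the task class is parallelizable AND
    a single agent demonstrably struggles with it."""
    return parallelizable and single_agent_success_rate < SATURATION_THRESHOLD

print(should_split(0.62, parallelizable=True))   # False: one agent is already decent
print(should_split(0.30, parallelizable=True))   # True: struggling and parallel
```

Measure first, then decide; the gate is deliberately biased toward staying single-agent.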

3. Topology-Dependent Error Amplification

Independent agents with no coordination amplify errors 17.2x. Centralized coordination (one planner directing specialists) contains error amplification to 4.4x. The architecture determines how fast errors cascade.

ERROR AMPLIFICATION BY ARCHITECTURE (DeepMind 2025)

Architecture | Error Amplification | Source
Independent Agents (no coordination) | 17.2x | DeepMind 2025
Centralized Coordination (planner + specialists) | 4.4x | DeepMind 2025
Single Agent (baseline) | 1x | DeepMind 2025

Never deploy independent agents without a coordination layer. Even a lightweight central coordinator dramatically reduces error amplification. Fully decentralized swarm architectures without structure are the worst-performing pattern in the research.

Three Agents Beat Seven

DyLAN (COLM 2024) demonstrated that an optimized team of three agents outperforms a team of seven on the same tasks. Pruning from seven to four agents improved performance while cutting token costs by 52.9-67.8%.

The mechanism is an Agent Importance Score, inspired by neuron importance scoring in neural networks. Agents that do not measurably contribute to outcomes get pruned. The system gets better by getting smaller.

Combined with the DeepMind finding that coordination gains plateau beyond 3-4 agents, the message is clear: the right number of agents is almost always fewer than you think.
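The pruning mechanic can be sketched with static contribution scores standing in for DyLAN's learned Agent Importance Score; agent names and values here are illustrative, not from the paper:

```python
# Hedged sketch of importance-based pruning in the spirit of DyLAN.
def prune_team(scores, keep=3):
    """Keep only the agents whose measured contribution ranks highest."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

scores = {"planner": 0.41, "coder": 0.35, "critic": 0.18,
          "cheerleader": 0.02, "relay": 0.02, "echo": 0.01, "scribe": 0.01}
print(prune_team(scores))  # ['planner', 'coder', 'critic']
```

Agents that do not move outcomes get cut; the team improves by shrinking.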

[Figure: Performance vs. number of agents. Performance peaks at three agents; beyond that point, coordination overhead exceeds the gains.]

Where Multi-Agent Actually Helps (And Where It Hurts)

The research does not say multi-agent systems are useless. It says the right architecture is domain-structure-dependent. The wrong architecture is one-size-fits-all.

Highly parallel tasks like research, financial analysis, and data enrichment benefit significantly from centralized multi-agent coordination. DeepMind measured an 80.9% improvement over single-agent baselines on parallelizable work.

Sequential reasoning tasks like planning, proof construction, and step-by-step workflows should stay single-agent. Every multi-agent variant degraded performance by 39-70% on sequential reasoning. This is not a marginal effect.

Tool-heavy domains with 16+ tools need single-agent execution or very structured delegation with explicit tool ownership per agent. Shared tool access compounds the coordination problem.

Creative and review tasks benefit from adversarial dynamics where agents with different roles challenge each other’s output. A single agent cannot credibly argue both sides.

Open-ended exploration is best served by Voyager’s skill library pattern (NeurIPS 2023): executable skills indexed by semantic embedding, retrieved via similarity search, and composed into complex behaviors. New skills are added through an automatic curriculum where the system identifies what it cannot do and learns it. This is the strongest model for capability organization in the literature.

The Six Criteria for Splitting an Agent

The combined research produces exactly six criteria that justify creating a separate agent instead of keeping a capability as a sub-skill. If none of these criteria are met, the capability belongs in a skill library, not a new agent.

  1. Persistent private state required across interactions. If the capability is stateless (inputs to outputs), it is a sub-skill or tool. If it maintains state that other agents should not access or modify, it may need isolation. This is Goldberg’s (2024) minimal differentiating property of true multi-agency.

  2. Fundamentally different toolset that would exceed 10-15 tools per agent. Tool confusion is a primary trigger for splitting. An agent managing CRM, email, enrichment, analytics, dialers, and compliance databases is holding too many tools in working context.

  3. Parallelizable AND single-agent success below 45%. Both conditions must be true. If the task is parallelizable but one agent already handles it well, do not split. If it is parallelizable AND one agent struggles, multi-agent coordination helps up to 80.9%.

  4. Adversarial or review dynamics needed. When two roles have different incentive structures (coder vs. reviewer, sales vs. compliance, writer vs. editor), they should be separate agents. A single agent cannot credibly argue both sides.

  5. Structured intermediate outputs needed between phases. MetaGPT’s SOP-driven approach with structured handoff artifacts scored 3.9/4 vs ChatDev’s 2.1/4. When the quality of the handoff between two phases matters enough to define a formal schema, that is often a natural agent boundary.

  6. Different model capabilities required. If one task needs vision and another needs code generation and another needs long-context reasoning, they may need different underlying models and therefore different agents.
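The six criteria can be encoded directly as a decision gate. A hedged sketch: the field names are my own invention, while the 45% and 15-tool thresholds come from the criteria above:

```python
# Hedged sketch of the six-criteria split gate; field names are illustrative.
from dataclasses import dataclass

@dataclass
class SplitAssessment:
    needs_private_state: bool      # 1. persistent private state
    tool_count: int                # 2. toolset size
    parallelizable: bool           # 3a. parallelizable...
    single_agent_success: float    # 3b. ...AND single agent struggling
    adversarial_roles: bool        # 4. review/adversarial dynamics
    formal_handoff_schema: bool    # 5. structured phase handoffs
    different_model_needed: bool   # 6. different model capabilities

def justify_split(a):
    """True only when at least one criterion is met; otherwise the
    capability belongs in a skill library, not a new agent."""
    return any([
        a.needs_private_state,
        a.tool_count > 15,
        a.parallelizable and a.single_agent_success < 0.45,
        a.adversarial_roles,
        a.formal_handoff_schema,
        a.different_model_needed,
    ])
```

The default answer is no split; every criterion has to be argued for with a measurement or a structural fact.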

The Router is a Facilitator, Not a Boss

The most common multi-agent anti-pattern is a “manager” agent that evaluates and overrides subordinate agents. This creates the exact bottleneck that multi-agent systems were supposed to eliminate.

[Figure: Hierarchical Boss (anti-pattern) vs. Thin Router + Peers (research-backed). Boss pattern: every message is relayed through the boss agent, introducing interpretation drift at each step and a single point of failure. Router pattern: the router selects the graph and steps back; peers read and write structured fields on a shared context bus, with no interpretation drift and no single point of failure.]

MAST’s failure analysis found that coordination breakdowns are the single largest failure category. The central relay pattern, where every message passes through a boss agent that reinterprets it, is the primary driver. Each relay step introduces interpretation drift.

The research-backed alternative: the router classifies the inbound task, selects the right execution graph, initiates execution, and then steps back. Sub-skills execute as peers, communicating through a shared structured state (context bus) rather than relaying messages through a central agent. The router never rewrites output, never overrides domain judgments, and never acts as a communication relay.

This is a fundamentally different model from hierarchical orchestration. There is no boss. There is a switchboard operator.
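A minimal sketch of the facilitator pattern. Everything here is illustrative: the graph names, the bus fields, and the keyword classify() stub standing in for what would really be an LLM call:

```python
# Hedged sketch of a thin "switchboard" router.
GRAPHS = {
    "simple":  ["triage"],
    "complex": ["triage", "enrich", "draft", "review"],
}

def classify(task):
    # Placeholder classifier; a real system would use an LLM call here.
    return "complex" if "conflict" in task else "simple"

def route(task, bus):
    """Select the execution graph, write shared state, then step back.
    The router never relays messages or rewrites peer output."""
    graph = GRAPHS[classify(task)]
    bus["task"] = task
    bus["active_graph"] = graph
    return graph

bus = {}
route("resolve a data conflict between CRM and enrichment", bus)
print(bus["active_graph"])  # ['triage', 'enrich', 'draft', 'review']
```

After route() returns, the router's job is done; the peers in the selected graph communicate only through the bus.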

Contribution-Based Influence Replaces Role-Based Authority

In hierarchical systems, the “senior” agent always has authority regardless of whether its decisions improve outcomes. This is the org-chart fallacy applied to LLMs.

DyLAN’s Agent Importance Score and SELFORG’s Shapley-based contribution estimation point to a different model: influence earned by measured outcomes, not assigned by role. Sub-skills that consistently improve task completions get higher weight in graph construction. Sub-skills that add noise get pruned. No sub-skill is permanently dominant.

Authority flows to wherever value is actually created, and shifts when conditions change.
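One simple stand-in for Shapley-style contribution estimation is an exponential moving average over measured outcomes. This is my simplification, not SELFORG's method:

```python
# Hedged sketch: an EMA over task outcomes stands in for Shapley-based
# contribution estimation. The learning rate is illustrative.
def update_weight(current, contributed_to_success, lr=0.1):
    """Nudge a sub-skill's influence weight toward its measured
    contribution. No weight is permanent: it rises and falls with outcomes."""
    target = 1.0 if contributed_to_success else 0.0
    return (1 - lr) * current + lr * target

w = 0.5
w = update_weight(w, contributed_to_success=True)   # -> 0.55
w = update_weight(w, contributed_to_success=False)  # -> 0.495
```

Sub-skills whose weights decay toward zero are candidates for pruning; none holds authority by role alone.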

Structured Communication, Not Free-Form Messaging

MetaGPT’s SOP-driven approach with structured intermediate outputs between agents scored 3.9 out of 4 in quality evaluations. ChatDev’s unstructured approach scored 2.1. Almost double the quality from structured handoffs alone.

The principle: when work passes between agents or sub-skills, the handoff must be a structured artifact with a defined schema, not a free-text summary. Without structure, the receiving agent must interpret the sending agent’s output, and that interpretation drift is the mechanism behind 36.9% of all multi-agent failures.

A structured context bus where sub-skills read and write defined fields eliminates the interpretation step entirely. The schema is the architecture. If it is not in the schema, sub-skills cannot communicate it. This prevents the echo chamber problem where fully connected topologies with free-form messaging flood each other with unstructured noise.

Task-Adaptive Graphs: Matching Complexity to Coordination

Not every task needs every sub-skill activated. Running a full coordination mesh on a simple task wastes tokens and introduces unnecessary overhead. Running a minimal graph on a complex task drops accuracy because critical sub-skills are not participating.

DyLAN showed up to 25% accuracy improvement from task-adaptive topology selection alone. The system should select from at least three tiers:

Simple (2-3 sub-skills, linear chain) for routine tasks with clear input. Sequential handoff, no lateral communication needed.

Standard (3-5 sub-skills, parallel branches) for moderately complex tasks with some ambiguity. Parallel sub-skills read and write the context bus simultaneously.

Complex (all sub-skills, full lateral mesh) for edge cases and conflicting signals. Every sub-skill active, resolving conflicts in real-time through the bus.

The topology itself should evolve. Track which tier produces the best outcomes for which task types. If the complex graph consistently matches the simple graph’s results for a certain class, demote it to save tokens. If the simple graph produces failures for a certain class, promote it. The graph adapts to measured reality.
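The promote/demote loop above can be sketched as: pick the cheapest tier whose measured success for a task class is within tolerance of the best tier. The tier names, the 0.02 tolerance, and the outcomes store are my assumptions, not DyLAN's code:

```python
# Hedged sketch of outcome-driven tier selection.
TIERS = ["simple", "standard", "complex"]  # ordered cheapest to most capable

def select_tier(task_class, outcomes):
    """Return the cheapest tier whose measured success rate is within
    tolerance of the best tier: demotion for free, promotion when
    simpler graphs measurably fail."""
    best_score = max(outcomes.get((task_class, t), 0.0) for t in TIERS)
    for tier in TIERS:  # cheapest first
        if outcomes.get((task_class, tier), 0.0) >= best_score - 0.02:
            return tier

outcomes = {
    ("triage", "simple"): 0.90, ("triage", "standard"): 0.91, ("triage", "complex"): 0.91,
    ("edge", "simple"): 0.50, ("edge", "standard"): 0.60, ("edge", "complex"): 0.80,
}
print(select_tier("triage", outcomes))  # simple: complex adds nothing, demote
print(select_tier("edge", outcomes))    # complex: simpler graphs fail, promote
```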

Designing for Failure

GTD (Guided Topology Diffusion, 2025) showed only 0.3% accuracy degradation under agent failure with redundant topologies, versus 13% for fixed structures. Resilience is a structural property, not a feature bolted on after the fact.

Every sub-skill should produce valid partial output independently. If one sub-skill fails, the graph degrades to a simpler tier rather than halting. The context bus persists state, so even if the router restarts, sub-skills with active bus connections continue operating in degraded mode. And degraded operations are logged and surfaced, never silently swallowed.
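A sketch of that degradation path, assuming graphs ordered from most to least capable and a runner that raises RuntimeError on sub-skill failure; the names are illustrative:

```python
# Hedged sketch of tiered graceful degradation.
def run_with_degradation(graphs, run_graph, log):
    """Try each graph in order; on failure, log and fall back to a simpler
    tier instead of halting. Degradations are surfaced, never swallowed."""
    for graph in graphs:
        try:
            return run_graph(graph)
        except RuntimeError as err:
            log(f"degrading: {graph} failed ({err})")
    return None  # every tier failed; the caller decides what happens next
```

The log callback is the "surfaced, never silently swallowed" requirement made concrete: every fallback leaves a trace.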

The Domain Decomposition Problem

Building good agents requires knowing what they need to do. This is harder than it sounds.

The research on multi-dimensional domain decomposition, drawing from Ranganathan’s faceted classification (1933) in library science, Knowledge Space Theory (Doignon & Falmagne, 1999), and cognitive task analysis (Crandall, Klein & Hoffman, 2006), reveals that professional domains are not trees. They are multi-dimensional coordinate systems.

Ranganathan’s core insight was that any subject can be decomposed along independent, orthogonal axes. The Classification Research Group later expanded his five categories to thirteen. Spiteri’s 1998 simplification established the operational test: each axis must be differentiating, relevant, ascertainable, permanent, homogeneous, and mutually exclusive. The Cartesian product of all axis values generates the complete space of possible descriptions.

Applied to agent construction, this means: instead of asking “what should this agent do?” (which produces a flat list), you identify the independent axes of a domain (entities, activities, knowledge types, stakeholders, lifecycle phases, context, tools, governance) and populate each independently. The intersections of those axes define specific capabilities. Each populated intersection is a potential agent skill.
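The Cartesian-product step is mechanical. A sketch with invented axes for a sales-like domain; the axes and their values are illustrative, not a real decomposition:

```python
# Hedged sketch of faceted decomposition: each intersection in the
# Cartesian product is a candidate capability to review, not a confirmed skill.
from itertools import product

axes = {
    "entity":   ["lead", "account"],
    "activity": ["enrich", "qualify", "outreach"],
    "phase":    ["intake", "active"],
}

# Populate each axis independently, then take the product of all values.
candidate_skills = [dict(zip(axes, combo)) for combo in product(*axes.values())]
print(len(candidate_skills))  # 2 * 3 * 2 = 12 intersections to review
```

Most intersections will be pruned as irrelevant; the value of the product is that nothing is omitted by accident.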

Completeness verification then borrows from safety engineering: HAZOP guide words (NO, MORE, AS WELL AS, OTHER THAN, PART OF) applied systematically to every category in the decomposition, pre-mortem analysis generating 25-30% more relevant concerns than direct questioning, and corpus-based coverage checking against job postings, certifications, and domain literature to flag unmapped terms.
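The HAZOP pass can also be generated mechanically: cross every guide word with every decomposition category to produce the review checklist. A minimal sketch with invented categories:

```python
# Hedged sketch: pairing HAZOP guide words (IEC 61882) with decomposition
# categories to generate completeness-review prompts.
GUIDE_WORDS = ["NO", "MORE", "AS WELL AS", "OTHER THAN", "PART OF"]

def review_checklist(categories):
    """Every (guide word, category) pair is one perturbation to consider."""
    return [(word, cat) for cat in categories for word in GUIDE_WORDS]

items = review_checklist(["lead enrichment", "compliance checks"])
print(len(items))  # 5 guide words x 2 categories = 10 review prompts
```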

This is not theoretical. I built this methodology into a working system grounded in 700+ sources and used it to construct production agents.

What I Built: Maestro

I released Maestro as an open-source implementation of these principles. It is a zero-dependency multi-agent orchestrator that drops into Claude Code, Codex, or Cursor as a single file.

Maestro implements the architecture the research points to:

  • Decision Gate that evaluates whether a task actually needs multiple agents, biased toward single-agent to prevent unnecessary coordination overhead
  • Planner Agent that decomposes complex tasks into parallel and sequential work
  • Specialist Agents with a hard ceiling of 4 per parallel group, based on the DeepMind and DyLAN findings on coordination plateaus
  • Cross-Talk Routing for communication between specialists when outputs affect one another
  • Staff Engineer Review that performs adversarial final verification

The orchestrator handles routing exclusively. It does not plan, does not perform work, and does not review output. That separation prevents the coordination failures that MAST identified as the dominant failure class.

Desktop implementations achieve wall-clock speedups through parallel execution. Cloud deployments remain token-efficient for complex tasks where a single agent would need 15+ sequential messages. The system ships as CLAUDE.md for Claude Code, AGENTS.md for tool-agnostic compatibility, and .cursorrules for Cursor.

The Research Papers

For anyone who wants to go deeper, these are the studies that informed this work:

Paper | Year | Venue | Key Finding
MAST | 2025 | NeurIPS Spotlight | 41-87% failure rates; 79% from coordination issues; 36.9% coordination breakdowns
DyLAN | 2024 | COLM | 3 agents outperform 7; 25% accuracy improvement from dynamic topology
DeepMind Multi-Agent Scaling Study | 2025 | arXiv | 3 scaling laws: tool-coordination tradeoff, 45% saturation, error amplification
Voyager | 2023 | NeurIPS | Skill library: executable skills indexed by embedding, composed dynamically
GTD (Guided Topology Diffusion) | 2025 | arXiv | 0.3% degradation under failure with redundant topologies vs 13% for rigid
SELFORG | 2025 | arXiv | Self-organizing communication graphs; Shapley-based contribution estimation
MetaGPT | 2023-2024 | | SOP-driven structured handoffs: 3.9/4 vs unstructured 2.1/4
Ranganathan Faceted Classification | 1933 | | Domains as orthogonal axes, not hierarchies; Cartesian product completeness
Knowledge Space Theory | 1999 | | Prerequisite dependency lattices for domain knowledge states
HAZOP (IEC 61882) | | | Systematic perturbation guide words for completeness verification
EVE Framework | 2025 | | Multi-query union improves recall up to 24% over single prompts

The One-Sentence Version

Build the simplest architecture that works, split only when quantitative criteria demand it, communicate through structured schemas, match coordination topology to task complexity, and evolve everything through measured outcomes.

Maestro is open source on GitHub. MIT licensed. Zero dependencies. Drop it in and go.


Disclosure: I am the creator of Maestro. This post is grounded in peer-reviewed research that I studied independently before building Maestro, but I have a direct interest in the architectural approach it implements. The research papers are linked above — read them and draw your own conclusions.

Mark Laursen

Advisor, founder, and executive producer with 25+ years building technology companies, gaming platforms, and entertainment products. Based in Portugal.