
3 April 2026 · 12 min read · By Mark Laursen

Why Your Multi-Agent AI System Keeps Failing (And What the Research Actually Says)

In December 2025, Google DeepMind published the largest empirical study of multi-agent LLM systems to date. 180 configurations. Four benchmarks. Five architectures. Three model families. The results confirmed what practitioners had been suspecting: most multi-agent systems perform worse than a single agent doing the same work.

Not slightly worse. Measurably, reproducibly worse.

This was not an isolated finding. Two months earlier, the MAST study (NeurIPS 2025 Spotlight) had analyzed 1,642 execution traces across seven major multi-agent frameworks and found failure rates between 41% and 86.7%. The largest single failure category was coordination breakdowns, at 36.9%.

The failures are architectural, not intelligence-level. The same model that fails in a multi-agent setup often succeeds when running alone. That is worth sitting with for a moment. Adding more agents to a problem can make a capable model perform worse.

I spent the last year studying this literature, compiling over 700 sources across computer science, library science, safety engineering, and knowledge theory. This post covers what I found, why it matters, and what I built as a result.

WHERE MULTI-AGENT FAILURES COME FROM (MAST, NeurIPS 2025)

Failure Category | Share | Source
Coordination Breakdowns | 36.9% | MAST, NeurIPS 2025
Specification + Other Issues | 42.1% | MAST, NeurIPS 2025
Model Capability | 21.0% | MAST, NeurIPS 2025

The Three Scaling Laws from DeepMind

The DeepMind study produced three empirical scaling laws that should inform every multi-agent design decision.

1. The Tool-Coordination Tradeoff

Tasks involving 16 or more tools suffer disproportionately from multi-agent overhead. This was the strongest predictor in the entire model, with a 57% larger effect than the next strongest variable. When agents need many tools, adding more agents makes the problem worse, not better. The tools themselves become a coordination challenge.

If your domain involves CRM, email, enrichment APIs, analytics, compliance databases, and a dialer, you are already past the 16-tool threshold. Splitting that across multiple agents compounds the confusion rather than distributing it.

2. Capability Saturation at 45%

Multi-agent coordination yields diminishing or negative returns once a single agent can handle the task class at roughly 45% success rate. Below that threshold, coordination helps. Above it, the overhead eats the gains.

The practical implication: before splitting work across agents, measure how well one agent handles it. If it is already decent, adding more agents will not help. Multi-agent coordination is a solution for tasks that are genuinely too hard for one agent, not a general-purpose performance boost.
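The saturation rule reduces to a one-line decision gate. A minimal sketch, assuming you have already measured single-agent success on the task class; the function name and constant are mine, the 45% threshold comes from the DeepMind finding above:

```python
# Hedged sketch of the 45% saturation rule; names are illustrative.
SATURATION_THRESHOLD = 0.45

def should_split(single_agent_success_rate, parallelizable):
    """Split across agents only when the task class is parallelizable AND
    a single agent demonstrably struggles with it."""
    return parallelizable and single_agent_success_rate < SATURATION_THRESHOLD

print(should_split(0.62, parallelizable=True))   # False: one agent is already decent
print(should_split(0.30, parallelizable=True))   # True: struggling and parallel
```

Measure first, then decide; the gate is deliberately biased toward staying single-agent.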

3. Topology-Dependent Error Amplification

Independent agents with no coordination amplify errors 17.2x. Centralized coordination (one planner directing specialists) contains error amplification to 4.4x. The architecture determines how fast errors cascade.

ERROR AMPLIFICATION BY ARCHITECTURE (DeepMind 2025)

Architecture | Error Amplification | Source
Independent Agents (no coordination) | 17.2x | DeepMind 2025
Centralized Coordination (planner + specialists) | 4.4x | DeepMind 2025
Single Agent (baseline) | 1x | DeepMind 2025

Never deploy independent agents without a coordination layer. Even a lightweight central coordinator dramatically reduces error amplification. Fully decentralized swarm architectures without structure are the worst-performing pattern in the research.

Three Agents Beat Seven

DyLAN (COLM 2024) demonstrated that an optimized team of three agents outperforms a team of seven on the same tasks. Pruning from seven to four agents improved performance while cutting token costs by 52.9-67.8%.

The mechanism is an Agent Importance Score, inspired by neuron importance scoring in neural networks. Agents that do not measurably contribute to outcomes get pruned. The system gets better by getting smaller.

Combined with the DeepMind finding that coordination gains plateau beyond 3-4 agents, the message is clear: the right number of agents is almost always fewer than you think.
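The pruning mechanic can be sketched with static contribution scores standing in for DyLAN's learned Agent Importance Score; agent names and values here are illustrative, not from the paper:

```python
# Hedged sketch of importance-based pruning in the spirit of DyLAN.
def prune_team(scores, keep=3):
    """Keep only the agents whose measured contribution ranks highest."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]

scores = {"planner": 0.41, "coder": 0.35, "critic": 0.18,
          "cheerleader": 0.02, "relay": 0.02, "echo": 0.01, "scribe": 0.01}
print(prune_team(scores))  # ['planner', 'coder', 'critic']
```

Agents that do not move outcomes get cut; the team improves by shrinking.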

[Figure: Performance vs. number of agents. Performance peaks at three agents; beyond that point, coordination overhead exceeds the gains.]

Where Multi-Agent Actually Helps (And Where It Hurts)

The research does not say multi-agent systems are useless. It says the right architecture is domain-structure-dependent. The wrong architecture is one-size-fits-all.

Highly parallel tasks like research, financial analysis, and data enrichment benefit significantly from centralized multi-agent coordination. DeepMind measured an 80.9% improvement over single-agent baselines on parallelizable work.

Sequential reasoning tasks like planning, proof construction, and step-by-step workflows should stay single-agent. Every multi-agent variant degraded performance by 39-70% on sequential reasoning. This is not a marginal effect.

Tool-heavy domains with 16+ tools need single-agent execution or very structured delegation with explicit tool ownership per agent. Shared tool access compounds the coordination problem.

Creative and review tasks benefit from adversarial dynamics where agents with different roles challenge each other’s output. A single agent cannot credibly argue both sides.

Open-ended exploration is best served by Voyager’s skill library pattern (NeurIPS 2023): executable skills indexed by semantic embedding, retrieved via similarity search, and composed into complex behaviors. New skills are added through an automatic curriculum where the system identifies what it cannot do and learns it. This is the strongest model for capability organization in the literature.

The Six Criteria for Splitting an Agent

The combined research produces exactly six criteria that justify creating a separate agent instead of keeping a capability as a sub-skill. If none of these criteria are met, the capability belongs in a skill library, not a new agent.

  1. Persistent private state required across interactions. If the capability is stateless (inputs to outputs), it is a sub-skill or tool. If it maintains state that other agents should not access or modify, it may need isolation. This is Goldberg’s (2024) minimal differentiating property of true multi-agency.

  2. Fundamentally different toolset that would exceed 10-15 tools per agent. Tool confusion is a primary trigger for splitting. An agent managing CRM, email, enrichment, analytics, dialers, and compliance databases is holding too many tools in working context.

  3. Parallelizable AND single-agent success below 45%. Both conditions must be true. If the task is parallelizable but one agent already handles it well, do not split. If it is parallelizable AND one agent struggles, multi-agent coordination helps up to 80.9%.

  4. Adversarial or review dynamics needed. When two roles have different incentive structures (coder vs. reviewer, sales vs. compliance, writer vs. editor), they should be separate agents. A single agent cannot credibly argue both sides.

  5. Structured intermediate outputs needed between phases. MetaGPT’s SOP-driven approach with structured handoff artifacts scored 3.9/4 vs ChatDev’s 2.1/4. When the quality of the handoff between two phases matters enough to define a formal schema, that is often a natural agent boundary.

  6. Different model capabilities required. If one task needs vision and another needs code generation and another needs long-context reasoning, they may need different underlying models and therefore different agents.
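The six criteria can be encoded directly as a decision gate. A hedged sketch: the field names are my own invention, while the 45% and 15-tool thresholds come from the criteria above:

```python
# Hedged sketch of the six-criteria split gate; field names are illustrative.
from dataclasses import dataclass

@dataclass
class SplitAssessment:
    needs_private_state: bool      # 1. persistent private state
    tool_count: int                # 2. toolset size
    parallelizable: bool           # 3a. parallelizable...
    single_agent_success: float    # 3b. ...AND single agent struggling
    adversarial_roles: bool        # 4. review/adversarial dynamics
    formal_handoff_schema: bool    # 5. structured phase handoffs
    different_model_needed: bool   # 6. different model capabilities

def justify_split(a):
    """True only when at least one criterion is met; otherwise the
    capability belongs in a skill library, not a new agent."""
    return any([
        a.needs_private_state,
        a.tool_count > 15,
        a.parallelizable and a.single_agent_success < 0.45,
        a.adversarial_roles,
        a.formal_handoff_schema,
        a.different_model_needed,
    ])
```

The default answer is no split; every criterion has to be argued for with a measurement or a structural fact.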

The Router is a Facilitator, Not a Boss

The most common multi-agent anti-pattern is a “manager” agent that evaluates and overrides subordinate agents. This creates the exact bottleneck that multi-agent systems were supposed to eliminate.

[Figure: Hierarchical Boss (anti-pattern) vs. Thin Router + Peers (research-backed). Boss pattern: every message is relayed through the boss agent, introducing interpretation drift at each step and a single point of failure. Router pattern: the router selects the graph and steps back; peers read and write structured fields on a shared context bus, with no interpretation drift and no single point of failure.]

MAST’s failure analysis found that coordination breakdowns are the single largest failure category. The central relay pattern, where every message passes through a boss agent that reinterprets it, is the primary driver. Each relay step introduces interpretation drift.

The research-backed alternative: the router classifies the inbound task, selects the right execution graph, initiates execution, and then steps back. Sub-skills execute as peers, communicating through a shared structured state (context bus) rather than relaying messages through a central agent. The router never rewrites output, never overrides domain judgments, and never acts as a communication relay.

This is a fundamentally different model from hierarchical orchestration. There is no boss. There is a switchboard operator.
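A minimal sketch of the facilitator pattern. Everything here is illustrative: the graph names, the bus fields, and the keyword classify() stub standing in for what would really be an LLM call:

```python
# Hedged sketch of a thin "switchboard" router.
GRAPHS = {
    "simple":  ["triage"],
    "complex": ["triage", "enrich", "draft", "review"],
}

def classify(task):
    # Placeholder classifier; a real system would use an LLM call here.
    return "complex" if "conflict" in task else "simple"

def route(task, bus):
    """Select the execution graph, write shared state, then step back.
    The router never relays messages or rewrites peer output."""
    graph = GRAPHS[classify(task)]
    bus["task"] = task
    bus["active_graph"] = graph
    return graph

bus = {}
route("resolve a data conflict between CRM and enrichment", bus)
print(bus["active_graph"])  # ['triage', 'enrich', 'draft', 'review']
```

After route() returns, the router's job is done; the peers in the selected graph communicate only through the bus.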

Contribution-Based Influence Replaces Role-Based Authority

In hierarchical systems, the “senior” agent always has authority regardless of whether its decisions improve outcomes. This is the org-chart fallacy applied to LLMs.

DyLAN’s Agent Importance Score and SELFORG’s Shapley-based contribution estimation point to a different model: influence earned by measured outcomes, not assigned by role. Sub-skills that consistently improve task completions get higher weight in graph construction. Sub-skills that add noise get pruned. No sub-skill is permanently dominant.

Authority flows to wherever value is actually created, and shifts when conditions change.
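One simple stand-in for Shapley-style contribution estimation is an exponential moving average over measured outcomes. This is my simplification, not SELFORG's method:

```python
# Hedged sketch: an EMA over task outcomes stands in for Shapley-based
# contribution estimation. The learning rate is illustrative.
def update_weight(current, contributed_to_success, lr=0.1):
    """Nudge a sub-skill's influence weight toward its measured
    contribution. No weight is permanent: it rises and falls with outcomes."""
    target = 1.0 if contributed_to_success else 0.0
    return (1 - lr) * current + lr * target

w = 0.5
w = update_weight(w, contributed_to_success=True)   # -> 0.55
w = update_weight(w, contributed_to_success=False)  # -> 0.495
```

Sub-skills whose weights decay toward zero are candidates for pruning; none holds authority by role alone.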

Structured Communication, Not Free-Form Messaging

MetaGPT’s SOP-driven approach with structured intermediate outputs between agents scored 3.9 out of 4 in quality evaluations. ChatDev’s unstructured approach scored 2.1. Almost double the quality from structured handoffs alone.

The principle: when work passes between agents or sub-skills, the handoff must be a structured artifact with a defined schema, not a free-text summary. Without structure, the receiving agent must interpret the sending agent’s output, and that interpretation drift is the mechanism behind 36.9% of all multi-agent failures.

A structured context bus where sub-skills read and write defined fields eliminates the interpretation step entirely. The schema is the architecture. If it is not in the schema, sub-skills cannot communicate it. This prevents the echo chamber problem where fully connected topologies with free-form messaging flood each other with unstructured noise.

Task-Adaptive Graphs: Matching Complexity to Coordination

Not every task needs every sub-skill activated. Running a full coordination mesh on a simple task wastes tokens and introduces unnecessary overhead. Running a minimal graph on a complex task drops accuracy because critical sub-skills are not participating.

DyLAN showed up to 25% accuracy improvement from task-adaptive topology selection alone. The system should select from at least three tiers:

Simple (2-3 sub-skills, linear chain) for routine tasks with clear input. Sequential handoff, no lateral communication needed.

Standard (3-5 sub-skills, parallel branches) for moderately complex tasks with some ambiguity. Parallel sub-skills read and write the context bus simultaneously.

Complex (all sub-skills, full lateral mesh) for edge cases and conflicting signals. Every sub-skill active, resolving conflicts in real-time through the bus.

The topology itself should evolve. Track which tier produces the best outcomes for which task types. If the complex graph consistently matches the simple graph’s results for a certain class, demote it to save tokens. If the simple graph produces failures for a certain class, promote it. The graph adapts to measured reality.
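The promote/demote loop above can be sketched as: pick the cheapest tier whose measured success for a task class is within tolerance of the best tier. The tier names, the 0.02 tolerance, and the outcomes store are my assumptions, not DyLAN's code:

```python
# Hedged sketch of outcome-driven tier selection.
TIERS = ["simple", "standard", "complex"]  # ordered cheapest to most capable

def select_tier(task_class, outcomes):
    """Return the cheapest tier whose measured success rate is within
    tolerance of the best tier: demotion for free, promotion when
    simpler graphs measurably fail."""
    best_score = max(outcomes.get((task_class, t), 0.0) for t in TIERS)
    for tier in TIERS:  # cheapest first
        if outcomes.get((task_class, tier), 0.0) >= best_score - 0.02:
            return tier

outcomes = {
    ("triage", "simple"): 0.90, ("triage", "standard"): 0.91, ("triage", "complex"): 0.91,
    ("edge", "simple"): 0.50, ("edge", "standard"): 0.60, ("edge", "complex"): 0.80,
}
print(select_tier("triage", outcomes))  # simple: complex adds nothing, demote
print(select_tier("edge", outcomes))    # complex: simpler graphs fail, promote
```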

Designing for Failure

GTD (Guided Topology Diffusion, 2025) showed only 0.3% accuracy degradation under agent failure with redundant topologies, versus 13% for fixed structures. Resilience is a structural property, not a feature bolted on after the fact.

Every sub-skill should produce valid partial output independently. If one sub-skill fails, the graph degrades to a simpler tier rather than halting. The context bus persists state, so even if the router restarts, sub-skills with active bus connections continue operating in degraded mode. And degraded operations are logged and surfaced, never silently swallowed.
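A sketch of that degradation path, assuming graphs ordered from most to least capable and a runner that raises RuntimeError on sub-skill failure; the names are illustrative:

```python
# Hedged sketch of tiered graceful degradation.
def run_with_degradation(graphs, run_graph, log):
    """Try each graph in order; on failure, log and fall back to a simpler
    tier instead of halting. Degradations are surfaced, never swallowed."""
    for graph in graphs:
        try:
            return run_graph(graph)
        except RuntimeError as err:
            log(f"degrading: {graph} failed ({err})")
    return None  # every tier failed; the caller decides what happens next
```

The log callback is the "surfaced, never silently swallowed" requirement made concrete: every fallback leaves a trace.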

The Domain Decomposition Problem

Building good agents requires knowing what they need to do. This is harder than it sounds.

The research on multi-dimensional domain decomposition, drawing from Ranganathan’s faceted classification (1933) in library science, Knowledge Space Theory (Doignon & Falmagne, 1999), and cognitive task analysis (Crandall, Klein & Hoffman, 2006), reveals that professional domains are not trees. They are multi-dimensional coordinate systems.

Ranganathan’s core insight was that any subject can be decomposed along independent, orthogonal axes. The Classification Research Group later expanded his five categories to thirteen. Spiteri’s 1998 simplification established the operational test: each axis must be differentiating, relevant, ascertainable, permanent, homogeneous, and mutually exclusive. The Cartesian product of all axis values generates the complete space of possible descriptions.

Applied to agent construction, this means: instead of asking “what should this agent do?” (which produces a flat list), you identify the independent axes of a domain (entities, activities, knowledge types, stakeholders, lifecycle phases, context, tools, governance) and populate each independently. The intersections of those axes define specific capabilities. Each populated intersection is a potential agent skill.
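The Cartesian-product step is mechanical. A sketch with invented axes for a sales-like domain; the axes and their values are illustrative, not a real decomposition:

```python
# Hedged sketch of faceted decomposition: each intersection in the
# Cartesian product is a candidate capability to review, not a confirmed skill.
from itertools import product

axes = {
    "entity":   ["lead", "account"],
    "activity": ["enrich", "qualify", "outreach"],
    "phase":    ["intake", "active"],
}

# Populate each axis independently, then take the product of all values.
candidate_skills = [dict(zip(axes, combo)) for combo in product(*axes.values())]
print(len(candidate_skills))  # 2 * 3 * 2 = 12 intersections to review
```

Most intersections will be pruned as irrelevant; the value of the product is that nothing is omitted by accident.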

Completeness verification then borrows from safety engineering: HAZOP guide words (NO, MORE, AS WELL AS, OTHER THAN, PART OF) applied systematically to every category in the decomposition, pre-mortem analysis generating 25-30% more relevant concerns than direct questioning, and corpus-based coverage checking against job postings, certifications, and domain literature to flag unmapped terms.
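The HAZOP pass can also be generated mechanically: cross every guide word with every decomposition category to produce the review checklist. A minimal sketch with invented categories:

```python
# Hedged sketch: pairing HAZOP guide words (IEC 61882) with decomposition
# categories to generate completeness-review prompts.
GUIDE_WORDS = ["NO", "MORE", "AS WELL AS", "OTHER THAN", "PART OF"]

def review_checklist(categories):
    """Every (guide word, category) pair is one perturbation to consider."""
    return [(word, cat) for cat in categories for word in GUIDE_WORDS]

items = review_checklist(["lead enrichment", "compliance checks"])
print(len(items))  # 5 guide words x 2 categories = 10 review prompts
```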

This is not theoretical. I built this methodology into a working system grounded in 700+ sources and used it to construct production agents.

What I Built: Maestro

I released Maestro as an open-source implementation of these principles. It is a zero-dependency multi-agent orchestrator that drops into Claude Code, Codex, or Cursor as a single file.

Maestro implements the architecture the research points to:

  • Decision Gate that evaluates whether a task actually needs multiple agents, biased toward single-agent to prevent unnecessary coordination overhead
  • Planner Agent that decomposes complex tasks into parallel and sequential work
  • Specialist Agents with a hard ceiling of 4 per parallel group, based on the DeepMind and DyLAN findings on coordination plateaus
  • Cross-Talk Routing for communication between specialists when outputs affect one another
  • Staff Engineer Review that performs adversarial final verification

The orchestrator handles routing exclusively. It does not plan, does not perform work, and does not review output. That separation prevents the coordination failures that MAST identified as the dominant failure class.

Desktop implementations achieve wall-clock speedups through parallel execution. Cloud deployments remain token-efficient for complex tasks where a single agent would need 15+ sequential messages. The system ships as CLAUDE.md for Claude Code, AGENTS.md for tool-agnostic compatibility, and .cursorrules for Cursor.

The Research Papers

For anyone who wants to go deeper, these are the studies that informed this work:

Paper | Year | Venue | Key Finding
MAST | 2025 | NeurIPS Spotlight | 41-87% failure rates; 79% from coordination issues; 36.9% coordination breakdowns
DyLAN | 2024 | COLM | 3 agents outperform 7; 25% accuracy improvement from dynamic topology
DeepMind Multi-Agent Scaling Study | 2025 | arXiv | 3 scaling laws: tool-coordination tradeoff, 45% saturation, error amplification
Voyager | 2023 | NeurIPS | Skill library: executable skills indexed by embedding, composed dynamically
GTD (Guided Topology Diffusion) | 2025 | arXiv | 0.3% degradation under failure with redundant topologies vs 13% for rigid
SELFORG | 2025 | arXiv | Self-organizing communication graphs; Shapley-based contribution estimation
MetaGPT | 2023-2024 | | SOP-driven structured handoffs: 3.9/4 vs unstructured 2.1/4
Ranganathan Faceted Classification | 1933 | | Domains as orthogonal axes, not hierarchies; Cartesian product completeness
Knowledge Space Theory | 1999 | | Prerequisite dependency lattices for domain knowledge states
HAZOP (IEC 61882) | | | Systematic perturbation guide words for completeness verification
EVE Framework | 2025 | | Multi-query union improves recall up to 24% over single prompts

The One-Sentence Version

Build the simplest architecture that works, split only when quantitative criteria demand it, communicate through structured schemas, match coordination topology to task complexity, and evolve everything through measured outcomes.

Maestro is open source on GitHub. MIT licensed. Zero dependencies. Drop it in and go.


Disclosure: I am the creator of Maestro. This post is grounded in peer-reviewed research that I studied independently before building Maestro, but I have a direct interest in the architectural approach it implements. The research papers are linked above — read them and draw your own conclusions.

Mark Laursen

Advisor, founder, and executive producer with 25+ years building technology companies, gaming platforms, and entertainment products. Based in Portugal.