3 June 2026 · 37 min read · By Mark Laursen
Why I Stopped Using Multi-Agent Frameworks (And What Replaced Them)
Editor’s note (2026-06-16): Maestro has since moved to a Frontier-led model. The headline is now the Frontier Engine, a local multi-CLI fusion engine that fans a prompt across Opus 4.8, GPT-5.5, and Gemini 3.1 Pro and synthesizes one grounded answer; the discipline layer this piece describes is its proven foundation. The argument below stands as written; only the positioning around it has changed.
I was a true believer. I read the multi-agent papers as they came out. I built on AutoGen, then CrewAI, then LangGraph. I evangelized the framework approach to friends in DMs. Eighteen months later, I have stripped every framework out of every production workflow I run, and the systems are quieter for it.
The first time a multi-agent framework silently corrupted production state for me, it was 2:14 AM on a Tuesday, and I was watching a CrewAI workflow re-run the same database write three times because two agents had independently decided they were responsible for it. The first run succeeded. The second overwrote the first with a slightly different version of the same record. The third overwrote the second. By the time I looked at the trace, the workflow had reported success. The output that reached our customer was the third version. It was wrong in a way that took a human reading the source data to notice.
There was no exception. There was no warning. There was no log line that said “two agents are fighting over this row.” The framework’s abstraction was working exactly as designed: agents passed messages, made decisions, and produced output. The fact that two of them produced contradictory output for the same underlying record was not, from the framework’s perspective, an error. It was just three messages, three tool calls, three completions. The framework counted that as a successful run.
I spent the next three days reading every multi-agent trace from the prior month. The pattern was everywhere. Most workflows succeeded. The failures were not loud crashes. They were quiet contradictions. Two agents reaching different conclusions and the framework choosing one of them more or less arbitrarily. Sometimes the chosen one was right. Sometimes it was not. The reliability of the system was not a property of the agents or the model. It was a property of which agent happened to write last.
I tried Swarm when OpenAI released it. I evaluated MetaGPT for a research project. I believed, genuinely believed, that the future of LLM systems was composable agents talking to each other through framework-provided message buses, and that the job of an engineer was to define the agents, define the roles, and let the framework handle the orchestration.
I no longer believe any of that. This post is about why.
What the Frameworks Promise
The pitch is consistent across every framework in the space. AutoGen, CrewAI, LangGraph, Swarm, MetaGPT, and the dozen smaller ones that have appeared since 2024 all promise a version of the same thing.
You define agents. Each agent has a role, a system prompt, a set of tools, and possibly a memory. You define how the agents communicate, usually through messages, sometimes through shared state. You define the workflow, often as a directed graph or a hierarchy or a “crew.” Then you press run, and the framework takes care of routing messages between agents, deciding which agent acts next, handling tool calls, and aggregating output.
The promise is composability. Build an agent once, reuse it everywhere. The promise is separation of concerns. Each agent has one job, and the framework handles the joining. The promise is that you, the engineer, get to think about agents instead of orchestration, because the orchestration is solved.
The promise is wrong in a specific way. The orchestration is not solved. It is hidden. The framework has made a thousand decisions about routing, retries, message formatting, error handling, and concurrency, and most of those decisions are invisible to you until they break. When they break, they break in ways that are difficult to reproduce, difficult to debug, and difficult to fix without understanding the framework’s internals as well as the framework authors do.
I learned this slowly. Then all at once.
What failure modes do multi-agent frameworks produce in production?
Eighteen months of running these frameworks in production, across three different products, taught me that the failures cluster into three patterns. They are not framework-specific: AutoGen, CrewAI, and LangGraph all produce them, because the failure modes are properties of the abstraction, not of any one implementation.
Failure Mode One: Coordination Overhead Eats the Gains
The MAST study, published as a NeurIPS 2025 Spotlight, analyzed 1,642 execution traces across seven major multi-agent frameworks and found failure rates between 41% and 86.7%. Of those failures, 79% traced to specification and coordination problems rather than model capability. Coordination breakdowns alone accounted for 36.9% of all failures.
The Google Research multi-agent scaling study, released in December 2025 across 180 configurations and four benchmarks, confirmed the pattern at the architectural level. Coordination gains plateau at three to four agents. Beyond that, you are paying coordination overhead with no measurable benefit. Independent agents with no coordination amplify errors 17.2x compared to a single-agent baseline. Even centralized coordination amplifies errors 4.4x. The coordination layer is not free. It is the dominant source of cost.
The DyLAN paper at COLM 2024 demonstrated the corollary: three optimized agents outperform seven on the same tasks, and pruning from seven to four cut token costs by 52.9 to 67.8% while improving accuracy. The system gets better by getting smaller.
In practice, this means the framework’s job (deciding which agent acts next, routing messages, handling retries) is the part of the system most likely to fail. And because the framework hides those decisions behind an abstraction, you cannot easily see where the failures are happening or why.
I measured my own systems. Workflows orchestrated by CrewAI’s hierarchical mode had a 34% non-deterministic failure rate over three weeks of runs on identical inputs. The same workflows, rewritten as explicit Python with the same agents, had a 4% failure rate over the same period. Same models. Same prompts. Same tools. The only thing that changed was who was deciding the routing.
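A measurement like this can be taken by running a workflow repeatedly on one fixed input and counting the runs a checker rejects. The sketch below is illustrative only, not the harness behind the numbers above; `measure_failure_rate`, `workflow`, and `is_correct` are hypothetical names that belong to no framework.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def measure_failure_rate(
    workflow: Callable[[str], T],
    is_correct: Callable[[T], bool],
    fixed_input: str,
    runs: int = 50,
) -> float:
    """Run `workflow` on the same input `runs` times; return the fraction of failed runs.

    A run counts as failed if it raises or if `is_correct` rejects its output.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        try:
            output = workflow(fixed_input)
        except Exception:
            failures += 1  # loud failure: the run raised
            continue
        if not is_correct(output):
            failures += 1  # silent failure: the run completed but the output is wrong
    return failures / runs
```

Counting rejected outputs alongside exceptions matters here, because the failures described in this post mostly completed without raising anything.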
Failure Mode Two: Hidden State Mutation
Frameworks have to manage state: conversation history, intermediate results, tool outputs, scratchpad notes. Most frameworks manage it for you behind some form of shared memory or context bus, and most of them get the semantics subtly wrong.
The 2:14 AM incident I opened with was a hidden state mutation bug. Two agents in a CrewAI workflow had been given access to the same database tool. Both, independently, decided they were responsible for writing a particular record. The framework’s “manager” agent was supposed to coordinate their work, but the manager’s decisions were themselves model outputs, not deterministic logic, and the model decided that both agents should write to the database. The framework dutifully executed both writes. The state diverged. Nothing in the framework noticed, because the framework’s notion of “success” was that every agent completed its assigned task.
This is a category of bug, not an isolated incident. Any time two agents share access to a mutable resource and the framework decides who writes when, you have a race condition mediated by an LLM. Sometimes the LLM gets it right. Sometimes it does not. There is no equivalent of a database transaction or a typed locking primitive in any framework I have used. The shared state is “shared” in the loosest sense: every agent can see it and modify it, and the only thing preventing chaos is the model’s judgment.
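In explicit code, the missing primitive can be as small as a per-run ledger that refuses a second write to the same record. This is a minimal sketch under assumptions: `WriteLedger`, `guarded_write`, and the dict standing in for the database are hypothetical, and a real system would lean on the database’s own transactions or unique constraints.

```python
from dataclasses import dataclass, field


class DuplicateWriteError(RuntimeError):
    """Raised when a second write targets a record already written in this run."""


@dataclass
class WriteLedger:
    """Tracks which agent wrote which record during one workflow run."""

    owners: dict[str, str] = field(default_factory=dict)

    def claim(self, record_id: str, agent: str) -> None:
        owner = self.owners.get(record_id)
        if owner is not None:
            raise DuplicateWriteError(
                f"{agent!r} tried to write {record_id!r}, already written by {owner!r}"
            )
        self.owners[record_id] = agent


def guarded_write(
    ledger: WriteLedger, store: dict, record_id: str, agent: str, value: dict
) -> None:
    """Write `value` to `store` only if no agent has written `record_id` in this run."""
    ledger.claim(record_id, agent)  # raises before any state is touched
    store[record_id] = value
```

With a guard like this, a second agent deciding it also owns the record produces an exception at the point of conflict instead of a silent overwrite.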
In MetaGPT’s evaluation work, the SOP-driven version with structured intermediate outputs scored 3.9 out of 4, while ChatDev’s free-form messaging scored 2.1. The structured version nearly doubled the quality, and the only thing that changed was that the handoff between phases used a defined schema instead of free-text messages. The structured handoff is what makes the difference. Frameworks that let agents communicate in free text are inviting drift.
Failure Mode Three: Debugging Opacity
When a framework-orchestrated workflow fails, debugging it requires understanding three things at once: your prompts, the framework’s routing logic, and the model’s behavior. The framework’s routing logic is usually undocumented, sometimes unstable across versions, and almost always opaque in traces.
LangGraph’s traces, which are among the better ones in the ecosystem, show you the nodes that executed and the messages that passed between them. They do not show you why the framework chose one branch of a conditional edge over another, beyond a model output that may or may not be reproducible. CrewAI’s traces show the agents and their tool calls, but the manager agent’s deliberations are buried in conversation history and only sometimes accessible. AutoGen’s traces are the most complete, and they are still difficult to read because the framework’s group chat manager interleaves agent messages with framework metadata in a way that obscures the actual decision flow.
When I had a bad output, the question I needed to answer was: which decision in the chain produced this? The framework’s abstraction made that question hard to answer. Every layer of orchestration that the framework added was a layer I had to peel back before I could reach the actual problem. In some cases, peeling back required reading the framework’s source code. In one memorable case, the bug was in the framework’s source code, and I had to either patch it locally or wait for a release that fixed it.
The compounding effect: every framework upgrade was a potential regression. Frameworks evolve fast. Their orchestration logic changes between minor versions. A workflow that worked on CrewAI 0.30 might behave differently on 0.36, not because of any agent change but because the framework’s manager logic was rewritten. I started pinning framework versions aggressively. That worked until a security patch needed to be applied and I had to upgrade and re-test the entire system.
What replaced multi-agent frameworks in my production systems?
The pattern that replaced every framework I had been using is boring. That is its main feature.
I write a Python file. It imports an LLM client. It defines a small number of agents, where an “agent” is a function that takes a typed input, calls the LLM with a system prompt, parses the response into a typed output, and returns it. The orchestration is a sequence of function calls in regular Python. If two agents need to coordinate, they coordinate through normal data structures. If a step needs to retry, it retries with a normal try/except. If a result needs to be cached, it is cached with a dictionary. If the workflow needs to fan out and fan in, it uses asyncio.gather or a process pool.
That is the entire pattern. There is no framework. The orchestration code is the orchestration code. When something fails, the stack trace points to the line where it failed. When I want to test a workflow, I unit-test the agents in isolation, and I integration-test the orchestration with normal Python testing tools. When I want to add a new agent, I add a new function and call it from the appropriate place in the orchestrator.
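As a rough illustration of that shape, here is one agent written as a typed function plus a fan-out over several inputs. It is a sketch, not code from my systems: `LLMCall`, `Summary`, `run_summarizer`, and `summarize_all` are hypothetical names, and the LLM client is assumed to be any async callable that takes a system prompt and a user message and returns text.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

# Assumption: any async callable taking (system_prompt, user_message) and returning text.
LLMCall = Callable[[str, str], Awaitable[str]]


@dataclass(frozen=True)
class Summary:
    source: str
    text: str


async def run_summarizer(llm: LLMCall, source: str) -> Summary:
    """One 'agent': typed input in, one LLM call, typed output out."""
    raw = await llm("Summarize the source in one sentence.", source)
    text = raw.strip()
    if not text:
        # Fail loudly instead of passing an empty result downstream.
        raise ValueError(f"empty summary for {source!r}")
    return Summary(source=source, text=text)


async def summarize_all(llm: LLMCall, sources: list[str]) -> list[Summary]:
    """Fan out one summarizer per source, then fan in; results keep input order."""
    return await asyncio.gather(*(run_summarizer(llm, s) for s in sources))
```

Because the client is passed in as a parameter, a test can substitute a fake coroutine and exercise the orchestration without calling a model.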
A worked example illustrates the difference. Here is the same task expressed two ways: a research workflow that takes a topic, gathers sources, drafts a report, and reviews it. The framework version is shorter to write. The explicit version is longer but reveals what is happening.
# Framework version (CrewAI-style, abbreviated)
researcher = Agent(role="Researcher", goal="Gather sources", tools=[search_tool])
writer = Agent(role="Writer", goal="Draft report", tools=[])
reviewer = Agent(role="Reviewer", goal="Critique draft", tools=[])
crew = Crew(
    agents=[researcher, writer, reviewer],
    process=Process.hierarchical,
    manager_llm=manager_model,
)
result = crew.kickoff({"topic": topic})
What you do not see in this code: who decides when the researcher is done, how the writer receives the researcher’s output, what happens if the writer disagrees with what the researcher provided, how the reviewer’s critique gets folded back into the writer’s draft, what the manager agent does when one of the three produces an empty output, and how any of this is logged. All of those decisions are made by the framework, by the manager LLM, or by the model behavior of the individual agents. None of them are visible in the code.
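# Explicit version (deterministic orchestrator)
# Severity is assumed to be an ordered IntEnum (LOW < MEDIUM < HIGH). Comparing
# against the string "high" would sort alphabetically, so "low" would also trigger a rewrite.
def research_workflow(topic: str) -> Report:
    sources = run_researcher(topic, max_sources=8)
    if len(sources) < 3:
        # Too few sources: retry once with an alternative query.
        sources += run_researcher(topic, query_variant="alternative", max_sources=5)
    draft = run_writer(topic=topic, sources=sources)
    critique = run_reviewer(draft=draft)
    if critique.severity >= Severity.HIGH:
        # Rewrite once, passing the prior draft and the critique.
        draft = run_writer(topic=topic, sources=sources, prior_draft=draft, critique=critique)
    return Report(topic=topic, draft=draft, sources=sources, critique=critique)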
This version is longer. Every decision is a line of code. The retry policy on the researcher is explicit. The condition that triggers a rewrite is explicit. The data passed between agents is typed. If something fails, the stack trace points to the failing line. If the workflow needs to evolve, you change Python code, not framework configuration. Every step is unit-testable.
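To show what unit-testing a step can look like, here is a simplified, hypothetical variant of the draft-and-review portion with the agent functions passed in as parameters, followed by a test that uses plain fakes instead of a model. The names `draft_with_review` and `Critique`, and the integer severity scale, are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Critique:
    severity: int  # assumption: 0 = none, 1 = low, 2 = medium, 3 = high
    notes: str


def draft_with_review(
    topic: str,
    write: Callable[[str, str], str],   # (topic, feedback) -> draft
    review: Callable[[str], Critique],  # draft -> critique
    rewrite_threshold: int = 3,
) -> tuple[str, Critique]:
    """Write a draft, review it, and rewrite once if the critique is severe enough."""
    draft = write(topic, "")
    critique = review(draft)
    if critique.severity >= rewrite_threshold:
        draft = write(topic, critique.notes)
    return draft, critique


def test_high_severity_triggers_exactly_one_rewrite() -> None:
    calls: list[str] = []

    def fake_write(topic: str, feedback: str) -> str:
        calls.append(feedback)
        return f"{topic} draft v{len(calls)}"

    def fake_review(draft: str) -> Critique:
        return Critique(severity=3, notes="cite sources")

    draft, _ = draft_with_review("llamas", fake_write, fake_review)
    assert calls == ["", "cite sources"]  # first pass, then one rewrite with the notes
    assert draft == "llamas draft v2"
```

The test pins down the routing rule itself (one rewrite, carrying the critique), which is the part a manager model would otherwise decide at run time.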
The agents themselves can still be the same agents. The system prompts, the tools, the model choices, none of that needs to change. What changes is the layer above the agents. The framework’s manager is replaced by code I wrote, that I understand, and that I can debug.
What does the research say about multi-agent coordination failures?
The academic literature on multi-agent systems converged on this pattern faster than the framework ecosystem did. The MAST analysis identified coordination as the dominant failure source. The Google Research scaling study showed that error amplification is dominated by topology, not by individual agent quality. The DyLAN work showed that fewer, better-coordinated agents outperform larger swarms. The MetaGPT evaluation showed that structured handoffs nearly double quality versus free-form messaging.
The conclusion these papers point to, taken together, is that the architecture choice matters more than the agent choice. If you get the coordination right, mediocre agents produce reliable systems. If you get the coordination wrong, excellent agents produce unreliable ones. The frameworks, by hiding the coordination layer, deny you the ability to make the choice that matters most.
What I add from production is that the gap between “coordination right” and “coordination wrong” is not subtle. It is enormous. The 34% to 4% reliability gap I measured between framework-orchestrated and code-orchestrated versions of the same workflow is not the upper bound. On more complex workflows, with more agents, the gap grew. On a six-agent research-and-publication pipeline I rebuilt last summer, the framework version had a per-run success rate of 41%. The code-orchestrated version, with the same six agents and the same prompts, ran at 89%. The difference was entirely in the orchestrator.
I also add that the cost of explicit orchestration is lower than it sounds. The Python file that replaces a CrewAI workflow is usually 100 to 300 lines. The framework configuration it replaced was usually 60 to 150 lines. You are paying maybe 2x the source code volume for an order of magnitude of reliability improvement and full debuggability. The trade is obvious once you have made it. It is not obvious before, because the framework version looks compact and clean and the explicit version looks like more work.
The reason it looks like more work is that the explicit version is showing you the work. The framework version is hiding it. The work is still being done; you just cannot see it, and when it goes wrong you cannot see why.
I wrote a longer treatment of the architectural patterns this points toward in Why Your Multi-Agent AI System Keeps Failing, which surveys the research in more depth. The current post is the production-side companion to that one: not what the research says, but what the research means when you have been living with it for 18 months.
A Concrete Anatomy of a Framework Failure
To make the abstract concrete, here is the specific incident sequence from one of my CrewAI workflows last fall. It is representative of dozens of others I logged.
The workflow had four agents: a Researcher, a Drafter, a Fact-Checker, and a Publisher. The intended flow was sequential: research, draft, check, publish. CrewAI was running it in hierarchical mode with a manager agent coordinating the four specialists. The manager was a separate model invocation, configured per the framework’s documentation.
On run 47 of a daily batch, the Fact-Checker returned a critique flagging two factual issues. The manager, interpreting the critique, decided to route back to the Drafter with the corrections. The Drafter produced a corrected draft. So far, the workflow worked as designed.
What happened next was the bug. The manager, evaluating the corrected draft, decided that one of the original issues had not been adequately addressed and routed back to the Fact-Checker for re-verification. The Fact-Checker, with no memory of having flagged the issue originally, produced a new critique that did not mention the original issue but flagged a different one introduced during the rewrite. The manager, seeing a “new” critique, routed back to the Drafter again. The Drafter, fixing the new issue, reintroduced the original issue because its instructions for that pass did not include the original critique.
The workflow ran through eight cycles before hitting a CrewAI iteration cap and producing the most recent draft as final. The final draft contained the original factual issue. The workflow logged a successful completion. The Publisher published the output.
The bug was not in any agent. The Researcher worked correctly. The Drafter wrote correct prose. The Fact-Checker accurately critiqued whatever it was given. The Publisher faithfully published its input. The bug was in the coordination, specifically in the manager agent’s lack of persistent memory across iterations. The framework’s manager treats each routing decision as a fresh evaluation, and the framework does not provide a primitive for “remember the issues we have already flagged across cycles.”
The fix in explicit code is trivial. You maintain an issues_log list, you append to it whenever the Fact-Checker flags something, and you pass the full log to every Drafter invocation. Three lines of Python. Adding it inside the framework was difficult because there was no obvious place to put it; the manager agent’s state was not exposed in a way that I could augment, and the framework’s hooks did not include “before manager decision.”
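A minimal sketch of that fix, with hypothetical names throughout: `fact_check` and `redraft` stand in for the Fact-Checker and Drafter calls, and the choice to raise when the cycle limit is hit (rather than publish the latest draft) is an assumption of this sketch.

```python
from typing import Callable


def draft_until_clean(
    draft: str,
    fact_check: Callable[[str], list[str]],    # draft -> newly flagged issues
    redraft: Callable[[str, list[str]], str],  # (draft, all issues so far) -> new draft
    max_cycles: int = 4,
) -> tuple[str, list[str]]:
    """Alternate fact-checking and redrafting, carrying every flagged issue forward."""
    issues_log: list[str] = []
    for _ in range(max_cycles):
        new_issues = fact_check(draft)
        if not new_issues:
            return draft, issues_log
        # Append only issues not already logged, so the log stays free of duplicates.
        issues_log.extend(i for i in new_issues if i not in issues_log)
        # The Drafter always sees the full history, not just the latest critique.
        draft = redraft(draft, list(issues_log))
    raise RuntimeError(f"still failing fact-check after {max_cycles} cycles: {issues_log}")
```

Raising at the limit means a draft that never passes the check cannot be reported as a successful run.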
I migrated that workflow to explicit orchestration the next week. The bug class disappeared, not because I fixed the specific bug but because the explicit version made the state model legible. Once I could see the state, I could see the gap, and the gap was easy to fill.
The Cost of Abstractions That Wrap Models
There is a deeper pattern here that took me a while to articulate. When you build software around classical libraries, the libraries do not have judgment. A database client connects, sends, receives, and returns. A queue accepts and delivers. A logger writes. The behavior is mechanical and the abstraction is honest: the library does X, full stop.
Frameworks that wrap LLMs cannot honestly make that promise. The framework’s “logic” is partly its code and partly the model’s interpretation of prompts the framework has constructed. The same input can produce different routing decisions on different runs. The same routing decision can produce different downstream behavior depending on the manager model’s interpretation of the agents’ outputs. The abstraction’s behavior is not a function of its code; it is a joint function of its code and a probabilistic process the framework cannot fully control.
This is fine when you understand it. It is dangerous when the framework presents itself as a deterministic system. CrewAI’s documentation, like most framework documentation, describes flows and routing as if they were code-defined. They are partly code-defined and partly model-defined, and the model-defined parts are precisely the parts that fail nondeterministically in production.
The honest framing would be: “this framework calls a model to coordinate other models, and the coordination model’s behavior will vary.” That is not how the frameworks pitch themselves, because that pitch is unappealing. But it is what they actually are. Once you internalize that, the case for explicit code orchestration becomes obvious. Replace the coordination model with code. Keep the worker models, because their judgment is what you wanted in the first place. Now the only nondeterministic parts of the system are the parts that need to be nondeterministic, and the deterministic glue is deterministic.
This is not a sophisticated insight. It is the same principle behind every pipeline architecture in the past 30 years. Compute the things that compute. Coordinate with code. Do not invert that, because the inversion is unstable.
What improved when I migrated from frameworks to explicit orchestration?
I want to give specific numbers. Vague claims about “better” do not help anyone deciding whether to make the same migration.
Across five workflows that I migrated from frameworks to explicit code in the second half of 2025, here is what changed:
Per-run success rate, measured as workflows that produced output a human reviewer judged correct on a sampled basis, improved from a weighted mean of 64% to 91%. The biggest gain was on the most complex workflow, where the framework version succeeded 41% of the time and the explicit version succeeded 89%. The smallest gain was on the simplest workflow, where the framework version was already at 78% and the explicit version reached 94%.
Median time to debug a production failure dropped from 47 minutes to 11 minutes. This was the gain I felt most viscerally. With the frameworks, debugging required understanding the framework’s traces, reproducing the issue (which often did not reproduce because of nondeterminism), and inspecting the model outputs at each step. With explicit code, the stack trace pointed at the failing line, the inputs were captured by my tracing wrapper, and the issue usually reproduced on the second run.
Token usage per workflow dropped by a weighted mean of 31%. The savings came from removing the manager agent invocations that frameworks make to coordinate work. CrewAI’s hierarchical mode, in particular, sends a substantial number of tokens to the manager that the explicit version does not need. On long workflows the manager’s token usage can exceed the workers’. Removing it is pure savings.
The number of retries triggered by silent failures dropped to roughly zero. With frameworks, I had instrumented retry logic at the workflow level to handle cases where the framework reported success but the output was wrong. With explicit code, the conditions that triggered those retries became unreachable; if a step failed, it raised an exception, and I either handled it or the workflow stopped. The implicit retries, on top of the framework’s own retries, were a major source of token waste and latency variance.
Wall-clock latency was mixed. Most workflows ran slightly faster on explicit code because of the manager-removal savings. A few were slower because the explicit version was more conservative in its concurrency, choosing to serialize steps that the framework had been running in parallel. After tuning, I recovered the parallel speedup where it mattered, but the gain was modest. The wins were in reliability and debuggability, not raw speed.
Operational confidence improved in a way that is harder to quantify but probably the most important effect. With the frameworks, every customer-facing run carried a small amount of dread; I knew failures were possible, often silent, and difficult to diagnose. With explicit code, I sleep through the workflows. The system fails the way I expect it to fail, when it fails, and the fixes are usually obvious. That is the state I wanted to be in all along.
What I Would Tell Someone Starting Today
If you are starting a new multi-agent project in 2026, here is the advice I would give myself 18 months ago.
Start with explicit orchestration if you have any meaningful reliability requirement. Do not adopt a framework just because it is the default in tutorials. Read the orchestrator-as-code pattern, write a prototype, and see how it feels. It will feel like more work in the first hour and less work in the first week.
If you do start with a framework, give yourself a checkpoint. Decide in advance what failure rate or what production friction triggers a migration. Otherwise the framework’s path of least resistance will keep you on it past the point where it is the right tool. The path of least resistance, in framework-land, is to add another agent and hope coordination improves. That hope is not supported by any data I have seen.
Resist the temptation to add agents. Every multi-agent system I have shipped or seen succeed has fewer agents than its first design called for. The Google Research scaling study showed that coordination plateaus at three to four agents. DyLAN showed that pruning agents improves quality. My own experience matches both. If you find yourself reaching for a fifth agent, ask whether the work that fifth agent would do can be folded into one of the existing four. The answer is usually yes.
Invest in structured handoffs. The MetaGPT result, structured intermediate outputs scoring nearly twice as high as free-form messaging, is not a curiosity. It is the dominant variable in handoff quality. Define the schema between agents. Use typed dataclasses or Pydantic models. Validate at boundaries. Make the receiving agent’s job easy by giving it structured input rather than asking it to interpret prose.
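As one possible shape for a validated boundary, the sketch below parses a researcher’s JSON output into frozen dataclasses and rejects anything malformed before the next agent sees it. The schema (`ResearchHandoff`, `Source`) and the field names are hypothetical; Pydantic models would serve the same purpose with less hand-written checking.

```python
import json
from dataclasses import dataclass


class HandoffError(ValueError):
    """Raised when an agent's output does not satisfy the handoff schema."""


@dataclass(frozen=True)
class Source:
    url: str
    claim: str


@dataclass(frozen=True)
class ResearchHandoff:
    topic: str
    sources: tuple[Source, ...]


def parse_research_handoff(raw: str, min_sources: int = 1) -> ResearchHandoff:
    """Validate a researcher's JSON output before the writer ever sees it."""
    try:
        data = json.loads(raw)
        topic = data["topic"]
        sources = tuple(Source(url=s["url"], claim=s["claim"]) for s in data["sources"])
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise HandoffError(f"malformed research handoff: {exc!r}") from exc
    if not isinstance(topic, str) or not topic.strip():
        raise HandoffError("topic must be a non-empty string")
    if len(sources) < min_sources:
        raise HandoffError(f"expected at least {min_sources} source(s), got {len(sources)}")
    return ResearchHandoff(topic=topic, sources=sources)
```

The receiving agent then gets a `ResearchHandoff` or the workflow stops with a `HandoffError`; there is no third case where prose of unknown shape drifts downstream.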
Instrument from day one. Whether you use a framework or explicit code, instrument every agent call with input, output, latency, and token usage. Most production debugging is “what was the input to this step and what did it produce?” If you cannot answer that question quickly, you cannot debug the system. The instrumentation is cheap; the lack of it is expensive.
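A tracing wrapper for this can be a plain decorator. The sketch below is an assumption-laden minimum, not my production wrapper: `traced`, `CallRecord`, and the in-memory `TRACE` list are hypothetical, agents are assumed to take keyword arguments only, and token usage is read from a `tokens` attribute on the output if one exists.

```python
import functools
import time
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class CallRecord:
    agent: str
    inputs: dict[str, Any]
    output: Any
    error: Optional[str]
    latency_s: float
    tokens: Optional[int]


TRACE: list[CallRecord] = []  # assumption: in-memory sink; swap for a real log or database


def traced(agent: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Record keyword inputs, output, error, latency, and token usage for every call."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(**kwargs: Any) -> Any:
            start = time.perf_counter()
            output, error = None, None
            try:
                output = fn(**kwargs)
                return output
            except Exception as exc:
                error = repr(exc)
                raise  # tracing never swallows the failure
            finally:
                TRACE.append(CallRecord(
                    agent=agent,
                    inputs=dict(kwargs),
                    output=output,
                    error=error,
                    latency_s=time.perf_counter() - start,
                    # Assumption: outputs expose token usage as `.tokens`, if at all.
                    tokens=getattr(output, "tokens", None),
                ))

        return wrapper

    return decorator
```

Decorating an agent function with `@traced("researcher")` is then enough to answer what went into a step and what came out of it, including for calls that raised.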
Treat your prompts as code. Version them, test them, review changes to them, and document why they say what they say. Most of the regressions I have seen in agent systems came from prompt edits that no one tested. The framework or the orchestrator does not protect you from prompt regressions. Treating prompts like code does.
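One lightweight way to do that is to keep each prompt as a versioned object in the repository, with a content fingerprint to log and a render step that fails on missing values. This is a sketch with hypothetical names (`Prompt`, `REVIEWER_V2`) and an invented example template, not a prompt from my systems.

```python
import hashlib
from dataclasses import dataclass
from string import Formatter


@dataclass(frozen=True)
class Prompt:
    name: str
    version: str
    template: str

    def fields(self) -> set[str]:
        """Names of the {placeholders} the template expects."""
        return {name for _, name, _, _ in Formatter().parse(self.template) if name}

    def fingerprint(self) -> str:
        """Short content hash; log it with each call so outputs map to an exact prompt."""
        return hashlib.sha256(self.template.encode("utf-8")).hexdigest()[:12]

    def render(self, **values: str) -> str:
        """Fill the template, refusing to run with any placeholder left unfilled."""
        missing = self.fields() - set(values)
        if missing:
            raise KeyError(f"prompt {self.name!r} missing values for {sorted(missing)}")
        return self.template.format(**values)


# Hypothetical example prompt; the wording is illustrative only.
REVIEWER_V2 = Prompt(
    name="reviewer",
    version="2",
    template="Critique the draft on {topic}. List factual issues only.\n\n{draft}",
)
```

A test can then assert the placeholder set for each prompt, so an edit that drops `{draft}` fails in review instead of in production.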
My Own Orchestrator: Maestro
After two years of doing this across projects, the rules I keep redrawing have congealed into a small specification I call Maestro. It is open source and free to fork.
Maestro is not a framework you import. It is a prose specification, written in plain Markdown, that an orchestrator follows to decide whether a task even needs multiple agents, produce a planning manifest before any specialist runs, isolate each specialist’s context, route structured handoffs between them, and run a final review pass before delivering results. The same patterns this post argues for, written down once so I am not re-deriving them per project.
Most of what is in Maestro is what is in this post. A decision gate for single-agent versus multi-agent work, because most tasks do not need multiple agents. A planner that produces a structured task manifest with file scopes, dependencies, and acceptance criteria. A four-agent cap per parallel group, because the Google Research scaling study shows coordination gains plateau there. Typed OUTPUT and ACCEPT fields at every handoff, so a specialist’s contract is explicit rather than implied. A staff-engineer review pass that checks for cross-talk damage and scope creep before the work is delivered.
What Maestro is not: an orchestration runtime, a framework with its own bus, or anything that hides coordination decisions. The orchestrator (a human, or a single capable agent) reads the spec and follows it. Specialists are spawned with isolated context and structured output contracts; the orchestrator routes the result. The coordination is in prose rules you can read and change, not in a hidden manager model.
This is the explicit-decomposition pattern composed once. If you want a starting scaffold rather than a blank Python file, that is what it is for. If you want to write your own from scratch, the rules in Maestro are short enough to read in a sitting and adapt to your own workflow.
Frequently asked questions
Are multi-agent frameworks ever the right choice?
For prototyping and exploration, yes. A framework gets you to a runnable system in an afternoon, and the opacity that hurts in production is tolerable when you are in a notebook discovering the structure of the problem. For production systems where reliability matters, no. The framework abstractions hide the failure modes you most need to see.
What did you actually replace AutoGen and CrewAI with?
Explicit code orchestration: a small number of well-scoped agents, each one a function that takes a typed input and returns a typed output, called by deterministic Python that handles routing, context passing, and error recovery. The coordination logic is now testable, debuggable, and version-controlled. The agents themselves did not change. Only the layer above them changed.
Why does the reliability gap matter so much?
Because the failures are mostly silent. The framework reports success. The output reaches the customer. You find out later, if you find out at all. In my own measurements, the same workflow dropped from a 34% non-deterministic failure rate under CrewAI’s hierarchical mode to a 4% failure rate under explicit orchestration. Same models, same prompts, same tools. The gap is entirely in who is doing the routing.
Does this advice hold for major-provider frameworks like OpenAI Swarm or Anthropic’s agent SDKs?
Yes, structurally. Any framework that hides coordination decisions behind an abstraction will hide coordination failures behind the same abstraction. The provider does not change that. The Anthropic agent SDKs are more honest about what they are (utilities for calling models with tools), and used that way they are excellent. Used as a full multi-agent framework, they exhibit the same problems as the open-source options.
How long do these migrations actually take?
Between two days and three weeks, depending on workflow size. The two-day ones were small, three-agent pipelines. The three-week one was a large customer-facing system with eight agents and four tools. None of them took longer than it would have taken to debug the framework version’s reliability problems, and most paid back the migration time within the first month.
Closing
The abstractions you choose determine the failure modes you will have. With multi-agent frameworks, the dominant failure mode is hidden coordination, and the dominant cost of fixing it is reaching past the abstraction the framework provides. With explicit decomposition, the dominant failure mode is the same coordination, but you can see it, and reaching past your own code is what writing code is. The frameworks invert the engineering job in a way that produces short-term productivity and long-term liability. Explicit decomposition turns the job right side up again.
I did not arrive at this view through theory. I arrived at it through 18 months of production use, three migrations, two database corruption incidents, and one 2 AM debug session that ended with me writing a Python file from scratch and never going back. The frameworks made my prototypes faster and my production worse. The trade was not worth it for me. It might not be worth it for you either.
If you are still using AutoGen or CrewAI or LangGraph in production, the question I would ask is not “should I switch” but “do I know what my coordination layer is doing right now, and can I prove it?” If the answer is yes, keep what you have. If the answer is no, the cost of finding out, by writing the orchestrator yourself, is much smaller than the cost of not finding out, which is the slow accumulation of silent failures you will discover one at a time, usually at 2 AM.
The shortest version of everything in this post: agents are not the hard part. The hard part is what happens between them. Pick the abstraction that lets you see what is happening between them. The frameworks were not that abstraction for me. Code was.
When Frameworks Still Make Sense
I want to be careful here. There are real cases where multi-agent frameworks are the right choice, and I do not want to dismiss them.
Prototyping is one. If you do not yet know what your workflow looks like, what agents you need, or how they should coordinate, a framework gets you to a runnable system in an afternoon. The opacity that hurts in production is fine in a notebook where you are exploring possibilities. AutoGen and CrewAI are both excellent for this. You discover the structure of the problem by trying things; the framework’s defaults are reasonable enough for exploration; the failure modes are tolerable because you are not depending on the system.
Research is another. If your goal is to compare orchestration strategies, study agent communication patterns, or evaluate how different topologies affect outcomes, a framework gives you a substrate for systematic experimentation. Most of the academic work I cited above was conducted on or with frameworks. They are useful tools for studying multi-agent behavior, even when they are the wrong tool for productionizing it.
Pure exploratory or open-ended work is a third. Voyager’s skill library pattern, where agents discover new capabilities through curriculum-driven exploration, benefits from a flexible substrate. If you do not know what coordination pattern is right for your problem, a framework lets you try several quickly. Rigid hand-coded orchestration, in those cases, would be premature commitment to a structure you have not validated.
The line I draw is reliability. If a workflow’s failure has a real cost (customers see it, money depends on it, or errors propagate downstream), then the workflow needs to be code, not framework. If a workflow can fail freely, in a notebook or a research environment or a fast iteration loop, the framework is fine. The moment to move from framework to explicit orchestration is the moment the workflow becomes load-bearing.
Migration Path: From Framework to Explicit
Most of my migrations followed three phases. They are not glamorous, but they worked.
Phase one is logging. Before changing anything, instrument the framework version aggressively. Log every agent input, every agent output, every routing decision the framework makes (or as much of it as you can extract from traces), every tool call, every retry. Run the system for two weeks with the new logging. The goal is to understand what the framework is actually doing, not what you think it is doing. Most of my migrations revealed surprising patterns: agents that were never invoked, branches that always took one path, retries that masked silent failures, or routing decisions that depended on subtle prompt phrasing.
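One way to capture inputs and outputs during this phase is a decorator that appends a JSON line per call. This is only a sketch: `logged`, the record fields, and the `summarize` function are hypothetical. It assumes you can wrap the Python callables the framework invokes (tools, agent entry points). Routing decisions made inside the framework itself would still have to come from its own traces.

```python
import functools
import io
import json
import time
from typing import Any, Callable, TextIO


def logged(step: str, sink: TextIO) -> Callable:
    """Wrap a callable so each call appends one JSON line to sink."""

    def decorate(fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            record = {"step": step, "ts": time.time(), "input": repr((args, kwargs))}
            try:
                result = fn(*args, **kwargs)
                record["output"] = repr(result)
                return result
            except Exception as exc:
                # Record the failure, then let it propagate unchanged.
                record["error"] = repr(exc)
                raise
            finally:
                sink.write(json.dumps(record) + "\n")

        return wrapper

    return decorate


sink = io.StringIO()  # in practice this would be an append-mode file


@logged("summarize", sink)
def summarize(text: str) -> str:
    return text[:10]


summarize("hello world, more text")
records = [json.loads(line) for line in sink.getvalue().splitlines()]
```

Each record carries either an `output` or an `error` key, which makes retries that mask failures easy to spot when the log is filtered by step.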
Phase two is replacing the orchestrator while keeping the agents. Take the agents as they are, prompts and all, and write a Python orchestrator that calls them in the same order the framework was calling them. This is a mechanical translation. You are not redesigning the workflow; you are transcribing it. The first version will likely be slightly worse than the framework version, because you are reproducing all the framework’s quirks faithfully. That is fine. The point of phase two is to get to parity, not improvement. Once you have parity, you have a codebase you can reason about.
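A parity check for this phase might look like the sketch below. `RecordedRun`, `parity_report`, and the two stub agents are hypothetical names. The default exact-match comparison only makes sense for deterministic stubs or pinned outputs, so with live models the `same` function would need to be something looser.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RecordedRun:
    """One input/output pair captured from the framework version (assumed shape)."""

    input_text: str
    framework_output: str


def research(text: str) -> str:
    return f"notes({text})"  # stub agent


def write(text: str) -> str:
    return f"draft({text})"  # stub agent


def orchestrate(text: str) -> str:
    """Transcription of the observed call order: research first, then write."""
    return write(research(text))


def parity_report(
    runs: List[RecordedRun],
    orchestrator: Callable[[str], str],
    same: Callable[[str, str], bool] = lambda a, b: a == b,
) -> List[RecordedRun]:
    """Return the recorded runs where the new orchestrator disagrees."""
    return [r for r in runs if not same(orchestrator(r.input_text), r.framework_output)]


runs = [
    RecordedRun("a", "draft(notes(a))"),  # framework ran both agents
    RecordedRun("b", "draft(b)"),  # framework skipped research on this input
]
mismatches = parity_report(runs, orchestrate)
```

Here `mismatches` contains only the second run. That is the kind of quirk the transcription has to either reproduce or consciously drop before moving on.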
Phase three is improvement. Now that the orchestration is explicit, you can see what is happening, and you will see things you want to change: the order of operations, the retry policy, the conditions that trigger fallbacks, the way data flows between agents. You change them one at a time, with tests, and you measure the effect. Most of the reliability gains I saw came from phase three, not phase two. The gain in phase two was from having a system you could understand. The gain in phase three was from using that understanding.
The migrations took between two days and three weeks each, depending on the size of the workflow. The two-day ones were small, three-agent pipelines. The three-week one was a large customer-facing system with eight agents and four tools. None of them took longer than it would have taken to debug the framework version’s reliability problems. Most of them paid back the migration time within the first month of post-migration operation.
The Hard Parts
I do not want to make this sound easier than it is. Explicit orchestration has real costs. The frameworks did not exist by accident; they exist because there are genuine pain points they were trying to address. When you give up the framework, you take on those pain points yourself.
State passing is one. Frameworks give you a default model for shared state, usually a context object that gets passed around or a memory store that agents read and write. When you do explicit orchestration, you have to design that model yourself. For simple workflows, normal Python data structures are fine. For complex workflows with many agents and shared resources, you need something more deliberate, typically a typed dataclass or a domain-specific state object that gets passed through the workflow. Designing this well takes thought. Designing it badly produces tangles that resemble the framework problems you were trying to escape.
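A typed state object of the kind described here could be sketched as a frozen dataclass that each step copies rather than mutates. The field names and both step functions are illustrative assumptions, not taken from any particular workflow.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class WorkflowState:
    """Everything the workflow knows, in one typed, immutable object."""

    ticket_id: str
    source_text: str
    summary: Optional[str] = None
    issues: Tuple[str, ...] = ()


def summarize_step(state: WorkflowState) -> WorkflowState:
    # A real step would call a model; slicing keeps the sketch runnable.
    return replace(state, summary=state.source_text[:40])


def review_step(state: WorkflowState) -> WorkflowState:
    if state.summary is None:
        # Ordering mistakes fail loudly instead of passing None downstream.
        raise ValueError("review_step requires a summary")
    found = ("summary is empty",) if not state.summary.strip() else ()
    return replace(state, issues=state.issues + found)


initial = WorkflowState(ticket_id="T-1", source_text="hello")
final = review_step(summarize_step(initial))
```

Because `replace` returns a new object, no step can silently overwrite another step's fields. Each step's effect shows up as the difference between its input state and its output state.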
Error recovery is another. Frameworks have default retry behavior, default timeout handling, default fallback strategies. They are not always right, but they are present. When you write explicit orchestration, you have to decide what each step does on failure. Retry how many times? With what backoff? What input changes between retries? When do you give up? When do you fall back to a different agent or a different prompt? These decisions are not hard individually, but they multiply across a workflow with many steps. Most of my orchestrators have a small library of retry helpers and fallback patterns that I reuse, and building that library was non-trivial.
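A retry helper of the sort mentioned above might start out like this sketch. The name `call_with_retry` and its parameters are hypothetical, and it answers only some of the questions in the paragraph: attempt count, exponential backoff, and a final fallback. Changing the input between retries is left out.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    primary: Callable[[], T],
    *,
    attempts: int = 3,
    base_delay: float = 0.5,
    fallback: Optional[Callable[[], T]] = None,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try primary up to `attempts` times with exponential backoff.

    If every attempt fails, call `fallback` when one is given;
    otherwise re-raise the last error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return primary()
        except retry_on as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    if fallback is not None:
        return fallback()
    assert last_error is not None
    raise last_error
```

Injecting `sleep` keeps tests fast: a test can pass `sleep=recorded.append` and assert on the delays instead of actually waiting.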
Telemetry is the third. Frameworks usually have integrated tracing, sometimes good, sometimes bad. When you write explicit orchestration, you have to instrument it yourself. I use a thin tracing library that wraps each agent call in a span, captures the input and output, and records latency and token usage. The library is about 200 lines of code, but it took me a few iterations to get right. Without it, debugging an explicit orchestrator is harder than debugging a framework one, because you have to add print statements yourself instead of inheriting a trace UI from the framework. With it, debugging is much easier than the framework version, because the traces show your data, not the framework’s wrappers around your data.
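The core of such a span wrapper can be sketched in a few dozen lines. This is not the library described above, only an illustration of the idea. The `Tracer` and `Span` names and their fields are assumptions, and a real version would also export spans somewhere durable.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator, List, Optional


@dataclass
class Span:
    name: str
    payload: Any  # the agent's input
    output: Any = None
    tokens: Optional[int] = None
    error: Optional[str] = None
    latency_s: float = 0.0


@dataclass
class Tracer:
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str, payload: Any) -> Iterator[Span]:
        """Time the enclosed block and record its input, output, and errors."""
        record = Span(name=name, payload=payload)
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            record.error = repr(exc)
            raise
        finally:
            record.latency_s = time.perf_counter() - start
            self.spans.append(record)


tracer = Tracer()
with tracer.span("summarize", payload="long text") as s:
    # A real agent call goes here; the caller fills in what it learned.
    s.output = "short"
    s.tokens = 12
```

The spans hold the workflow's own values directly, so inspecting `tracer.spans` after a run shows the data each agent saw and produced without any wrapper types in between.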
The honest summary is that explicit orchestration is more work upfront and dramatically less work over the lifetime of the system. You front-load the cost of designing state, designing error recovery, and designing telemetry. In return, you get a system that does what you wrote, fails where you can see it, and improves when you change it. The frameworks invert that trade: they give you a fast start and a slow grind, where every improvement runs into the framework’s assumptions and every failure runs into the framework’s opacity.
What This Means for Major-Provider Frameworks
The major LLM providers have all released their own agent frameworks. OpenAI Swarm. Anthropic’s agent SDKs. Google’s frameworks. They are not exempt from the failure modes I described.
Swarm’s design is cleaner than CrewAI’s, and it explicitly favors stateless agent functions, which I appreciate. But the moment you build anything non-trivial, you discover that Swarm has the same coordination opacity as the open-source frameworks: handoffs between agents are governed by model decisions, and the framework’s role is to dispatch those handoffs without revealing how the model decided. The same race conditions and the same routing failures appear, in slightly different form.
The Anthropic agent SDKs, including the one I use daily, are more honest about what they are. They are utilities for calling models with tools, not full orchestration frameworks. The orchestration is left to your code. Used that way, they are excellent. Used as a multi-agent framework, they exhibit the same problems as the others.
The pattern is structural. Any framework that hides the coordination decisions behind an abstraction will hide the coordination failures behind the same abstraction. The provider does not change that. The implementation language does not change that. The number of stars on GitHub does not change that. If the abstraction is “you describe the agents and we handle the orchestration,” then the orchestration is the part you cannot see, and the part you cannot see is where the failures come from.
What I Am Still Wrong About
I do not want to oversell my own conviction here. Eighteen months is long enough to draw conclusions. It is not long enough to be sure of them.
The first thing I am probably wrong about is how much of this generalizes outside my workload. My workflows are mostly content pipelines, code reviews, and research synthesis. Teams running customer-facing chat agents, autonomous research swarms, or robotics control loops will hit failure modes I never see. The advice “use code, not frameworks” might still hold for them, but the specific failure modes will look different, and some of my advice will be the wrong calibration.
The second thing is timing. Frameworks evolve. CrewAI 0.36 is not CrewAI 0.30, and the manager logic has improved. There is a future version of one of these tools that closes the abstraction gap I am complaining about. When that arrives, the trade I am making today will need to be re-evaluated, and I will probably re-evaluate it slowly because I am biased by the migrations I already did.
The third thing is taste. Some engineers genuinely think better in framework abstractions than in explicit code. For them, the framework is the right tool, and my preference for explicit code is a personal artifact, not a universal truth. I would still argue the data favors explicit code on reliability, but reliability is not the only axis that matters for a working system.
If you read this post and your gut says “I disagree, frameworks work for me,” your gut might be right and I might be missing something specific to your context. The honest version of this post is “frameworks did not work for my context, and I think the reasons generalize, but you should test the claim against your own runs.”
The Meta-Lesson
The abstractions you choose determine the failure modes you will have. This is true of every system, but it is especially true of multi-agent systems, because the coordination layer is where failures concentrate, and the framework’s job is to abstract the coordination layer away.
When the abstraction is good, you do not pay attention to it; it works. When the abstraction is wrong for your problem, you cannot fix it without reaching past the abstraction, at which point you are doing the work the abstraction was supposed to save you. With multi-agent frameworks, in production, the abstraction has been wrong for me about 90% of the time. The remaining 10%, where the framework’s defaults happen to fit my workflow, are not enough to justify the cost of the other 90%.
This is not a critique of the framework authors. They are solving a hard problem, and they are solving it for a wide range of users with different needs. The frameworks are useful in the contexts where they are useful. The mistake was mine: I assumed that “useful for prototyping” extrapolated to “useful for production,” and it does not.
The shorter version: agents are easy. Coordination is hard. Frameworks make agents look easy and hide the coordination. When you remove the framework, agents are still easy. The coordination is now visible, and you can do something about it. That is the trade I would make every time, on every workflow that has to be reliable, with every team that has time to write a Python file instead of a YAML config.
The frameworks promised composability. What they delivered was a particular kind of debt: small upfront, large over time, paid in 2 AM incidents and silent state corruption. I would rather pay the debt where I can see it. That is what explicit decomposition is. Not magic. Not heroic. Just the work that needed to be done, where you can read it, change it, and trust it.
Advisor, founder, and executive producer with 25+ years building technology companies, gaming platforms, and entertainment products. Based in Portugal.