6 May 2026 · 28 min read · By Mark Laursen
What 18 Months of Production AI Agents Actually Taught Me
It is humbling to sit and total up eighteen months of running AI agents in production. There have been more dead ends, more 2 AM dashboards, and more rewrites than I would have admitted to anyone at the start.
The first call came at 2:14 in the morning. A retry storm had been running for nine hours on a workflow that was supposed to take eleven minutes. Each failure had triggered a retry. Each retry had bought a fresh context window. Each fresh context had re-summoned the same downstream tools. By the time I logged in, the agent had spent more on a single document review than the contract was worth.
That was month three. I had been told, by the vendor’s marketing pages and by my own optimism, that production agents at this point would be saving me time. Instead I was sitting in front of a billing dashboard at 2 AM, watching a number go up.
Eighteen months later, the systems run quietly. The bill is half what it was at the peak. The agents do work I used to do, and they do it well enough that I notice when they break, not when they succeed. Getting here was not a smooth curve. It was a series of failures that did not appear in any paper I had read, and a smaller number of patterns that worked better than the literature predicted.
This post is the retrospective I wish I had read before I deployed the first agent. Five things that broke in production, three that worked unexpectedly, the cost arc no one warned me about, and a short list of advice I would give my past self.
What did 18 months of production agents actually look like?
The systems I run are an orchestration framework I now publish as Maestro, an agent governance proxy I now publish as Govyn, and the connective tissue between them: scheduling, retries, audit logs, observability. The first version went live in October 2024 against a handful of internal workflows. By the time of this writing, the systems have processed roughly 1.4 million agent invocations across content pipelines, code workflows, support triage, research synthesis, and a few odd jobs that do not fit a neat category.
The relevant background numbers, for honesty:
- 18 months of production use, October 2024 through April 2026
- Roughly 1.4 million agent invocations
- Peak monthly LLM spend: $14,200, in February 2025
- Current monthly LLM spend: $6,800, on more than triple the workload
- Tools and MCPs in production today: 47 across all surfaces
- Number of postmortems written: 23
That last number is the only one that actually matters. The agents broke 23 distinct ways that were worth documenting. Most of those breakages are now rules in a runbook. A few of them are now code, because the runbook entries kept being violated. The rest of this post is what those postmortems taught me.
Before deploying anything, I had read the literature. I expected hallucination to be the dominant failure mode. The early papers framed AI agents as fundamentally creative systems that occasionally invented facts, and the research community had built a small industry around hallucination benchmarks. So I built monitoring around it. I wrote prompts that pushed models toward citation-rich outputs. I assumed I would spend a meaningful fraction of my time on factuality regressions.
Almost none of this turned out to be the problem. Hallucination was a real concern for the first three weeks, before I learned to constrain outputs and ground them in retrieved context. After that, I can count the number of factual hallucinations that reached production users on one hand. Reasoning failures happened, but they were rare, and they were obvious when they did: they failed loudly and the system rolled back.
The actual failures were boring. They were not about whether the model could think. They were about what happens when you put a thinking system inside a real environment with real tools, real users, real budgets, and real timeouts. The model was not the problem most of the time. The system around the model was.
Five things that broke
These five categories account for 21 of the 23 postmortems. They are not in priority order. They are in the order I encountered them.
1. Silent compaction corruption
The single most expensive failure mode I have seen in production is also the hardest to detect. The agent never throws an error.
In month six, I noticed that a long-running research agent had started producing summaries that contradicted the source documents. Not by much. A date shifted. A name swapped between two people in the same paper. A claim attributed to the wrong study. Each individual error looked like a bad model day. It was not. The agent’s context had been auto-compacted three times during the run, and each compaction had collapsed the citation map into a paraphrased summary that no longer pointed back to the original passages. The agent was operating with confidence on a story it had partially forgotten.
A March 2026 analysis from VentureBeat put a number on this: nearly 65% of enterprise AI failures studied in 2025 traced to context drift or memory loss during multi-step reasoning, not to raw context exhaustion. “Context Blindness” was the single largest failure mode in a survey of 591 production incidents from 2023 to 2026, accounting for 31.6% of cases. The most expensive AI failure described in that report produced no error, no alert, and no red dashboard. The system was fully operational, and consistently, confidently wrong.
That is the version I lived. The agent was up. The dashboards were green. The output looked clean. The post-compaction reasoning had drifted from the pre-compaction facts, and nothing in the standard observability stack caught it.
The fix was three changes, in this order. First, every long-running task gets broken into specialist segments with fresh context windows, communicating through structured handoff manifests rather than free-form summaries. Second, every checkpoint writes a compact artifact (file scope, decisions, open questions, freshness timestamp) that the next segment reads from disk. Third, age-tagged working notes get re-validated against the source if they have been alive across more than ten messages. The pattern is the same one Anthropic describes in its own agent autonomy research: expert users do not flip a switch and walk away; they let the agent run while watching for drift, and they intervene about 9% of the time. Drift is the universal background failure. Compaction is the mechanism that makes it silent. Three changes, three months of vigilance, one quiet system.
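A minimal sketch of the checkpoint artifact described in the second change, assuming a JSON file on disk. The `Checkpoint` class, its field names, and the two helper functions are hypothetical illustrations, not Maestro's actual format.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class Checkpoint:
    """Compact artifact a segment writes before handing off (hypothetical shape)."""
    file_scope: list[str]
    decisions: list[str]
    open_questions: list[str]
    written_at: float = field(default_factory=time.time)  # freshness timestamp


def write_checkpoint(path: Path, cp: Checkpoint) -> None:
    # Persist to disk so the next segment starts from a fresh context window
    # and reads recorded facts rather than a paraphrased summary.
    path.write_text(json.dumps(asdict(cp), indent=2))


def read_checkpoint(path: Path) -> Checkpoint:
    # Checkpoint(**data) raises TypeError when a required field is missing or
    # an unknown field is present, so a malformed handoff fails loudly.
    return Checkpoint(**json.loads(path.read_text()))
```

The point of the round trip through disk is that nothing survives a segment boundary unless it was written down in a named field.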
2. Retry storm cost explosion
The 2 AM call I opened with was not an isolated event. It was the loudest example of a class of failures that I now think of as the dominant cost-side risk in production agents.
A retry on a deterministic API is a small cost. A retry on a model invocation that brings a fresh context window, re-runs upstream lookups, re-summons every tool, and re-emits its full reasoning trace is a multiplier on a multiplier. When the retries stack on a job that is fundamentally unrecoverable, you do not get a failure. You get an invoice.
Mine was small compared to what is now in the public record. A multi-agent data pipeline documented in 2026 got stuck in a retry spiral for eleven days, burning $47,000 in API calls before anyone noticed, with traditional monitoring reporting “SYSTEM NOMINAL” the entire time. A separate February 2026 incident saw a data enrichment agent misinterpret an API error code as “try again with different parameters” and run 2.3 million API calls over a weekend, costing $47,000. In both cases, the agent was technically functioning. The system around it had no concept of “this is a runaway loop, stop.”
Our retry storm was less dramatic, but the mechanism was identical. A workflow had hit a transient downstream error. The orchestrator’s default retry policy was exponential backoff with no ceiling on the number of attempts. The agent’s default behavior on retry was to start over with a fresh prompt that included the original instruction. There was no idempotency key, no “this is the third attempt, do not start over” signal, no maximum total attempts, and no spend limit per job.
By the time I noticed, the retry counter was at 312.
The fix is unsexy and important. Every agent invocation now runs against:
- A maximum total retry count (we use 3, with a hard ceiling)
- A maximum total spend per job (job-class-specific budgets)
- A dead-letter queue for unrecoverable jobs
- An idempotency signal in the prompt that tells the model “you have tried this N times, do something different or stop”
That last one matters more than people think. Models will dutifully try the same approach a fourth and fifth time if you do not tell them they are repeating. Telling them changes the strategy. It does not eliminate retries. It eliminates retries that are the same retry.
I now treat any agent system without an explicit retry budget as broken in the same way a web server without a request timeout is broken. It will not always fail. But when it fails, it will fail in the most expensive way available.
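The four guards listed above can be sketched in one loop. This is an illustration under stated assumptions, not the orchestrator's real code: `Result`, `RetryBudget`, `run_with_budget`, the `invoke` callable, and the $2.00 cap are all hypothetical, and the dead-letter queue is a plain list.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    ok: bool
    output: str
    cost_usd: float


@dataclass
class RetryBudget:
    max_retries: int = 3         # the retry ceiling mentioned above
    max_spend_usd: float = 2.00  # hypothetical per-job spend cap


class Unrecoverable(Exception):
    """The job exhausted its budget and was dead-lettered."""


def run_with_budget(
    task: str,
    invoke: Callable[[str], Result],
    budget: RetryBudget,
    dead_letters: list[str],
) -> str:
    spent = 0.0
    # One initial attempt plus at most max_retries retries.
    for tries_so_far in range(budget.max_retries + 1):
        prompt = task
        if tries_so_far > 0:
            # Tell the model it is repeating so it changes strategy.
            prompt += (
                f"\n\nYou have already tried this task {tries_so_far} time(s). "
                "Do something different or stop."
            )
        result = invoke(prompt)
        spent += result.cost_usd
        if result.ok:
            return result.output
        if spent >= budget.max_spend_usd:
            break  # spend cap reached before the retry cap
    dead_letters.append(task)
    raise Unrecoverable(f"{task!r} dead-lettered after ${spent:.2f}")
```

Both ceilings are enforced outside the model, so a job that cannot succeed ends in the dead-letter list instead of on the invoice.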
3. Tool call cascade with no policy gate
This is the failure mode that should have been the easiest to predict and was the hardest to fix.
Models with tool access are powerful. Models with tool access and no governance are dangerous in ways that scale with the tool surface. We added a tool. The tool worked. We added another tool. The combination of two tools created behaviors neither tool’s documentation described. We added a third tool. Same story.
Anthropic’s own internal incident log, which the company described publicly in late 2025, includes examples of agents deleting remote git branches from misinterpreted instructions, uploading auth tokens to the wrong cluster, and attempting migrations against production databases. None of those were model-capability failures. They were the result of the model being overeager and taking initiative in a way nobody had intended, in a system that did not have a structural gate to stop it.
Our worst version cost us a real customer hour to clean up. An agent had two tools: one that read customer records, and one that wrote to customer records. A user’s prompt, processed normally, somehow ended up with the agent reading record A, deciding it was relevant context, and writing the read content into record B. No malice. No prompt injection. Just the model interpolating across two tool results in a way no one had considered when the tools were added.
The lesson is this: tool governance cannot live in the prompt. The model will not reliably follow rules that say “do not write fields that you read from a different record.” Those rules belong outside the model.
We built the Govyn proxy because of this incident. I covered the underlying architecture in What is an AI agent policy engine, but the short version is: every tool call now passes through a proxy layer that enforces per-tool allowlists on which fields can be written, per-pair rules on what is allowed to flow from one tool to the next, approval gates on writes that touch human-readable PII, and an audit log that captures the prompt, the tool inputs, the tool response, and the post-conditions.
The proxy adds latency. It is worth it. It is also the only thing that lets me sleep when an agent has 47 tools available to it.
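To make the gate concrete, here is a toy version of the checks a proxy of this kind might run on a write. It is a sketch, not Govyn's API: `ToolCall`, `check`, the policy tables, and the tool and field names are hypothetical, and the audit entry records only a subset of what the real log captures.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    tool: str
    record_id: str
    fields: dict[str, str]  # field name -> value to write
    # Records whose content fed this call (tracked by the proxy, not the model).
    source_record_ids: set[str] = field(default_factory=set)


# Hypothetical policy tables; a real proxy would load these from config.
WRITE_ALLOWLIST = {"update_customer": {"notes", "status", "email"}}
PII_FIELDS = {"email", "phone"}


def check(call: ToolCall, audit_log: list[dict]) -> str:
    """Return "allow", "deny", or "needs_approval" and record the decision."""
    allowed = WRITE_ALLOWLIST.get(call.tool, set())
    if not set(call.fields) <= allowed:
        decision = "deny"            # field is not on this tool's allowlist
    elif call.source_record_ids - {call.record_id}:
        decision = "deny"            # content was read from a different record
    elif PII_FIELDS & set(call.fields):
        decision = "needs_approval"  # human gate on PII writes
    else:
        decision = "allow"
    audit_log.append({
        "tool": call.tool,
        "record": call.record_id,
        "fields": sorted(call.fields),
        "decision": decision,
    })
    return decision
```

Under these rules the record A to record B incident is rejected by the second branch, with no reliance on the model following a prompt instruction.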
4. Multi-agent handoff drift
The single largest source of pain after the cost stuff was not any individual agent failing. It was agents handing off to each other.
Specialist A would do its job correctly. Specialist B would do its job correctly. The handoff between them would lose information that neither A nor B had been instructed to preserve, because the contract between them was implicit. A short summary in natural language, generated by A and consumed by B, with no schema enforced.
The MAST study (NeurIPS 2025) put a number on this in academic terms: 79% of multi-agent failures come from coordination breakdowns, not model capability. That number was not abstract for me. We saw it most painfully in a content pipeline that ran research, draft, and review as separate agents. The researcher would surface five citations. The drafter would receive a four-sentence summary. The reviewer would receive the draft alone. By the time the reviewer flagged a missing citation, the citation had been gone for two agents. The information existed in the system. It just was not where it needed to be.
The fix was not to merge the agents. It was to introduce a structured artifact between them: a JSON schema for handoffs, with explicit fields for citations, claims, open questions, and disagreements. Once we did that, our cross-agent failure rate dropped by something close to 70%. MetaGPT’s authors had reported a similar improvement in their 2023 paper. We were re-discovering what they had documented, the hard way, in production.
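A sketch of what such a handoff contract could look like, using the four fields named above. The `Handoff` class, the shape of a claim, and the `validate` rule are assumptions for illustration; the real schema is not reproduced in this post.

```python
from dataclasses import dataclass, field


@dataclass
class Handoff:
    citations: list[str]
    # Each claim is {"text": ..., "citation": index into `citations`}.
    claims: list[dict]
    open_questions: list[str] = field(default_factory=list)
    disagreements: list[str] = field(default_factory=list)


def validate(h: Handoff) -> list[str]:
    """Return contract violations; an empty list means the handoff is valid."""
    errors = []
    for i, claim in enumerate(h.claims):
        idx = claim.get("citation")
        if not isinstance(idx, int) or not 0 <= idx < len(h.citations):
            errors.append(f"claim {i} has no valid citation")
    return errors
```

A drafter that receives this object instead of a four-sentence summary cannot lose a citation without the validator saying so at the seam.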
I will repeat this lesson because it is the most important one in the whole post: the contract between agents is the architecture. If your contract is “Agent B reads whatever Agent A wrote,” you do not have a multi-agent system. You have a game of telephone with a billing meter. I wrote about the architectural shape of this in Why Your Multi-Agent AI System Keeps Failing; the one-paragraph version is that a thin orchestrator routing structured handoffs between scoped specialists outperforms hierarchical “boss” patterns and free-form peer messaging in every workload I have measured. Define the schema. Enforce it. Stop guessing.
5. Trust calibration mismatch
The fifth failure took longer to recognize because it did not look like a system bug. It looked like a culture problem.
After the first six months of running agents, the team had developed an implicit trust model. Some outputs got reviewed line by line. Others got skimmed. Others got accepted on the strength of “the agent has been good at this lately.” Nobody had written this trust model down, and nobody had checked whether the trust map matched the agent’s actual reliability map.
The result was a class of incident I now call calibration mismatch. The agent was strong on tasks the team treated as risky (so reviews were thorough and slow). The agent was weak on tasks the team treated as routine (so bad output shipped). The supervision tax was concentrated on the tasks where it least helped.
The METR study from July 2025 made this concrete in a different domain. METR ran a randomized controlled trial of 16 experienced open-source developers using AI tools and found that the developers took 19% longer with AI assistance, despite predicting they would be 24% faster and reporting after the study that they felt 20% faster. That is a 43-percentage-point swing between prediction and measurement, and a 39-point gap between what the developers felt afterward and what the clock recorded. Trust calibration in the wild is bad enough that humans cannot self-report whether their tools are helping them.
The fix was per-task confidence routing. Every agent task class now has a documented expected reliability rate, a documented review depth required at that rate, and a flag that fires when the actual reliability drifts more than 10% from expected. The review depth is driven by the data, not the gut. Teams that run agents at scale and skip this end up with the same shape of mismatch I had: thorough reviews where reliability is high, lazy reviews where it is low, and a shipped-bug rate that does not match anyone’s mental model. The Augment Code analysis of the METR results came to the same conclusion: the issue is rarely the model’s accuracy; it is the absence of structured workflows that match supervision intensity to actual error rates.
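The drift flag is simple enough to sketch. Two assumptions are baked in and labeled: the 10% threshold is treated as ten percentage points (absolute), and reliability is measured as a pass rate over reviewed outputs. `TaskClass` and `drift_flag` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TaskClass:
    name: str
    expected_reliability: float  # documented pass rate, between 0 and 1
    review_depth: str            # e.g. "skim" or "line-by-line"


DRIFT_THRESHOLD = 0.10  # assumption: ten percentage points, absolute


def drift_flag(task: TaskClass, passed: int, total: int) -> bool:
    """True when measured reliability has drifted past the threshold."""
    if total == 0:
        return False  # no reviewed outputs yet, nothing to compare
    actual = passed / total
    return abs(actual - task.expected_reliability) > DRIFT_THRESHOLD
```

The check is symmetric on purpose: a task class that turns out far more reliable than documented is also a signal, because its review depth is probably wasting time.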
Three things that worked unexpectedly
If five things broke, three things worked better than I had any right to expect. These are the patterns I would deploy first if I were starting over today.
1. The working notes pattern
The pattern is simple. The first time a specialist agent reads a file or runs a tool query that takes meaningful tokens, it produces a structured working note: file purpose, key facts, dependencies, open questions, and a freshness timestamp. The note becomes the agent’s reusable context. Subsequent tasks reference the note rather than re-reading the source.
The benefits showed up faster than I expected. Long workflows saw a 40% to 60% reduction in token usage. Reasoning got more consistent across multiple invocations on the same problem. And the working notes themselves became reviewable artifacts for human operators, which made debugging dramatically easier.
The pattern is borrowed from how research scientists actually work. They do not re-read papers every time they need to cite them. They keep notes. The model, given the same affordance, behaves more like a careful researcher and less like a junior who Googles everything fresh each time.
The discipline that makes this work is age-tagging. A working note from twenty messages ago might be stale. We mark every note with the message count at which it was written, and any note that has been alive across more than ten messages gets re-validated against the source before the agent acts on it. Without age-tagging, the pattern degrades into stale-context errors. With it, it is the most useful single change we have made.
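A small sketch of age-tagging, assuming age is counted in messages and a note is refreshed by re-reading its source. `WorkingNote`, `needs_revalidation`, `current_note`, and the `reread` callable are hypothetical names for illustration.

```python
from dataclasses import dataclass
from typing import Callable

MAX_AGE_MESSAGES = 10  # re-validate once a note has lived past ten messages


@dataclass
class WorkingNote:
    source: str              # path or query the note summarizes
    purpose: str
    key_facts: list[str]
    dependencies: list[str]
    open_questions: list[str]
    written_at_message: int  # message count when the note was taken


def needs_revalidation(note: WorkingNote, current_message: int) -> bool:
    return current_message - note.written_at_message > MAX_AGE_MESSAGES


def current_note(
    note: WorkingNote,
    current_message: int,
    reread: Callable[[str, int], WorkingNote],
) -> WorkingNote:
    # Stale notes are rebuilt from the source; fresh ones are reused as-is,
    # which is where the token savings come from.
    if needs_revalidation(note, current_message):
        return reread(note.source, current_message)
    return note
```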
2. Scoped specialists plus a thin orchestrator
The DyLAN paper (COLM 2024) reported 3 optimized agents outperforming 7 on the same tasks. The Google DeepMind 2025 study, “Towards a Science of Scaling Agent Systems,” tested 180 multi-agent configurations and found that coordination gains plateau at 3-4 agents, with tool-heavy tasks suffering disproportionately from multi-agent overhead. I had read both before deploying anything. I still under-applied the lesson.
My first instinct was to write big, sophisticated agent prompts that handled many cases. The prompts grew. They became hard to debug. They started to drift, where adjusting one rule for one case would move the model’s behavior on five other cases I had not noticed.
The fix was to split the agent into smaller, scoped specialists, each with a focused prompt and a tightly bounded responsibility, with a thin orchestrator routing work between them through structured handoffs. The orchestrator does not plan, does not review quality, and does not write code; it spawns, sequences, detects cross-talk, routes context, and delivers. The specialists do not see each other’s prompts; they consume the structured artifact the orchestrator hands them.
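A thin orchestrator in this sense fits in a few lines. The sketch below covers only sequencing and contract checking, and assumes specialists are plain callables that take and return a dict; `orchestrate` and the stage tuple layout are hypothetical, not Maestro's interface.

```python
from typing import Callable

Artifact = dict  # the structured handoff passed between stages
Specialist = Callable[[Artifact], Artifact]


def orchestrate(
    task: Artifact,
    pipeline: list[tuple[str, Specialist, set[str]]],
) -> Artifact:
    """Run specialists in order; each stage must return its required fields."""
    artifact = task
    for name, specialist, required_keys in pipeline:
        # The specialist sees only the artifact, never another agent's prompt.
        artifact = specialist(dict(artifact))
        missing = required_keys - set(artifact)
        if missing:
            raise ValueError(f"{name} dropped required fields: {sorted(missing)}")
    return artifact
```

Note what is absent: no planning, no quality review, no content generation. The orchestrator only routes and refuses to pass along an artifact that breaks the contract.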
The numbers from our deployment match the literature with surprising fidelity. Two specialists vs one generalist gave 18% better task completion at slightly higher token cost. Three specialists vs two gave 12% better at neutral cost. Four specialists vs three gave 4% better at slightly higher cost. Five specialists vs four was worse, with notably higher token cost.
The plateau is real. I now treat “do I need a fifth agent” as a question that should almost always be answered “no, redesign one of the four.”
3. Cost-aware retry budgets
This is the pattern I was most reluctant to build, and it had the largest direct impact on cost.
Every agent invocation runs against a job-class-specific retry budget that combines a maximum total spend, a maximum retry count, and a dead-letter queue. The model is told, in its prompt, what budget it is operating under. The orchestrator enforces the budget regardless of what the model decides.
The interesting thing was not the cost saving from preventing runaway loops. That was real, but it would have happened anyway from any naive retry cap. The interesting thing was that the budget changed model behavior in a way I had not predicted. When the model is told “you have one retry left and a $0.50 budget for this task,” it behaves more carefully. It checks its assumptions. It picks a smaller approach. It asks for clarification more often. The budget functions as a steering signal, not just a hard limit.
This has an analog in human work. People given an explicit time budget for a task work differently than people given an open-ended one. The agent does too. The budget is not an enemy of quality; it is a tool that biases the model toward focused work.
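The steering half of the pattern is just a sentence rendered into the prompt from the job's remaining budget. A sketch, assuming a per-job-class table; `JOB_CLASS_BUDGETS`, its values, and `budget_preamble` are hypothetical.

```python
# Hypothetical job-class budgets; real values would be tuned per workload.
JOB_CLASS_BUDGETS = {
    "doc_review": {"max_retries": 3, "max_spend_usd": 2.00},
    "support_triage": {"max_retries": 1, "max_spend_usd": 0.25},
}


def budget_preamble(job_class: str, retries_used: int, spent_usd: float) -> str:
    """Render the remaining budget as a line to prepend to the prompt."""
    budget = JOB_CLASS_BUDGETS[job_class]
    retries_left = max(budget["max_retries"] - retries_used, 0)
    remaining = max(budget["max_spend_usd"] - spent_usd, 0.0)
    noun = "retry" if retries_left == 1 else "retries"
    return (
        f"You have {retries_left} {noun} left and "
        f"a ${remaining:.2f} budget for this task."
    )


print(budget_preamble("doc_review", retries_used=2, spent_usd=1.50))
# prints: You have 1 retry left and a $0.50 budget for this task.
```

The preamble informs the model; it does not replace enforcement, which stays in the orchestrator.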
How the cost curve actually moved
The piece of this story I have heard the least about from other practitioners is the cost shape. Vendors describe AI agent costs as if they are a consumption metric: more usage, more spend, simple. In production, the cost curve has a much more specific shape, and it is one I would have planned around if anyone had warned me.
The arc has three phases.
Months 1 to 3: spend climbs faster than usage. Early on, every workflow you add costs more than the workflow that came before, because you have not yet built the supporting infrastructure that makes agents cheap. There is no caching layer. There are no retry budgets. There are no fallback chains. Every invocation hits the most expensive model and burns the most tokens. We tripled spend in three months while only doubling workload. The vendor’s pricing page does not warn you about this; it assumes you have already built the infrastructure that turns super-linear scaling into linear scaling.
Months 4 to 9: spend stabilizes as governance catches up. This is the least dramatic phase and the most important. Every infrastructure piece you add (prompt caching, structured retry budgets, model routing, working notes) takes a percentage off the bill. None of them individually is a 5x improvement. Together, they keep total spend roughly flat while you add capacity. We held spend at $9k to $12k for six months while the workload grew by something like 80%. That is the quiet success state. It does not look like a win until you see how steep the climb would have been without the changes.
Months 10+: spend declines as cheaper models displace expensive ones. This is the phase that surprised me. Once we had reliable evaluation infrastructure for our agent workflows, we could test whether a cheaper model could handle a given subtask. The answer, more often than I expected, was yes. Roughly 70% of our agent invocations now run on a model that costs a fraction of what we paid in month five. Frontier models stay reserved for the hardest cases and the user-visible final outputs. The cost-to-quality ratio of 2026’s mid-tier models on our workload is dramatically better than 2024’s frontier was, and the gap is still widening.
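A routing rule of this kind can be sketched as "cheapest tier that clears the eval bar". Everything named here is an assumption for illustration: the tier names, the `EVAL_PASS_RATES` table and its numbers, and the `pick_tier` function.

```python
# Hypothetical eval results: task class -> pass rate per model tier.
EVAL_PASS_RATES = {
    "summarize": {"small": 0.97, "mid": 0.98, "frontier": 0.99},
    "legal_review": {"small": 0.71, "mid": 0.88, "frontier": 0.96},
}
TIERS_CHEAPEST_FIRST = ["small", "mid", "frontier"]


def pick_tier(task_class: str, min_pass_rate: float, user_visible: bool = False) -> str:
    """Return the cheapest tier whose measured pass rate clears the bar."""
    if user_visible:
        return "frontier"  # final user-facing outputs stay on the top tier
    rates = EVAL_PASS_RATES.get(task_class, {})
    for tier in TIERS_CHEAPEST_FIRST:
        if rates.get(tier, 0.0) >= min_pass_rate:
            return tier
    return "frontier"  # unmeasured or hard task classes default upward
```

The default matters: a task class with no eval data goes to the expensive tier, so a cheaper model only takes over work it has been measured on.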
The number that surprised me most: at 18 months, we were running 3.2x the workload of month five at less than half the spend. None of that came from the vendor lowering prices. All of it came from our infrastructure improving. I covered the curve from a different angle in AI Is Running Out of Power, Data, and Quality, where the macro picture maps onto the same shape: the cost-out path is engineering, not procurement.
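Spelling out the arithmetic shows what the headline numbers imply per unit of work. This assumes February 2025 (the peak month) is month five, counting October 2024 as month one, and uses the spend figures listed near the top of the post.

```python
peak_spend = 14_200      # USD per month at the February 2025 peak
current_spend = 6_800    # USD per month at 18 months
workload_multiple = 3.2  # current workload relative to month five

spend_ratio = current_spend / peak_spend
unit_cost_ratio = spend_ratio / workload_multiple

print(f"total spend vs. peak: {spend_ratio:.2f}")
print(f"cost per unit of work vs. peak: {unit_cost_ratio:.2f}")
# prints: total spend vs. peak: 0.48
# prints: cost per unit of work vs. peak: 0.15
```

On these figures, each unit of work costs roughly 15% of what it did at the peak, which is a bigger change than the halved bill suggests.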
If you are at the start of an agent deployment, budget for an aggressive climb in the first six months and plan to bend the curve through engineering, not through procurement. The vendors will not fix this for you. Nor should they.
What I would tell my past self
If I could send a single message back to October 2024, it would be a list of five things. Each of them is a sentence I now repeat to other teams who are deploying their first agents.
Build the proxy before the second agent. The first agent does not need governance infrastructure. The second one does. By the time you have three agents and seven tools, retrofitting governance is harder than building it from the start. Build it before the second agent ships and treat the absence of it as an unshipped dependency.
The contract between agents is the architecture. Not the agents. Not the prompts. The contract. If two agents communicate through unstructured prose, your system will fail at the seam, every time, in ways that look like model failures but are actually missing schemas. Define the schema. Enforce it.
Retry budgets are non-negotiable. Every agent invocation runs against a maximum total spend, a maximum retry count, and a dead-letter queue. There is no production deployment without all three. The cost of building this is one afternoon. The cost of skipping it is an unbounded number of 2 AM calls.
Smaller, specialized agents beat one big one. Stop writing mega-prompts. Three small agents with focused responsibilities and a structured handoff outperform one big agent every time, on every workload I have measured. The DyLAN paper called it. The DeepMind paper measured it. Production confirms it.
Plan for the cost climb, then plan to bend it with engineering. Spend will rise faster than usage in the first 90 days. This is not a sign of failure; it is the cost of running production agents before you have built the surrounding infrastructure. Budget for it explicitly, measure unit economics from day one, and assume that bending the curve back down will be your work, not the vendor’s. The teams that get hurt by this are the teams that did not name a budget and were surprised by the invoice.
The deeper version of all of this lives in The Automation Paradox: the J-curve is real, the dip is engineering work in disguise, and the teams that abandon the deployment in the trough are the teams that did not budget for the climb back out.
What I am still wrong about
If I have learned anything in 18 months, it is to be suspicious of anyone (including me) who claims to have AI agents figured out. There is a category of question on which I have changed my mind multiple times, and there is a category on which I am still going to be wrong.
The first one I am probably still wrong about: how much of this generalizes to teams that are not me. My deployment is small, my domain is narrow, and the people writing the code are people I trust to read postmortems carefully. A team of forty engineers shipping agents into a regulated industry will hit failure modes I have never seen. Some of my advice will be wrong for them. The honest version is “this is what worked at my scale; expect your scale to add categories I do not cover.”
The second one: I keep underweighting how fast the underlying models improve. Every six months I have a list of patterns that work because the model is bad at something. Every six months that list gets shorter, because the model is no longer bad at that thing. Anthropic’s own agent autonomy research frames this in a way I now use as a checklist: autonomy is co-constructed by model, user, and product. The model side moves on its own timeline. My infrastructure should not assume the model will be the same six months from now.
The third one: I am still unsure how much governance overhead is healthy. The proxy adds latency. The retry budgets add operational friction. The structured handoffs add development time. At some point on the curve, more governance starts costing more than it saves. I have not hit that point yet, but I cannot draw the line in advance.
The fourth one: I do not yet know where the floor is on cost. We have cut spend in half over twelve months on triple the workload. There is probably more to cut. There is probably also a point where additional cuts make the system fragile, and I have no clean way to predict which optimization moves us closer to that point.
The fifth and last: I am uncertain whether what I have learned about decomposition holds for fully autonomous workflows. My systems still have a human in the loop on most consequential decisions. I have run a handful of fully autonomous workflows and they have mostly worked, but the sample size is too small to make general claims. The teams running larger autonomous deployments report failure modes I have only seen in passing.
Anyone who tells you they have AI agents figured out is selling you something. So am I, in a sense. The right framing is: this is what eighteen months taught me. The next eighteen will rewrite some of it.
What is next
Forward-looking sections in posts like this tend to age badly, so I will keep this short and concrete.
In the next 6 months, I expect three things on my own roadmap. The model routing layer will keep getting more aggressive about pushing work to cheaper tiers, because the gap between cheap models in 2026 and frontier models in 2024 is now wide enough that most tasks can run on the smaller side without quality loss. The proxy will pick up more deterministic policy primitives (rate limits, content filters, per-customer egress rules) that are currently approximated in prompts. And the orchestrator will move from a single-process design to a small queue-backed worker pool, because the next step in cost engineering is to share caches and idempotency state across runs.
In the next 12 to 18 months, the questions I am paying attention to are: whether sub-agent specialization will hold up as base models gain longer effective context, whether cheaper models will catch up to the point where frontier models become a reserve resource, and whether the operational discipline that pays off at small scale survives translation to teams of forty or a hundred.
The thing I am not betting on is the next big architectural breakthrough. Most of the gains in my system over the last twelve months came from boring engineering applied carefully, not from a paradigm shift. I expect the next twelve to look the same. If I am wrong, I will write the post.
The boring middle
The hardest thing to convey to anyone deploying agents for the first time is the shape of the destination. The interesting moments are the failures and the breakthroughs. Most of the eighteen months was neither.
Most of it was iterating on a runbook. Adding a check, removing a check, tightening a budget, loosening a permission, fixing a race condition in a retry handler, watching a metric for two weeks before deciding it had stabilized. The work that pays off is unglamorous. It looks like normal engineering, because that is what it is.
The agents I run today are not magic. They are a set of small, scoped LLM invocations behind a lot of governance, observability, and operational discipline. They handle work I used to do, and they free me to spend my time on work that requires me. They occasionally surprise me. They occasionally embarrass me. Most of the time, they do their job and produce a log that I scan and approve and forget about.
The destination is a system that mostly works, where the failures are noisy enough to catch and rare enough to learn from, and where the cost is sustainable. That is a less exciting story than the marketing suggests. It is also the only one that has matched my reality.
If you are starting your own deployment, I will leave you with the one piece of advice that I think contains all the others. Production agents are an engineering problem, not a model problem. The model is a piece of the system. The system is what determines whether you ship something useful or burn down a budget at 2 AM. Build the system first. The model will keep getting better. Your operational discipline is what determines whether you are ready to take advantage of each upgrade.
Eighteen months in, the systems are quiet. The bill is reasonable. The work gets done. That is the goal. More boring, more rewarding, and more achievable than the people selling you agent platforms want you to believe. The systems pass the two-year mark this October, so the coming six months will tell whether the patterns hold that long. I am grateful for the teams who shared their incident logs with me, and for everyone still doing this work in public so the rest of us can learn from each other.
Disclosure: I build Maestro (open-source agent orchestration) and Govyn (agent governance proxy). The lessons in this post informed both, and the products embody what those lessons taught me. Read the research papers linked above, run your own experiments, and treat any retrospective, including this one, as a starting point rather than a destination.
Advisor, founder, and executive producer with 25+ years building technology companies, gaming platforms, and entertainment products. Based in Portugal.