Why do AI agents initially decrease productivity?

Three compounding factors: the supervision tax (humans monitoring agent output), the correction loop (fixing agent mistakes costs more than doing the task manually), and trust calibration (teams learning when to trust and when to override). These costs front-load before the agent learns your patterns and the team learns the agent's capabilities.

How long does the productivity dip last when deploying AI agents?

Research and industry data suggest 2-8 weeks for task-specific agents, 1-3 months for workflow agents, and 3-6 months for autonomous multi-agent systems. The duration depends on task complexity, feedback loop speed, and governance maturity.

What accelerates the transition from productivity loss to productivity gain?

Three levers: structured feedback loops (corrections improve the agent faster), graduated autonomy (start supervised, expand scope as trust builds), and observability (measuring what the agent actually does, not just what it produces).

Is the automation paradox a sign that AI agents don't work?

No. It is the same J-curve observed in every automation wave since the industrial revolution. Bainbridge described the underlying irony for industrial process control and flight automation in 1983. The dip is predictable and manageable. Teams that plan for it reach the crossover point. Teams that panic and abandon agents never do.


22 April 2026 · 30 min read · By Mark Laursen

The Automation Paradox: Why AI Agents Create More Work Before They Create Less

AI agents automation productivity research management

I deployed my first autonomous AI agent system in late 2025. The agent handled code generation tasks: context-aware refactoring across multiple files, writing tests, and fixing bugs from issue descriptions. For the first two weeks, my output dropped. Not a little. Measurably, embarrassingly dropped. I was spending more time reviewing the agent’s work, correcting its mistakes, and second-guessing its decisions than I would have spent doing the work myself.

My instinct was that the tool was broken. That the promises were overblown. That I should go back to writing everything by hand.

I did not. And by week six, my throughput was 30% higher than it had been before the agent. By week ten, closer to 40%. The tool was not broken. I was living through a pattern that Lisanne Bainbridge described in 1983, that Erik Brynjolfsson modeled mathematically in 2021, and that thousands of teams are rediscovering right now as they deploy AI agents into production workflows.

The pattern has a name. It is called the J-curve. And understanding it is the difference between teams that push through to genuine productivity gains and teams that abandon their agent deployments at the worst possible moment: right before the crossover.

What Is the Automation J-Curve?

In 1983, Lisanne Bainbridge published “Ironies of Automation” in the journal Automatica. It has since attracted over 1,800 citations and is considered one of the most influential papers in human factors engineering. Her argument was counterintuitive and, four decades later, still not widely understood: the more advanced an automated system becomes, the more demanding the remaining human role within it.

Bainbridge was studying industrial process control: chemical plants and flight systems where human operators monitored automated systems. She observed that automation does not eliminate human work. It transforms it. The routine tasks disappear. What remains are the exceptional cases, the failures, the edge conditions that require the deepest expertise. And the humans responsible for handling those exceptions have had their skills atrophied by months of passive monitoring. The system that was supposed to make operations more reliable had, through a precise mechanism, made the human safety net less capable.

This is not ancient history. It is playing out right now in every engineering team deploying AI coding agents, every support organization rolling out conversational AI, every content team adopting AI writing tools.

Erik Brynjolfsson, Daniel Rock, and Chad Syverson formalized the broader pattern in their 2021 paper “The Productivity J-Curve” (American Economic Journal: Macroeconomics). They demonstrated that general purpose technologies, the transformative ones like electricity, computing, and now AI, follow a consistent adoption curve. Productivity does not rise immediately. It dips first. The dip comes from the massive complementary investments required: process redesign, workforce retraining, organizational restructuring. These investments are expensed, not capitalized, so they show up as costs before the benefits materialize.

Their model explains why electricity entered factories in the 1890s but productivity gains did not appear until the 1920s. Why computers were everywhere in the 1980s but Robert Solow could quip in 1987, “You can see the computer age everywhere but in the productivity statistics.” And why, in February 2026, a National Bureau of Economic Research study of 6,000 executives found that roughly 90% of firms say AI has had zero measurable impact on either employment or productivity.

Apollo chief economist Torsten Slok recently wrote that “AI is everywhere except in the incoming macroeconomic data.” The Solow paradox, resurrected for a new generation.

The J-curve is not a sign of failure. It is the signature of a technology that is genuinely transformative, one that requires so much complementary change that the short-term costs swamp the short-term gains. The teams that understand this push through. The teams that do not understand it abandon the technology at the bottom of the curve, precisely when the crossover is closest.

The dip has three distinct components, each with its own mechanism and its own timeline. Understanding them separately is what turns the J-curve from an unpleasant surprise into a manageable transition plan.

What Is the Supervision Tax?

Every piece of AI agent output requires human review. That review time comes from the same productive hours the agent is supposed to be augmenting. This is the supervision tax: the cognitive and temporal cost of watching an agent work.

The numbers are worse than most teams expect. A 2024 study by Uplevel examining software engineering teams using GitHub Copilot found no statistically significant increase in pull request throughput, while bug rates increased by 41%. Developers were accepting AI-generated code faster than they could properly review it.

Then came the METR study (2025), which recruited 16 experienced open-source developers working in large repositories they already knew well. The finding that made headlines: developers completed tasks 19% slower with AI assistance. But the detail that matters more is the perception gap. Before the study, participants predicted they would be 24% faster with AI tools. After the study, they still felt they had performed better. A 43-percentage-point swing between perception and measurement.

The supervision tax is not just time. It is cognitive load. Research on code review (IEEE, 2023) shows that reviewing unfamiliar code requires 1.5 to 2.5 times the cognitive effort of writing equivalent code from scratch. The reviewer must reconstruct the author’s reasoning, map it to the codebase, identify potential issues, and verify correctness, all without having built up a mental model through the act of writing. When the “author” is an AI agent, the opacity is even higher. Human authors leave contextual clues in naming conventions and commit messages. Agent output is technically correct-looking but contextually opaque.

Harvard Business Review researchers have identified what they call “AI brain fry”: the cognitive load from sustained supervision of systems that are mostly correct. A procurement analyst supervising an AI agent has a fundamentally murky assignment. Stay alert for mistakes you are not doing the work to see coming. This does not get easier with practice, because the nature of the mistakes changes as the agent improves. You are not watching for the same failures over and over; you are watching for novel failures in a system whose behavior shifts.

The supervision tax hits a crossover point when agent accuracy exceeds the threshold where review time plus correction time is less than manual execution time. For a 30-minute task with a 10-minute review cycle, the math crosses over around 75-80% agent accuracy. Below that, the agent is pure overhead. Most teams do not measure their threshold. They feel the overhead and blame the tool.
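One rough way to find that threshold for your own tasks is sketched below. The cost model is an assumption, not a formula from the studies cited above: every output is reviewed, and a failed output costs some multiple of the manual time to correct. The function name and the `correction_multiplier` parameter are illustrative.

```python
def breakeven_accuracy(manual_min: float, review_min: float,
                       correction_multiplier: float) -> float:
    """Accuracy at which agent-assisted time equals manual time.

    Assumed model (hypothetical):
        expected time = review_min + (1 - accuracy) * correction_multiplier * manual_min
    """
    if review_min >= manual_min:
        raise ValueError("review alone already costs as much as doing the task")
    saved_if_correct = manual_min - review_min
    cost_if_wrong = correction_multiplier * manual_min
    # Clamp at 0: with very cheap corrections the agent is net positive at any accuracy.
    return max(0.0, 1.0 - saved_if_correct / cost_if_wrong)


# 30-minute task, 10-minute review, a few assumed correction multipliers.
for multiplier in (2.5, 3.0, 4.0):
    threshold = breakeven_accuracy(30, 10, multiplier)
    print(f"multiplier {multiplier}: break-even at {threshold:.0%}")
```

Under these assumptions, multipliers between 2.5 and 4 put the break-even between roughly 73% and 83%, which brackets the 75-80% figure. The useful part is less the exact number than seeing how sensitive the threshold is to review time and correction cost.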

Why Does the Correction Loop Cost More Than Starting Over?

When an AI agent makes a mistake, the cost is not equivalent to a human making the same mistake. It is systematically higher. This is the correction cost multiplier, and it is the mechanism that makes the J-curve trough deeper than anyone expects.

When a developer writes buggy code, they have a mental model of what they were trying to do. Debugging their own mistake is a process of checking assumptions against reality. When an agent produces buggy code, nobody has that mental model. The developer must first reconstruct intent (what was the agent trying to do?), then diagnose the error (where did the approach go wrong?), then implement the fix, then verify the fix did not break other parts of the agent’s output, then re-test the whole change.

A task that takes 10 minutes manually becomes a 25-minute correction when an agent does it wrong. That is a 2.5x multiplier, and this is the conservative estimate. For complex, multi-file changes where the agent has made cascading decisions, the multiplier reaches 3-4x because the “understand intent” phase requires tracing the agent’s reasoning across multiple files and understanding how early decisions propagated into later ones.

The Stack Overflow 2024 Developer Survey found that 63% of developers using AI tools reported spending more time reviewing and modifying AI-generated code than they expected. The expected workflow was: agent generates, developer glances, developer ships. The actual workflow is: agent generates, developer studies, developer partially understands, developer rewrites portions, developer tests, developer discovers an edge case the agent missed, developer fixes the edge case, developer tests again.

GitClear’s 2024 analysis of 153 million changed lines of code found a doubling of “churn code” (code revised or reverted within two weeks) coinciding with the rise of AI coding assistants. The code was written faster and thrown away faster. Velocity without durability is not productivity. It is rework disguised as throughput.

This correction cost is where retries compound the damage. When an agent hits an error and retries, then hits a different error and retries again, it can produce a Frankenstein output that addresses three error messages and the original task simultaneously. The correction cost for that output can be 4-5x the original task. I have written about the infrastructure-level defense against this at The Real Cost of AI Agent Retries on the Govyn blog: retries are a symptom of the paradox, and agents that retry are agents in the correction loop.

This is Bainbridge’s irony in its purest modern form. The automation saved time on the 70-80% of cases where it worked correctly. On the 20-30% where it did not, it created work that was harder and more time-consuming than doing it from scratch.

How Does Trust Calibration Affect the J-Curve?

The third component of the dip is the most psychologically complex: learning when to trust the agent and when to verify.

Trust calibration research from CHI 2024 and a 2023 survey in CHI defines the concept precisely: calibrated trust is the alignment between a user’s subjective confidence in a system and the system’s objective reliability. Well-calibrated trust minimizes both overtrust (misuse) and undertrust (disuse). Poorly calibrated trust produces one of two failure modes, both of which are expensive.

Overtrust ships bugs. When developers accept agent output without sufficient review, subtle errors reach production. These are not compiler-catchable syntax errors; they are logic bugs, race conditions, edge cases, and security vulnerabilities that look correct at a glance but fail under specific conditions. The Uplevel study’s 41% increase in bug rates among Copilot users is consistent with systematic overtrust.

Undertrust negates the agent. When developers review every line of agent output with the same scrutiny they would apply to a junior developer’s first pull request, the review time exceeds the time savings from generation. The agent becomes pure overhead. This is the mode most teams start in immediately after deployment, which is why the J-curve dips.

The research reveals an important asymmetry: trust in AI builds slowly and breaks quickly. A 2020 study in Human Factors found that a single high-severity failure can reset trust calibration by weeks. A team that encounters one agent-generated production bug may revert to reviewing everything for two to three weeks before gradually rebuilding calibration. More recent work from Frontiers in AI (2025) confirms the pattern: “Trust formation, error impact, and repair in human-AI systems exhibit asymmetric temporal dynamics. Building trust takes more time compared to building trust in humans; however, when AI encounters problems, trust loss occurs more rapidly.”

This means the cost of a single bad agent output is not just the correction cost. It is the weeks of additional review overhead as the team’s trust calibration resets. The J-curve has a ratchet effect: the trough gets re-deepened by every significant agent failure.

Calibration takes 2-4 weeks of daily use for most teams. During that period, they oscillate between overtrust and undertrust, gradually converging toward the task-specific sweet spot: trust boilerplate, verify architecture, always verify security-relevant output.

When Do AI Agents Become Net Positive?

The crossover point, where cumulative agent-assisted productivity exceeds what the team would have achieved without agents, depends on three variables: agent accuracy, task domain, and team calibration speed.

Here is what the data shows across domains:

Metric	Coding Agents	Support Agents	Content Agents	Workflow Agents
Time to trough	2-3 weeks	1-2 weeks	1-2 weeks	3-5 weeks
Trough depth	-15 to -30%	-10 to -20%	-10 to -15%	-20 to -35%
Time to crossover	4-8 weeks	2-4 weeks	3-5 weeks	6-12 weeks
Steady-state gain	+25-40%	+30-50%	+20-35%	+40-60%
Trust calibration period	4-6 weeks	2-3 weeks	3-4 weeks	6-8 weeks

Sources: Composite from GitHub Copilot studies (2024-2025), Uplevel engineering team analysis (2024), METR task assessment framework (2025), industry reports from Harness and LinearB.

The crossover is not a single moment. It is a gradual transition from net negative to net positive, and it depends on the specific task mix. Support agents cross over faster because the tasks are more constrained and verification is quicker. Workflow agents take longer because the task surface is broader and the consequences of errors are more distributed across systems.

The steady-state gains are real. Teams that push through the J-curve consistently report the numbers above. The paradox is not that AI agents do not work. It is that they work on a different timeline than vendor marketing suggests. The NBER CEO survey found that executives forecast AI will increase productivity by 1.4% over the next three years, even as 90% report no impact yet. They are looking at the same J-curve from the macro level: the investment phase is real, the payoff is real, and the gap between them is where most organizations currently sit.

What Accelerates the Crossover?

The J-curve is not a fixed duration. Teams that understand the paradox and design their deployment accordingly can compress the dip substantially. Three levers consistently accelerate the crossover.

Structured feedback loops

Every correction the team makes is a training signal. Teams that systematically capture their corrections and feed them back into the agent’s configuration, through custom instructions, examples of desired output, and domain-specific constraints, see the agent’s error rate decline faster. Each correction that improves the agent reduces the future correction tax. The compounding effect is significant: a 5% improvement in accuracy per week means the crossover arrives weeks earlier.
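To make the compounding concrete, here is a minimal sketch. It reads "5% per week" as five percentage points of accuracy per week, which is an assumption, and the 60% starting accuracy and 75% crossover threshold are illustrative values rather than measured ones.

```python
def weeks_to_threshold(start_pct: float, threshold_pct: float,
                       weekly_gain_pts: float) -> int:
    """Weeks until accuracy reaches the threshold, given a linear weekly gain."""
    if weekly_gain_pts <= 0:
        raise ValueError("accuracy must improve each week to ever cross over")
    weeks = 0
    accuracy = start_pct
    while accuracy < threshold_pct:
        accuracy += weekly_gain_pts
        weeks += 1
    return weeks


with_feedback = weeks_to_threshold(60, 75, 5)     # corrections fed back weekly
without_feedback = weeks_to_threshold(60, 75, 2)  # slower, ad hoc improvement
print(with_feedback, without_feedback)
```

With these made-up inputs the structured loop reaches the threshold in 3 weeks instead of 8. Real accuracy curves are unlikely to be linear, so treat this as a way to reason about the gap, not a forecast.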

This is the opposite of the “set and forget” anti-pattern. Research on AI agent deployment identifies the most common failure pattern: teams deploy agents, do not invest in feedback mechanisms, watch the agent make the same mistakes repeatedly, and conclude the technology does not work. The agent was not the problem. The absence of a learning loop was.

Graduated autonomy

Starting an agent at full autonomy is the highest-risk deployment pattern. Every task type the agent touches enters the supervision tax simultaneously. The team is overwhelmed with review work across every domain before they have calibrated trust in any domain.

The research-backed alternative is graduated deployment. Start with the lowest-risk, most constrained task type. Let the team build calibration on boilerplate code generation, standard test writing, or routine documentation. Once trust is calibrated for that domain (2-3 weeks), expand to the next domain. This staggers the supervision tax and produces a shallower trough.

Anthropic’s research on measuring agent autonomy (2026) frames this as “autonomy level controls”: the ability to calibrate autonomy for different types of action in different contexts. GitLab’s study of agentic tool adoption found the same pattern: “Power users build trust with the tool over time, apply it to increasingly ambitious tasks, and benefit as the product improves.”

Observability

When a developer can see that the agent made three attempts before producing the current output, or that the agent’s token consumption was unusually high, or that the agent accessed 15 files when the task should only touch 3, they can calibrate their review effort accordingly. Without observability, every piece of agent output looks the same: confident and clean. The developer has no signal for how much trust to place in it.

Agent observability infrastructure that surfaces the agent’s process, not just its output, dramatically reduces the supervision tax. Reviewers can focus on the 20% of outputs most likely to contain errors rather than reviewing 100% uniformly. This single intervention can cut the effective supervision tax by more than half.

The monitoring gap is real. A “successful” HTTP 200 response from an agent might contain completely hallucinated data. An agent might technically be “up” while stuck in an infinite reasoning loop burning $50 per minute. Traditional monitoring does not catch these failure modes. Agent-specific observability is the difference between knowing your agent is running and knowing your agent is working.
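The triage idea in this section can be sketched as a simple risk score over process signals. Everything here is hypothetical: the `AgentRun` fields, the equal weighting of the three signals, and the 20% review fraction are illustrative choices, not the schema of any particular observability tool.

```python
from dataclasses import dataclass


@dataclass
class AgentRun:
    task_id: str
    attempts: int        # tries before the final output
    files_touched: int
    files_expected: int  # reviewer's estimate of files the task should touch
    tokens_used: int
    tokens_typical: int  # typical token use for this task type


def risk_score(run: AgentRun) -> float:
    """Higher means the agent's process looked more unusual (weights are arbitrary)."""
    retry_signal = max(0, run.attempts - 1)
    sprawl_signal = max(0.0, run.files_touched / max(1, run.files_expected) - 1.0)
    token_signal = max(0.0, run.tokens_used / max(1, run.tokens_typical) - 1.0)
    return retry_signal + sprawl_signal + token_signal


def select_for_deep_review(runs: list[AgentRun], fraction: float = 0.2) -> list[AgentRun]:
    """Return the riskiest `fraction` of runs, always at least one."""
    if not runs:
        return []
    count = max(1, round(len(runs) * fraction))
    return sorted(runs, key=risk_score, reverse=True)[:count]


runs = [
    AgentRun("a", 1, 3, 3, 8_000, 8_000),
    AgentRun("b", 3, 15, 3, 30_000, 8_000),  # three attempts, 15 files for a 3-file task
    AgentRun("c", 1, 2, 3, 6_000, 8_000),
    AgentRun("d", 2, 4, 3, 9_000, 8_000),
    AgentRun("e", 1, 3, 3, 7_500, 8_000),
]
flagged = select_for_deep_review(runs)
print([run.task_id for run in flagged])
```

Run "b" is the one flagged for deep review here; the rest would get a lighter pass. A score like this only ranks review effort. It does not detect a clean-looking run that returns hallucinated data, which still needs output-level checks.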

What Delays the Crossover?

Three anti-patterns consistently extend the J-curve trough or prevent the crossover entirely.

No governance, no monitoring

The “set and forget” deployment model. Teams deploy agents, hand them API keys, and check back in a month. Without monitoring, they do not know the agent’s accuracy rate, they do not know which task types produce the most corrections, and they cannot build systematic trust calibration because they have no data to calibrate against.

Gartner predicts that over 40% of agentic AI projects will be canceled by end of 2027, with escalating costs and unclear business value as the primary drivers. These are overwhelmingly projects that deployed without governance infrastructure. The agent did work, but nobody could measure whether it was creating or destroying value, so the project got killed during the trough.

“It should be faster already”

Unrealistic timeline expectations are the leading cause of premature abandonment. Vendor marketing implies immediate productivity gains. When the J-curve produces a dip instead, leadership loses confidence and pulls the plug. According to Deloitte, 42% of companies abandoned at least one AI initiative in 2025, up from 17% the previous year, with the average sunk cost per abandoned initiative reaching $7.2 million.

The MIT Sloan research is even more stark: 95% of generative AI pilots fail to scale to production deployment. Many of these are not technology failures. They are expectation failures. The pilot worked, the dip appeared, and nobody had budgeted for the calibration period.

Adding more agents to fix the dip

When teams hit the J-curve and want to accelerate through it, a common instinct is to add more agents: an agent for code generation, another for code review, another for testing. The theory is that agent review of agent output will reduce the supervision tax.

This amplifies the paradox rather than solving it. Each additional agent carries its own supervision tax. The review agent needs to be reviewed. The testing agent’s test cases need verification. And agent-to-agent coordination introduces entirely new failure categories. I wrote about this extensively: the MAST study (NeurIPS 2025) found that 79% of multi-agent system failures come from coordination breakdowns, not capability gaps. Adding agents to escape the J-curve creates a deeper, wider J-curve with more variables and more coordination failure surfaces.

How Should Teams Plan for the Dip?

If the J-curve is predictable, it is plannable. Here is a practical framework for budgeting the transition.

Set expectations before deployment

Tell the team, and especially leadership, that the first 4-8 weeks will feel slower. Frame this explicitly: “We are investing in calibration. The investment pays off in months 2-3.” This prevents the panic response that kills most agent deployments during the trough.

Show the J-curve diagram. Give specific numbers: “We expect a 15-20% productivity dip for 4-6 weeks, followed by a 25-35% sustained gain.” Having a prediction that matches reality builds confidence that the plan is working, even during the dip.

Reduce sprint commitments during the calibration period

For the first 6-8 weeks, reduce sprint capacity by 20%. This accounts for the supervision tax and correction overhead without requiring the team to work longer hours or cut quality to maintain the pre-agent velocity. The velocity will return and exceed the baseline, but not in week one.

Measure the right metrics

Track these weekly during the transition:

Agent accuracy by task type: What percentage of agent output is accepted without correction? Break this down by task category to identify where the agent is net positive and where it is still net negative.
Review time per task: How long does the team spend reviewing agent output? This should decrease week over week as calibration improves.
Correction cost: When the agent gets it wrong, how long does the correction take? Track separately from review time.
Trust calibration velocity: How quickly is the team converging on the optimal trust point? Are there specific task types where calibration is stuck?
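A minimal sketch of how the first three metrics could be computed from a review log. The record layout and the sample rows are hypothetical; any tracker that captures task type, acceptance, review minutes, and correction minutes would work.

```python
from collections import defaultdict
from statistics import mean

# Each record: (task_type, accepted_without_correction, review_min, correction_min)
log = [
    ("boilerplate", True, 3, 0),
    ("boilerplate", True, 4, 0),
    ("boilerplate", False, 5, 12),
    ("refactor", False, 15, 40),
    ("refactor", True, 12, 0),
]


def weekly_metrics(records):
    """Accuracy, average review time, and average correction cost per task type."""
    by_type = defaultdict(list)
    for task_type, accepted, review_min, correction_min in records:
        by_type[task_type].append((accepted, review_min, correction_min))

    report = {}
    for task_type, rows in by_type.items():
        corrections = [c for accepted, _, c in rows if not accepted]
        report[task_type] = {
            "accuracy": sum(1 for accepted, _, _ in rows if accepted) / len(rows),
            "avg_review_min": mean(r for _, r, _ in rows),
            # Correction cost is averaged over failed outputs only.
            "avg_correction_min": mean(corrections) if corrections else 0.0,
        }
    return report


report = weekly_metrics(log)
for task_type, metrics in report.items():
    print(task_type, metrics)
```

Trust calibration velocity is not computed here; one way to approximate it is to run this report every week and watch how quickly `avg_review_min` falls for each task type.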

Define stage gates for autonomy expansion

Do not expand the agent’s scope until the current scope is net positive. If the agent is handling boilerplate code generation and the team’s review time for that task type has dropped to under 5 minutes, expand to test generation. If test generation reaches the same threshold, expand to refactoring. This prevents the compound supervision tax that comes from expanding into all task types simultaneously.
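The gate can be written down as a small rule so that expansion decisions are not made ad hoc. This is a sketch: the stage order and the 5-minute threshold come from the paragraph above, while the function name and the `net_positive` flag (however your team decides to measure it) are assumptions.

```python
from typing import Optional

# Expansion order taken from the text; adjust to your own task types.
STAGES = ["boilerplate", "test generation", "refactoring"]


def next_stage(current: str, avg_review_min: float, net_positive: bool,
               review_threshold_min: float = 5.0) -> Optional[str]:
    """Return the next task type to unlock, or None to stay at the current scope."""
    if not net_positive or avg_review_min >= review_threshold_min:
        return None  # gate not passed yet
    position = STAGES.index(current)
    if position + 1 < len(STAGES):
        return STAGES[position + 1]
    return None  # already at the widest scope


print(next_stage("boilerplate", avg_review_min=4.0, net_positive=True))
print(next_stage("boilerplate", avg_review_min=6.0, net_positive=True))
```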

Budget for the trough, invest in the crossover

The total cost of the J-curve dip, measured in reduced productivity, is typically 2-4 weeks of a single developer’s output. This is the investment. The return, measured in sustained productivity gain, compounds every week after the crossover. At a 30% productivity gain, each week of lost output takes a little over three weeks of steady state to earn back, so a dip at the low end of that range breaks even within 6-8 weeks of reaching steady state, and a dip at the high end takes roughly twice as long.

The teams that reach the other side of the J-curve describe it as obvious in retrospect. The productivity gain is real, durable, and growing as agents improve. The teams that abandon agents during the trough spend the same calibration cost and get nothing for it. The worst outcome is to invest in the dip and quit before the crossover. The second worst outcome is to never invest at all.
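The breakeven arithmetic in this section reduces to one division. The sketch below assumes the dip cost is expressed in weeks of lost output and that the steady-state gain is constant; both are simplifications, and the function name is illustrative.

```python
def payback_weeks(dip_cost_weeks: float, steady_state_gain: float) -> float:
    """Weeks at steady state needed to earn back the output lost during the dip."""
    if steady_state_gain <= 0:
        raise ValueError("no payback without a positive steady-state gain")
    return dip_cost_weeks / steady_state_gain


for cost in (2, 3, 4):
    weeks = payback_weeks(cost, 0.30)
    print(f"{cost} weeks of output lost -> {weeks:.1f} weeks to break even")
```

At a 30% gain, two weeks of lost output is earned back in under 7 weeks, while a deeper dip takes proportionally longer. That is a direct argument for keeping the trough shallow rather than just short.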

The Macro Picture: Why the Paradox Shows Up in GDP

The Productivity J-Curve is not just a team-level phenomenon. It is why the macro data on AI looks so confusing right now.

Brynjolfsson, Rock, and Syverson’s 2021 model predicts exactly what we see in 2026: massive investment in AI ($250+ billion in 2024 alone, according to the NBER study), widespread adoption (70% of firms actively using AI according to the CEO survey), and approximately zero measurable productivity impact at the macro level.

This is not a paradox. It is the J-curve playing out at scale. Millions of teams are in the trough simultaneously. The complementary investments (process redesign, workforce retraining, organizational restructuring, governance infrastructure) are being expensed right now. The productivity returns have not materialized yet because the calibration period has not elapsed at scale.

The same pattern played out with every general purpose technology:

Electricity entered factories in the 1890s. Productivity gains did not materialize until the 1920s, after factories were redesigned around distributed electric motors rather than the old central-shaft layout. Thirty years of J-curve.
Computing became widespread in business in the 1970s and 1980s. Solow’s paradox (“You can see the computer age everywhere but in the productivity statistics”) persisted until the mid-1990s. Twenty years of J-curve.
AI agents reached mass adoption in 2024-2025. If historical patterns hold, the macro productivity gains will begin appearing in 2028-2030. Three to five years of J-curve.

The Brynjolfsson model adjusts for this: when intangible investments related to AI are properly accounted for, measured productivity is significantly higher than official statistics suggest. The investment is real. It is just not showing up in the numbers yet because the returns lag the costs by years, not quarters.

A Census Bureau working paper (2025) examining tens of thousands of U.S. manufacturing firms found the same J-curve at the firm level: AI adoption tends to hinder productivity in the short term, with measurable declines during the first year. But firms that persisted through the dip outperformed non-adopting peers in both productivity and market share over the medium term.

The firms that quit during the dip fell behind the firms that never adopted. They spent the investment cost and abandoned the return.

The Multi-Agent Amplification

When the individual agent J-curve is not painful enough, some teams decide to solve it by adding more agents. An orchestrator agent that coordinates specialist agents. A review agent that checks the coding agent’s output. A testing agent that writes test suites.

This is the single most common mistake I see in agent deployment. It does not compress the J-curve. It creates multiple simultaneous J-curves with coordination overhead on top.

I wrote about this in detail: DeepMind’s 2025 scaling study found that coordination gains plateau at 3-4 agents and produce negative returns above that threshold. The MAST study found failure rates between 41% and 87% across major multi-agent frameworks, with 79% of failures attributable to coordination breakdowns.

The compounding math is unforgiving. If two chained agents each have a 20% error rate and the coordination layer has a 10% failure rate, the system error rate is not the 20% of any single agent. It is 1 - (0.8 × 0.8 × 0.9) = 42.4%, assuming the failures are independent. Each layer multiplies. And each agent’s supervision tax stacks. The review agent needs to be reviewed. The testing agent’s tests need to be verified. You have not eliminated the human review requirement. You have added layers between the human and the output, making the correction loop longer when something goes wrong.
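The same calculation, generalized to any number of agents. This sketch assumes every agent and the coordination layer must succeed for the output to be correct, and that failures are independent; real pipelines may violate both assumptions.

```python
from math import prod


def system_error_rate(agent_error_rates, coordination_failure_rate=0.0):
    """Chance that at least one stage fails, assuming independent failures."""
    success = prod(1.0 - rate for rate in agent_error_rates)
    success *= 1.0 - coordination_failure_rate
    return 1.0 - success


two_agents = system_error_rate([0.2, 0.2], coordination_failure_rate=0.1)
three_agents = system_error_rate([0.2, 0.2, 0.2], coordination_failure_rate=0.1)
print(f"{two_agents:.1%}")    # 42.4%
print(f"{three_agents:.1%}")  # 53.9%
```

Adding a third agent at the same error rate pushes the system past a coin flip, before counting any extra coordination failures the larger system introduces.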

Start with one agent. Get through its J-curve. Build calibration. Reach the crossover. Then, and only then, consider whether a second agent addresses a genuinely distinct capability gap that justifies the coordination overhead.

Bainbridge Was Right, and It Gets Worse

Bainbridge’s 1983 insight had one more layer that is uniquely applicable to AI agents: skill atrophy. The operators who monitored automated systems lost the ability to intervene effectively because they were no longer practicing the skills required for intervention. The automation that was supposed to make the system more reliable made the human safety net less capable.

With AI agents, the effect is amplified. In industrial automation, the human’s skills atrophy because they are not practicing. With AI agents, the human’s knowledge of their own system atrophies because the agent is making decisions the human used to make. A developer who delegates all boilerplate, routing, and standard patterns to an agent for three months gradually loses fluency in those areas of the codebase. When the agent produces a subtle error in code the developer used to own, their ability to spot it is degraded precisely because the agent has been handling it.

This is the deeper version of the supervision tax: not just the time cost of review, but the degradation of the reviewer’s ability to review effectively. The team becomes increasingly dependent on the agent while simultaneously becoming less capable of catching its errors.

Bainbridge wrote: “The operator is expected to monitor that the automatic system is operating correctly… but it is impossible to monitor a process you do not understand.” Replace “process” with “codebase” and “operator” with “developer” and you have a precise description of what happens to teams that over-delegate without maintaining their own understanding.

The defense is deliberate practice: regularly doing manual work in the areas the agent handles. Not because manual work is more efficient, but because it maintains the human’s mental model. This feels wasteful. It is the investment that keeps the supervision function viable.

The Interactive Calculator: Model Your J-Curve

The numbers vary by team, domain, and agent capability. This calculator lets you model the J-curve for your specific scenario.

Automation ROI Timeline Calculator (interactive widget, not reproduced here)

Inputs: team size (1 to 50; 8 shown), task complexity, current manual time per task (5 to 120 minutes; 30 shown), estimated agent error rate (5% to 50%; 25% shown), governance maturity.

Outputs, with the sample values shown on the page: dip depth -22% (below baseline), dip duration 5 weeks (to crossover), breakeven 8 weeks (cumulative ROI = 0), 6-month ROI +31% (net productivity gain).

Projected productivity curve based on your inputs. Actual results vary.

Model based on composite research data. For illustration, not prediction. Your mileage will vary.
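For readers who want to experiment offline, here is a minimal piecewise-linear J-curve sketch. It is not the calculator's model, which is not published in this post. The curve shape, the trough week, and the ramp-up week are assumptions; only the -22% depth, the week-5 crossover, and the +31% gain are borrowed from the sample values above.

```python
from typing import Optional


def weekly_delta(week: int, trough_week: int, depth: float,
                 crossover_week: int, steady_week: int, gain: float) -> float:
    """Productivity relative to baseline (0.0 = baseline) in a given week."""
    if not 0 < trough_week < crossover_week < steady_week:
        raise ValueError("expected 0 < trough_week < crossover_week < steady_week")
    if week <= 0:
        return 0.0
    if week <= trough_week:      # sliding into the dip
        return -depth * week / trough_week
    if week <= crossover_week:   # climbing back to baseline
        return -depth * (crossover_week - week) / (crossover_week - trough_week)
    if week <= steady_week:      # ramping up to the steady-state gain
        return gain * (week - crossover_week) / (steady_week - crossover_week)
    return gain


def cumulative_breakeven(horizon_weeks: int = 52, **curve) -> Optional[int]:
    """First week in which the running sum of weekly deltas is back at or above zero."""
    total = 0.0
    for week in range(1, horizon_weeks + 1):
        total += weekly_delta(week, **curve)
        if total >= 0:
            return week
    return None  # no breakeven within the horizon


# trough_week and steady_week are invented for illustration.
curve = dict(trough_week=2, depth=0.22, crossover_week=5, steady_week=10, gain=0.31)
print(cumulative_breakeven(**curve))
```

With these inputs the sketch reaches cumulative breakeven in week 9, close to but not the same as the calculator's 8 weeks, which is expected given the different underlying model. Varying `depth` and `crossover_week` shows how quickly a deeper or longer trough pushes breakeven out.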

What the Macro Data Actually Shows

Let me be direct about the state of the evidence in April 2026, because the picture is complicated.

The pessimistic reading: 90% of firms report zero productivity impact from AI. 42% abandoned at least one AI initiative in 2025. 95% of generative AI pilots fail to reach production. Global enterprises invested $684 billion in AI in 2025, and over 80% failed to deliver intended business value. These are not fringe statistics. They are from NBER, Deloitte, MIT Sloan, and S&P Global.

The optimistic reading: 74% of executives who deployed AI agents report achieving ROI within the first year. Among those reporting productivity gains, 39% have seen productivity at least double. Organizations deploying agentic AI systems report average returns of 171%. These are from Google Cloud, Salesforce, and enterprise deployment surveys.

Both readings are true. They are measuring different populations at different points on the J-curve. The firms reporting zero impact are in the trough. The firms reporting massive ROI pushed through it. The 42% that abandoned their initiatives quit during the dip. The ones reporting 171% ROI are the ones that persisted past the crossover.

This is not spin. It is exactly what the Brynjolfsson J-curve model predicts. At any given moment during a major technology transition, you will see both dramatic success stories from early crossover firms and dramatic failure stories from firms still in the dip. The same technology produces both outcomes, depending on whether the deploying organization understood and planned for the investment period.

The question is not whether AI agents create productivity gains. The evidence that they do, for teams that reach steady state, is strong. The question is whether your organization has the patience, governance, and planning to survive the trough.

What to Watch Over the Next Year

Agent management as a role. “AI agent manager” is emerging as a distinct organizational function. Not a developer who also uses agents, but someone whose primary responsibility is configuring, monitoring, calibrating, and optimizing agent deployments. This is Bainbridge’s irony expressed as an org chart: the automation created a new human role focused on managing the automation.

Observability as standard. The current generation of agents produces output with no signal about confidence, process, or difficulty. Next-generation observability tooling will surface the agent’s internal state, enabling faster trust calibration and lower supervision tax. Microsoft Azure’s observability best practices and dedicated agent monitoring platforms are pushing this forward.

Structured verification replacing human review. Instead of relying on human review for everything, the emerging pattern is automated verification: type checking, static analysis, property-based testing, and behavioral testing of agent output before it reaches the human reviewer. Each layer of automated verification that catches real errors reduces the supervision tax without introducing multi-agent coordination failures.

Calibration data as an asset. Teams are beginning to track agent accuracy by task type, codebase area, and complexity level, building quantitative models of when to trust and when to verify. This replaces gut feeling with data-driven calibration that can be shared across the team.
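As a rough illustration of what calibration data can look like in practice, the sketch below logs whether each piece of agent output was accepted, grouped by task type, and only relaxes review once the evidence supports it. The class name, the 0.9 threshold, and the "spot-check" and "full-review" labels are assumptions for the example, not an established tool or standard. It uses the Wilson score lower bound so that a short streak of successes does not count as proof of reliability.

```python
import math
from collections import defaultdict


class CalibrationLog:
    """Hypothetical per-task-type record of accepted vs. rejected agent output."""

    def __init__(self, trust_threshold=0.9, z=1.96):
        self.trust_threshold = trust_threshold  # assumed cut-off, tune per team
        self.z = z                              # 1.96 ~ 95% confidence
        self._counts = defaultdict(lambda: [0, 0])  # task_type -> [accepted, total]

    def record(self, task_type, accepted):
        counts = self._counts[task_type]
        counts[0] += int(accepted)
        counts[1] += 1

    def lower_bound(self, task_type):
        """Wilson score lower bound on the acceptance rate (0.0 with no data)."""
        accepted, total = self._counts.get(task_type, (0, 0))
        if total == 0:
            return 0.0
        p = accepted / total
        z2 = self.z ** 2
        centre = p + z2 / (2 * total)
        margin = self.z * math.sqrt(p * (1 - p) / total + z2 / (4 * total ** 2))
        return (centre - margin) / (1 + z2 / total)

    def policy(self, task_type):
        """Relax review only when the lower bound clears the threshold."""
        if self.lower_bound(task_type) >= self.trust_threshold:
            return "spot-check"
        return "full-review"


if __name__ == "__main__":
    log = CalibrationLog()
    for _ in range(3):
        log.record("refactor", accepted=True)
    for _ in range(100):
        log.record("unit-tests", accepted=True)
    print(log.policy("refactor"), log.policy("unit-tests"))
```

Three clean results out of three still lands in full review here, while a hundred out of a hundred does not. That asymmetry is the point: the log encodes how much evidence the team requires before expanding trust, and because it is data rather than gut feeling, it can be shared across the team.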

The Bottom Line

The automation paradox is not a reason to avoid AI agents. The productivity gains on the other side of the J-curve are genuine, substantial, and growing as the technology improves. The paradox is a reason to deploy agents with accurate expectations, appropriate investment in the calibration period, and respect for the forty-year-old insight that automating a task does not eliminate the human work around it. It transforms that work into something harder, rarer, and more important.

Bainbridge published her paper in 1983. Brynjolfsson formalized the J-curve model in 2021. The METR study measured the 19% slowdown in 2025. The NBER CEO survey confirmed the macro paradox in 2026. The evidence trail is forty years deep and internally consistent. Every automation wave follows the same pattern. Teams that recognize this navigate the transition. Teams that fail to recognize it abandon agents at the bottom of the curve, right before the payoff.

Plan for the dip. Budget for the calibration. Measure your way through it. The crossover is real, and it is worth reaching.

The Research Papers

Paper	Year	Key Finding
Ironies of Automation (Bainbridge)	1983	Automation makes the remaining human tasks harder and more critical
The Productivity J-Curve (Brynjolfsson, Rock, Syverson)	2021	GPTs follow a J-curve: intangible investments front-load costs before gains materialize
METR AI Developer Productivity Study	2025	Experienced developers 19% slower with AI tools; 43-point perception gap
MAST Multi-Agent Failure Analysis	2025	79% of multi-agent failures from coordination, not capability
DeepMind Multi-Agent Scaling Study	2025	Coordination gains plateau at 3-4 agents; sequential reasoning degrades 39-70%
Uplevel Copilot Engineering Study	2024	No throughput increase; 41% bug rate increase with Copilot
GitClear Code Quality Analysis	2024	Doubled churn code coinciding with AI assistant adoption
NBER CEO Survey on AI Productivity	2026	90% of firms report zero measurable AI productivity impact
Census Bureau Manufacturing J-Curve	2025	AI adoption hinders short-term productivity; medium-term gains for persistent firms
Trust in Automated Systems	2020	Trust builds slowly, breaks quickly; single failures reset calibration
Code Review Cognitive Load	2023	Reviewing unfamiliar code requires 1.5-2.5x cognitive effort of writing it
Trust Calibration Survey	2023	Comprehensive framework for measuring trust calibration in human-AI systems
Anthropic Agent Autonomy	2026	Autonomy level controls: calibrate agent scope per context

Disclosure: I build AI orchestration systems and work with agent infrastructure professionally. The observations in this post come from direct experience deploying AI agents in production environments and from studying the academic literature on automation and human-machine teaming. The research papers are linked throughout; evaluate the evidence independently.

Mark Laursen

Advisor, founder, and executive producer with 25+ years building technology companies, gaming platforms, and entertainment products. Based in Portugal.
