16 June 2026 · 19 min read · By Mark Laursen
Frontier AI Performance at Home, Without the API Bill
Note: Maestro is open source and free to fork. The fusion numbers below are OpenRouter’s published Fusion benchmark, not my own measurement; Maestro runs the same method on local CLIs. Both the code and OpenRouter’s results are public, so you can check the work and disagree with me in the open.
The idea did not arrive as a flash. It accreted. The first piece was a dual-CLI runner I built inside my own AgentFactory, two model CLIs working the same task so I could compare them side by side instead of trusting whichever one I happened to open that morning. The second piece was FounderOS, the local-first operating environment I run my whole working day inside, where I watched my own usage patterns for months. I lean on Claude Code hard. I lean on the Codex CLI just as hard, deliberately, because I pay a flat rate and I want every dollar of it spent. What I kept noticing was the third seat. Gemini, and the experimental CLI next to it, mostly sat idle. Three frontier model CLIs on one machine, all on flat monthly subscriptions, and at any given moment I was driving one or two of them while the rest waited.
That is a strange way to own frontier-tier models: pay for three, run one. The question that turned into the thing I want to show you was simple. If I am already paying for these CLIs, what happens if I point all of them at the same prompt at once and make them argue?
This post is the answer. It is also the story of how a boring markdown file I have maintained for two years turned out to be the foundation the interesting part needed, and how a frontier model release in June, and the way that release then vanished, convinced me the bet was right.
What Maestro actually is, and how it got here
Maestro did not start as a fusion engine. It started as a discipline layer, and the discipline layer is still doing the unglamorous work underneath everything else.
The discipline layer is a short markdown file your coding agent reads at the start of every session. Not a framework you import, not a runtime, not a bus. A file. It says a completion report may not claim it verified anything unless a real check ran: a type-checker, a linter, the tests. It says every changed line must trace back to what you asked for, with no drive-by refactors and no deleting code the agent could not prove was dead. It says long autonomous runs get a durable checkpoint, an iteration cap, and an explicit end condition declared before the work starts. And it says multi-agent coordination only switches on behind a counted decision gate, because the research is consistent that adding agents usually makes things worse, not better.
I did not write all of that at once. It built up the same way the fusion idea did, one failure at a time. The decision gate came first, in late 2024, after I shipped one too many multi-agent designs that fell over at the seams. The verification rules and the status tokens landed in early 2025, when I got tired of debugging agent output that reported “complete” and had never been tested. The long-horizon section landed late in the year, after a scheduled run drifted across a context compaction and spent a night confidently building on a fact it had already forgotten. Each section is scar tissue from a specific failure. I wrote the longer versions of those failures in why I stopped using multi-agent frameworks, in done means done, and in what eighteen months of production agents taught me.
That is the foundation. It is not glamorous and it never was. The headline is the part that sits on top of it. But I want the order to be clear, because the order is the argument: the discipline came first, the fusion came second, and the fusion only works because the discipline keeps it honest. We will come back to that.
The week the frontier validated the bet
It was the morning of Tuesday, June 9, 2026. I was reading a release on a second monitor while my own repository sat on the first. Anthropic had just shipped what they called their next generation of intelligence for the hardest knowledge work and coding problems, Fable 5, paired with a sibling model on the same day. Half-buried in a paragraph about agentic capability was a line that stopped me. The model was now better at verifying its own work before declaring done, better at sustaining intent across long autonomous runs, better at refusing to claim a task was complete when the checks said otherwise.
I read that line three times. Then I switched monitors and opened my own doctrine file. I scrolled to the verification section, the part that says no completion report may claim verified status unless a type-checker, a linter, and the test suite have all passed. I scrolled to the long-horizon section, the part that mandates one durable checkpoint, dual termination conditions declared at the start, and re-grounding on every iteration. The doctrine had been sitting in my repo for two years. The announcement had been live for ninety minutes.
We were betting on the same thing, at different layers. Fable 5 wrote those rules into the weights. Maestro had written the same three into a markdown file the model reads at session start, two years before the release. The two are not competitors. They are the same bet placed at two different layers of the stack, and the convergence told me the diagnosis was right: the next bottleneck in agentic AI is not raw capability, it is discipline.
And then the access got pulled. The frontier model I had been reading about on June 9 is not one I can run today. Pricing moves, availability moves, and a model you rent by the token can be taken back, sometimes within weeks of landing. That is not a complaint. It is the single most useful thing that happened to this argument, because it drew the line I had been failing to draw for myself. A frontier you rent can disappear out from under you. A frontier you assemble from the CLIs already on your disk cannot. The doctrine layer survived the model going away, because the doctrine never lived in the model. That is the whole case for what comes next.
The Frontier Engine: fan, judge, synthesize
Maestro’s Frontier Engine is a local multi-CLI fusion engine. The shape is four steps, and naming them precisely matters because the fourth step is where most ensemble ideas go wrong.
First, it fans your prompt out to a panel of the AI CLIs installed on your machine. A preset defines the panel. Nothing leaves for an API the engine controls; it drives the same CLIs you already drive by hand.
Second, the panel answers in parallel. Each model works the problem independently, with no knowledge of what the others said, so you get genuinely different attempts rather than three variations on whichever answer landed first.
Third, a judge model reads the whole panel. Not to pick a winner. To produce a structured analysis: where the answers agree, where they contradict each other, where one model saw something the others missed, and where all of them share a blind spot. The judge compares, it does not merge.
Fourth, and this is the part that matters, it synthesizes one grounded answer that explicitly does not majority-vote. Majority voting is how ensembles launder three confident wrong answers into one confident wrong answer. The synthesis step treats a unique insight from a single model as potentially more valuable than a consensus all three reached lazily. Agreement is evidence, not a verdict.
Presets scale the panel to the spend you want. The opus-gpt preset runs Opus 4.8 plus GPT-5.5 and is the recommended default for bounded cost. The opus-duo preset runs two independent Opus passes, which isolates the value of the synthesis step and, as the benchmark shows, is strong on its own. The gpt-duo preset (aliased chatgpt-duo) does the same with two GPT-5.5 passes and runs its judge and synthesizer on Codex too, so it needs no claude at all. The frontier-trio preset runs Opus, GPT-5.5, and Gemini 3.1 Pro together. A custom preset takes one to eight of the known models. The judge and synthesizer default to Opus 4.8, and you override either with a flag. Three CLIs ship as adapters today, with Kimi, DeepSeek, GLM, and Qwen adapters following soon.
Does fusing actually beat using one model?
Here is the part that surprised me, and I want to be precise about whose numbers prove it, because they are not mine. In June 2026 OpenRouter shipped a Fusion API built on exactly this idea, a mixture of agents: fan a prompt across a panel of models, let a judge read every answer, then synthesize one grounded response. They benchmarked it on DRACO, a suite of a hundred deep-research tasks, and published the result. Every fusion panel outscored the single models inside it. Not as a tie-breaker or a marginal smoothing, but as the difference between the panel and the best model that went into it.
Read the chart from the runnable end first. A double-Opus panel, two passes of Opus 4.8 against the same prompt with a synthesis step on top, reaches 65.5 percent. Opus plus GPT-5.5 reaches 67.6 percent. The Opus, GPT-5.5, and Gemini 3.1 Pro trio reaches 68.3. Every one of those is a panel you can run tonight on subscriptions you already hold, and every one clears the bar the strongest solo models miss by ten to twenty points. The single bar above them, Fable 5 plus GPT-5.5 at 69.0, needs Fable 5, which has since been pulled, so leave it; the point is that the panels you can still run land in the same neighborhood as the frontier model that vanished.
Now look at where Claude Fable 5 lands on its own: 65.3 percent. The double-Opus panel, the model you already run used twice, sits one notch above it at 65.5. That is the convergence made concrete. The instinct I had back in June was right, you should want your local agent as smart as the model the lab just shipped, but the mechanism was wrong. You do not have to wait for the lab, and you do not have to keep paying for access that can vanish. You can fuse the models already on your disk and reach that tier today.
Two honest bounds, and the first one matters most: these numbers are OpenRouter’s, not mine. They measured fusion through their API, server-side, on a deep-research benchmark, and they put the headline at frontier-level quality for about half the cost. Maestro’s only change is the layer. It runs the same panel, judge, and synthesis across the CLIs already on your machine, so the marginal cost is the flat-rate subscription you have already bought rather than a metered API call. I am leaning on their measurement, not a head-to-head of my own local panels, so treat the exact figures as the direction of the result, not a promise about your workload. The structure is what carries from their API to your CLIs: a fused panel beats its best single member, and you can run that structure at home.
The part that actually saves money
I do not have an idle-machine story to sell you. I run Claude Code hot and the Codex CLI hot, most of the working day, on purpose. The reason fusion is a home story and not a data-center story is not spare capacity. It is the billing model.
API access is metered. You pay per token, every token, and the bill scales with exactly how hard your agents work. The harder you lean on them, the faster the meter spins. I have written before, in what eighteen months of production agents taught me, about how the cost arc of production agents only bends when you stop paying the vendor for what you can do with engineering instead.
The CLIs on your machine are not metered that way. They run under flat-rate subscriptions. You pay the same whether they run hot or sit waiting. Fusion is the move that converts a fixed monthly cost into frontier-grade output: the marginal cost of a fusion run is subscription usage you have already bought, not a fresh metered line item climbing with every call.
I will not pretend fusion is free. Each cold panel, judge, and synthesis call is real work and consumes real subscription usage, so a fusion run is more expensive than a single solo call. That is why the default preset is opus-gpt rather than the full trio, why the guidance is to keep prompts small, and why the token budget cap exists even though it is opt-in and off by default. The claim is not that fusion is cheap. The claim is that it spends subscription capacity you already own to reach a quality level that would otherwise cost you a metered API tier, or a wait for access you might not get. For the work that deserves a frontier answer, that trade is the best one on the board.
The discipline layer it runs on
None of this would matter if the synthesized answer lied about being done. The fusion engine is the headline, but the reason I trust its output is the layer underneath it, the discipline layer Maestro has shipped since 2024.
Fusion sits on top of that contract. The panel can be brilliant, but the synthesis still has to carry a status token, still has to trace its claims back to the request, still has to stop when the checks fail. Frontier quality without that discipline is just a more confident way to be wrong.
Here is the part Fable 5 settled for me. When Anthropic spent the frontier’s next marginal capability on verifying its own work, refusing to call a job done before it was, and holding intent across a long run, rather than on raw cleverness, they told the whole field where the value is. That is the same call Maestro made two years earlier, one layer up. The June release was not a threat to the discipline layer; it was the strongest endorsement it has ever had. The lab and the doctrine agreed: the next win is discipline, not horsepower.
And the discipline layer earns that endorsement in plain, daily ways. The agent runs the type-checker, the linter, and the tests before it claims a green build, so a “complete” you read is a “complete” that something checked. It traces every change back to what you asked for, so it does not quietly rewrite half the file on the way past. When it starts to drift toward a confident, unsupported claim, the contract catches it: the model either stops itself or names the check it skipped instead of papering over it. The result is a model that makes fewer mistakes, ships fewer hallucinations, and hands you a far smaller pile of work to undo later. It spends some extra tokens to get there, and it is the best trade in the stack, because the tokens you save by skipping the discipline come back as the 2 a.m. cleanup I wrote about in the automation paradox. I covered the verification half of this in detail in done means done.
Fusion inherits all of it. The synthesized answer is not just the smartest reading of the panel; it is one that had to clear the same done-is-done bar before it reached you. Smarter outcomes from fusing the models, an answer you can trust without re-checking from the discipline underneath: that pairing is what Maestro is really selling, and Fable’s release is what convinced me the second half of it was right all along.
Install it in five minutes
Maestro is opt-in and off by default. Installing or upgrading it changes nothing until you ask. It works across Claude Code, the Codex CLI, Gemini, Cursor, Cline, and Windsurf, because the portable doctrine lives in one file and each runtime gets a thin adapter that imports it.
On Claude Code or Claude Desktop, it is a native plugin and a two-line install run inside the tool:
/plugin marketplace add mbanderas/maestro
/plugin install maestro@maestro
Then arm the engine, point it at a preset, and give it a real prompt. Start with opus-gpt to keep spend bounded and watch the judge surface a contradiction the two models had between them:
/maestro:frontier fusion opus-gpt
/maestro:frontier status
/maestro:frontier off
On Codex, in the CLI or Desktop, it installs as a plugin from a Git-backed marketplace, the same two-step shape as Claude Code:
codex plugin marketplace add mbanderas/maestro
codex plugin add maestro@maestro
Restart Codex or open a new thread, then enable Maestro and trust its bundled hooks, because Codex runs plugin hooks only for plugins you trust. From there you arm it in plain language: tell Codex Use maestro-frontier to turn on chatgpt-duo for this project. The chatgpt-duo preset is the Codex-only panel, two GPT-5.5 passes judged and synthesized on Codex, so it needs no Claude on the machine at all. With the plugin trusted and a preset armed, an ordinary prompt in Codex runs the panel underneath and the answer comes back as the grounded synthesis; one off turns it back into a plain pass-through. This is a Git-backed marketplace for now, not OpenAI’s public plugin directory, which is still on the way.
Other tools follow the same shape: npx github:mbanderas/maestro install --target cursor, or gemini, cline, windsurf. If you want the bare maestro command on your PATH, install it once globally with npm install -g github:mbanderas/maestro and restart the tool. The whole thing is on the Maestro repository, MIT licensed, zero runtime dependencies, with its own benchmark harness included, the hidden-oracle A/B suite for the discipline layer, so you can reproduce that part yourself.
The model list is moving too. A fusion of Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro already reaches 64.7 percent on the benchmark, knocking on the frontier tier with no proprietary frontier model in it at all. Panels built entirely from cheaper and open models that still clear the bar are the direction of travel.
What I am still unsure about
The fusion benchmark is not mine. It is OpenRouter’s, run through their API on a deep-research suite, and I am leaning on it rather than a measurement of my own local panels. I believe the structure carries from their API to local CLIs, because the method is the same panel, judge, and synthesis either way, but I have not published a head-to-head of the two, and you should not read the exact percentages as a promise about your workload. Run your own tasks before you bet a budget on the decimal places.
The cost story depends on your subscriptions. If you already pay for the CLIs, fusion is close to free capacity. If you do not, the math changes, and a metered API panel could erase the advantage. The shape of the argument holds for the operator who already runs frontier CLIs under flat-rate plans. It is weaker for a team that would have to buy seats specifically to fuse them.
And the synthesis step is the youngest part of the system. The discipline layer has two years of edits behind it. The fusion engine is unit-tested for graceful degradation, recursion bounds, the budget cap, and the anti-majority rule, and it is verified end to end, but it has not lived through two years of my mistakes yet. I trust the foundation more than I trust the roof, and I am telling you that on purpose.
Why this is the bet worth making
Frontier AI performance used to be a thing you rented from a lab, by the token, on a meter that never stopped, with access that could change under you. June proved that last part is not hypothetical. The model that validated this whole thesis from inside the weights is one I cannot run today. The models still on my own machine got to the same tier, and I am already paying a flat rate for them. Maestro’s Frontier Engine is the argument that you can fuse what you already own into an answer that beats any single model in the panel, pay for it with subscriptions already on your card, and keep the whole thing honest with a discipline layer you can read.
If you ship agents in 2026, the fifteen minutes it takes to install the plugin, arm the opus-gpt preset, and watch the judge surface a contradiction the two models had between them is the cheapest experiment on your desk this week. Clone the Maestro repository, run the benchmark on your own work, and disagree with my numbers in public. The point was never the exact percentage. The point is that the frontier was sitting on your machine the whole time, and you were only using a fraction of it.
Disclosure: I build Maestro (open-source AI orchestration: a local multi-CLI fusion engine on a verification-first discipline layer) and Govyn (agent governance infrastructure). I have a direct interest in both. Maestro’s own discipline benchmark ships in the repository, and the fusion numbers above are OpenRouter’s published results; check both and draw your own conclusions rather than taking mine.
Advisor, founder, and executive producer with 25+ years building technology companies, gaming platforms, and entertainment products. Based in Portugal.