Maestro
Open Source
A local multi-CLI fusion engine: fan one prompt across Opus 4.8, GPT-5.5, and Gemini 3.1 Pro, have Opus judge the answers, then synthesize a grounded response. It runs on a proven discipline layer: verified done-claims, surgical scope, and a research-backed multi-agent gate. Zero dependencies, opt-in, off by default.
Maestro Frontier is a local multi-CLI fusion engine. It fans one prompt out to a parallel panel of the AI CLIs already on your machine (Opus 4.8, GPT-5.5, Gemini 3.1 Pro), has Opus 4.8 judge their answers into a structured analysis of consensus, contradictions, unique insights, and blind spots, then writes a grounded synthesis that explicitly does not majority-vote. It is opt-in and off by default, so installing or upgrading changes nothing until you ask for it. Underneath, it runs Maestro’s proven discipline layer: completion claims carry a verification status backed by a real check, every changed line traces to the request, and long autonomous runs get checkpoints, iteration caps, and explicit end conditions. Two markdown files carry the discipline layer; a zero-dependency CommonJS engine carries the fusion. No SDK.
The Frontier Engine
Frontier is built from the AI CLIs already installed on your machine, so it adds no model API of its own. A preset defines the panel, and the judge and synthesizer default to Opus 4.8:
| Mode | Behavior |
|---|---|
| off | Normal Maestro. The engine is never invoked and behavior is unchanged. This is the default. |
| single <model> | Route the prompt to one local CLI and return its answer, with no panel and no judge. |
| fusion <preset> | Full panel, then an Opus judge analysis, then a grounded Opus synthesis, with graceful degradation and a one-level recursion bound. |
You switch modes with /maestro:frontier in Claude Code or node frontier/cli.cjs from any shell. Presets run from opus-duo (two Opus runs, isolating the synthesis step), through opus-gpt (Opus plus GPT-5.5, the recommended default for bounded spend), up to frontier-trio (Opus, GPT-5.5, and Gemini 3.1 Pro). The judge compares the panel’s answers; it does not merge them.
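The three modes and the preset panels can be modeled in a few lines. This is an illustrative Python sketch, not the engine's CommonJS code: `PRESETS`, `run_panel`, `run_frontier`, and the `ask` callable (a stand-in for invoking a local CLI) are hypothetical names, and the prompt wording is invented. Only the mode names, preset names, and panel compositions come from the text. The degradation behavior shown here (drop a failed panelist, skip the judge when one answer survives) is an assumption, and the recursion bound and budget cap are left out.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Optional

# Panel composition per preset, taken from the text above.
PRESETS: Dict[str, List[str]] = {
    "opus-duo": ["opus", "opus"],
    "opus-gpt": ["opus", "gpt"],
    "frontier-trio": ["opus", "gpt", "gemini"],
}

# Hypothetical wrapper around a local CLI: (model, prompt) -> answer text.
Ask = Callable[[str, str], str]


def run_panel(ask: Ask, models: List[str], prompt: str) -> List[str]:
    """Fan one prompt out in parallel and keep only the answers that arrived."""

    def call(model: str) -> Optional[str]:
        try:
            return ask(model, prompt)
        except Exception:
            return None  # assumed degradation: a failed panelist is dropped

    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        results = list(pool.map(call, models))
    return [r for r in results if r is not None]


def run_frontier(ask: Ask, mode: str, target: str, prompt: str) -> Optional[str]:
    """Dispatch on mode: off, single <model>, or fusion <preset>."""
    if mode == "off":
        return None  # the engine is never invoked
    if mode == "single":
        return ask(target, prompt)  # one CLI, no panel, no judge
    if mode != "fusion":
        raise ValueError(f"unknown mode: {mode}")

    answers = run_panel(ask, PRESETS[target], prompt)
    if not answers:
        raise RuntimeError("every panelist failed")
    if len(answers) == 1:
        return answers[0]  # assumed degradation: nothing left to compare

    numbered = "\n\n".join(f"[{i}] {a}" for i, a in enumerate(answers, 1))
    # The judge compares the answers; it does not merge them.
    analysis = ask(
        "opus",
        "Compare these answers. List consensus, contradictions, unique "
        "insights, and blind spots. Do not merge them.\n\n" + numbered,
    )
    # The synthesis is grounded in the analysis and must not majority-vote.
    return ask(
        "opus",
        "Write a grounded synthesis. Do not majority-vote.\n\n"
        f"Question: {prompt}\n\nAnalysis:\n{analysis}\n\nAnswers:\n{numbered}",
    )
```

Keeping the judge and the synthesizer as separate calls mirrors the text: the judge's structured comparison is an input to the synthesis, not a tally of which answer appeared most often.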
Honest scope, measured rather than implied: the engine is built, unit-tested for degradation, recursion, budget, and the anti-majority rule, and verified end to end on real runs of single mode and the opus-gpt, opus-duo, and frontier-trio presets. The quality lift of local fusion is not yet benchmarked in this repo, so no lift is claimed. Two operational caveats stand. Headless web access differs per CLI: Codex is confirmed live, while Claude and Gemini are gated off in this build. And each cold Opus panel, judge, or synthesis call carries non-trivial cost, so small prompts and the opus-gpt preset keep spend bounded. The token budget cap is opt-in and disabled by default.
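An opt-in cap that is disabled by default can be pictured as a tracker whose limit is unset until the user sets one. The sketch below is a Python illustration under that assumption. `TokenBudget` and `BudgetExceeded` are hypothetical names, and how the real engine counts tokens or reacts at the cap is not stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when an enabled cap would be crossed."""


@dataclass
class TokenBudget:
    cap: Optional[int] = None  # None means the cap is disabled (the default)
    spent: int = 0

    def charge(self, tokens: int) -> None:
        """Record usage; refuse the call only when a cap was opted into."""
        if self.cap is not None and self.spent + tokens > self.cap:
            raise BudgetExceeded(
                f"{self.spent + tokens} tokens would exceed cap {self.cap}"
            )
        self.spent += tokens
```

With no cap set, `charge` only records usage, so spend stays bounded by prompt size and preset choice rather than by the tracker.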
The discipline layer it runs on
Frontier runs on this, and it is the part this repo actually benchmarks. Most multi-agent systems add agents to make things faster. The research says the opposite: adding agents usually makes things worse. Maestro starts from that finding. It is a discipline layer your existing coding agent reads on startup: completion claims must carry a verification status backed by a real check, every changed line must trace back to the request, and long autonomous runs get checkpoints, iteration caps, and explicit end conditions. Multi-agent coordination exists behind a counted decision gate that routes most work to a single agent. Two markdown files. No dependencies, no SDK.
What it changes day to day
Four failure modes every agent user knows, and what Maestro does about each:
- “All done!” on code that was never run. Maestro requires a status token (VERIFIED, PENDING_REVIEW, UNVERIFIED, FAIL) backed by an actual type-check, lint, or test run. A pack of six Claude Code hooks enforces the discipline structurally where prose cannot: the doctrine-read guard denies wasteful doctrine re-reads outright (24 of 24 attempts denied across 18 benchmark runs, zero successful re-reads, oracle pass unchanged), and one added overclaim line took unsupported VERIFIED claims from 5 of 6 runs to 0 of 6 on the benchmark’s checker-less refactor task, measured before and after.
- Drive-by refactors. The surgical-scope rule: every changed line traces to what you asked for. No formatting sweeps, no unrequested “improvements”, no deleting code the agent could not verify was dead.
- Overnight runs that drift. Recurring and long-horizon work gets a checkpoint artifact, a success condition plus a hard cap declared up front, and a re-read of both on every wake. The Maestro repo’s own benchmark loops run under these rules.
- Agent sprawl. The decision gate counts files and concerns, writes a one-line verdict before the first edit, and keeps orchestration (Planner, Specialists, adversarial review) reserved for work that is genuinely too big for one pass.
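The long-horizon rule in the list above (a checkpoint artifact, a success condition and hard cap declared up front, and a re-read of both on every wake) can be sketched as a small loop. This is a Python illustration only: the JSON layout, the `run_long_horizon` function, and its parameters are assumptions, not Maestro's actual checkpoint format.

```python
import json
from pathlib import Path
from typing import Callable


def run_long_horizon(
    checkpoint: Path,
    goal: dict,
    max_iterations: int,
    step: Callable[[dict], dict],
    is_done: Callable[[dict, dict], bool],
) -> dict:
    """Run step() until the declared goal is met or the declared cap is hit."""
    if not checkpoint.exists():
        # Declare the success condition and the hard cap up front, in the artifact.
        checkpoint.write_text(json.dumps(
            {"goal": goal, "max_iterations": max_iterations, "iteration": 0, "state": {}}
        ))
    while True:
        # Every wake re-reads the goal and the cap from disk, never from memory.
        record = json.loads(checkpoint.read_text())
        if is_done(record["state"], record["goal"]):
            record["outcome"] = "success"
        elif record["iteration"] >= record["max_iterations"]:
            record["outcome"] = "cap_reached"
        else:
            record["state"] = step(record["state"])
            record["iteration"] += 1
        checkpoint.write_text(json.dumps(record))
        if "outcome" in record:
            return record
```

Because the goal and the cap live in the artifact rather than in the agent's context, a run that drifts still ends on one of two explicit outcomes, and the file records which one.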
When it pays
The overhead is measured, not implied: about 10% extra cost on a 10-module refactor and 38% on a 16-file feature versus a clean agent (n=9 medians, benchmark harness in the repo). What that buys depends on who is watching.
Supervised, interactive use: you are already the audit layer. On tasks a clean agent passes anyway, the data shows no outcome difference; the premium buys scope guarantees and honest status lines. Worth it when drive-by refactors and false “all done” claims cost you review time.
Unattended work is where the premium earns its keep. Overnight loops, scheduled runs, CI agents: nobody reads the transcript at 3am, and the close-out claim is all you have. Auditing a 16-file diff after the fact costs comparable tokens and cannot recover what the run never recorded, like whether any check actually ran, or whether that unrequested “improvement” was deliberate. Maestro spends the same money in-line, while the information still exists, and leaves a verdict line, a status token, and a checkpoint trail you can trust without replaying the run.
This is not a hypothetical regime. The Maestro repo maintains itself this way: its four most recent maintenance loops ran unattended under the same long-horizon rules, made 75 benchmark runs for $30.12 against $47 in ceilings, voided zero runs, and published their own retractions and errata with no human in the loop.
The first five minutes
Install is one copy step, and the same on Windows as on macOS/Linux: the optional hooks, status line, and test suite ship in both PowerShell and POSIX variants, with CI running the full suite on both platforms. Then give your agent a real multi-file task and watch for two signals:
- A gate line before the first edit, with counts that match the task you gave it:
GATE: files=6 concerns=3 -> single-agent ...
- A status token at the end. Either VERIFIED naming the check that ran, or an honest refusal to claim it. A real close-out from the benchmark streams: “UNVERIFIED: no type-checker or linter configured in this project.”
If either signal is missing, the doctrine file is not being read; check that the files sit in the project root. These are the same two signals the benchmark scorer counts, so what you check by eye is exactly what the repo measures.
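Checking a saved transcript for the two signals can also be done mechanically. The Python sketch below is not the repo's benchmark scorer: `find_signals` is a hypothetical helper, and the patterns assume the gate line starts a line in the form shown above and that the last status token in the transcript is the close-out.

```python
import re

# Assumed gate-line shape, based on the example line shown above.
GATE_RE = re.compile(r"^GATE: files=(\d+) concerns=(\d+) -> (\S+)", re.MULTILINE)
# The four status tokens named in the text; \b keeps VERIFIED from
# matching inside UNVERIFIED.
STATUS_RE = re.compile(r"\b(VERIFIED|PENDING_REVIEW|UNVERIFIED|FAIL)\b")


def find_signals(transcript: str) -> dict:
    """Return the first gate line's fields and the last status token, if any."""
    gate = GATE_RE.search(transcript)
    tokens = STATUS_RE.findall(transcript)
    return {
        "gate": (int(gate.group(1)), int(gate.group(2)), gate.group(3)) if gate else None,
        "status": tokens[-1] if tokens else None,
    }
```

A result with `None` in either field corresponds to the missing-signal case described above, where the doctrine file is probably not being read.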
Architecture
The decision gate is the key. It counts the work, writes its verdict, and keeps most tasks single-agent with zero coordination overhead. Multi-agent coordination only activates when the task genuinely benefits from parallel execution or adversarial review. The bias is intentional: the research shows coordination overhead makes simple tasks worse, not better.
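A counted gate of this kind reduces to a comparison and a formatted line. The Python sketch below is illustrative only. The `gate` function and both thresholds are placeholders chosen for the example, not Maestro's actual limits, and in practice the gate is doctrine the agent reads, not code it calls.

```python
def gate(files: int, concerns: int, max_files: int = 10, max_concerns: int = 3) -> str:
    """Count the work and emit a one-line verdict before the first edit.

    max_files and max_concerns are hypothetical thresholds for illustration.
    """
    oversized = files > max_files or concerns > max_concerns
    verdict = "multi-agent" if oversized else "single-agent"
    return f"GATE: files={files} concerns={concerns} -> {verdict}"
```

The default branch is single-agent, which encodes the intentional bias: coordination has to be justified by the counts, never assumed.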
Measured, not marketed
The repo ships a reproducible A/B benchmark harness: 13 fixture tasks, an isolated config dir so global settings cannot contaminate either cell, and a verifier the agent never sees during the run. The published results include retractions. Headline efficiency numbers from early n=3 runs did not survive replication at n=9, and the README says so in plain text. What you read there was measured, and what was not measured is not claimed. The newest measurement isolates a single hook: the S1 gate-reminder alone (no other hook installed) moves a 16-file task from one multi-agent verdict in six runs to six of six with real specialist spawns, no change in oracle pass, and a cost gap inside run-to-run spread; spawning costs more and buys nothing measurable on a fixture a single agent already passes.
The benchmark measures the discipline layer, not Frontier. The quality lift of local fusion is a separate question this repo has not yet answered, and the landing page above says as much.
Portable Core, Thin Adapters
Maestro separates portable orchestration doctrine from runtime-specific adapters. The core logic lives in AGENTS.md and works across any agent runtime. Each runtime gets a thin wrapper that imports the shared doctrine and adds only what’s specific to that environment.
Claude Code: Subagents vs Agent Teams
Claude Code offers subagents and agent teams. Maestro’s Claude adapter automatically routes to the right one:
- Subagents (default): narrow, independent tasks where only the result matters
- Agent teams: long-running parallel workstreams where peer-to-peer coordination is materially useful
Agent teams are experimental and Claude Code-only. Maestro’s portable core uses the general concept of “specialists” which each runtime maps to its own execution model.
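The split between a portable "specialist" concept and a runtime-specific execution model can be pictured as a small mapping function per adapter. The Python sketch below is a hypothetical model of the Claude adapter's routing rule as described above. `SpecialistTask`, its fields, and `claude_execution_model` are invented names, and the real adapter is doctrine text rather than code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialistTask:
    """Portable description of a specialist's work (hypothetical fields)."""
    name: str
    long_running: bool = False
    needs_peer_coordination: bool = False


def claude_execution_model(task: SpecialistTask) -> str:
    """Map a portable specialist onto Claude Code's execution models."""
    # Agent teams only for long-running parallel workstreams where
    # peer-to-peer coordination is materially useful.
    if task.long_running and task.needs_peer_coordination:
        return "agent-team"
    # Default: a subagent, for narrow tasks where only the result matters.
    return "subagent"
```

Another runtime's adapter would supply its own mapping function for the same `SpecialistTask`, which is what keeps the core portable.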
Why Not CrewAI / LangGraph / AutoGen?
Maestro is not one of those. CrewAI, LangGraph, and AutoGen are standalone multi-agent platforms: you install packages, write agent code, and deploy a process. Maestro is a discipline-and-orchestration layer for the AI coding agents you already run. You do not write agent code; you copy a couple of files and your existing agent gains verification rigor, scope discipline, and gated multi-agent coordination.
If you need a standalone multi-agent application with custom tools, APIs, and deployment pipelines, build it on one of those platforms. If you want your AI coding agent to handle complex tasks better without changing your workflow, use Maestro.
Research Foundation
The architecture is grounded in 700+ sources across computer science, library science, safety engineering, and knowledge theory.
Read the full analysis in Why Your Multi-Agent AI System Keeps Failing.