Diagram contrasting a single overloaded coding agent with a fleet of bounded specialists orchestrated through structured handoffs
AI Engineering, Engineering Practices, Architecture

The Problem Isn't the Model. It's the Architecture Around It.

By Harsh Parmar · 9 min read

The limits of single-agent AI coding workflows are not model-capability problems. They are architecture problems. Context bloats, roles contaminate each other, state evaporates between sessions, and there are no error boundaries when one stage corrupts the next. The fix is the same discipline that prevents spaghetti code — single responsibility, bounded context, structured contracts — applied one level up, to the agents themselves.

Every new model release prompts the same question: is this the one that finally makes AI coding agents reliable for production software?

It is the wrong question. What keeps single-agent workflows from scaling is not the model. It is the architecture around the model — and by default, there isn't any.

The teams shipping real code through AI agents stopped trying to prompt their way past this. They decomposed the work — splitting responsibilities across specialized agents with bounded context, structured handoffs, and stateful orchestration. We ended up doing the same thing, and arrived at the same conclusion most teams eventually do: a single agent is the wrong unit of work for production software.

The Single-Agent Wall

Give a capable frontier model a real delivery task — "build this feature end-to-end" — and watch what happens over the course of a session.

The context bloats. The agent pulls in files to understand the domain, adds more for the tests, accumulates the design rationale, then the implementation, then the review. By the time it is on the fourth file, half of its context window is conversation history that no longer influences the current decision. The quality of the final code is determined by what the agent can still pay attention to — and that attention budget is finite.

Roles contaminate each other. The same context that says "write the simplest code to pass the test" also says "handle these edge cases" and "think about the architecture" and "remember the performance requirement from three messages ago." The agent is wearing four hats inside one reasoning step. The "simplest code" hat loses, because the edge-case hat is louder and has more tokens.

There are no error boundaries. When the agent outputs a broken test because it misinterpreted the spec, nothing catches the drift before the next step builds on top of it. Prose-to-prose handoffs inside one session are silent. A slightly wrong intermediate output propagates until the whole session is a house built on a bad foundation.

State evaporates. The next morning, the next session starts blank. The agent has no memory of the design tradeoffs, the architectural constraints, the half-finished slice. You paste a CLAUDE.md and hope. The workflow is single-shot by default.

These are not prompt-engineering problems. You cannot write your way out of any of them with a better system message. They are architectural limits.

Figure 1: The single-agent workflow accumulates unrelated context, mixes roles, and loses state between sessions. A multi-agent architecture scopes context per agent and persists state outside any single conversation.

A Single Agent Is the Wrong Unit of Work

All four walls have the same root cause: one agent trying to be many things at once. The solution is not a bigger agent or a better prompt. It is fewer responsibilities per agent.

Birgitta Böckeler's harness engineering framework leans on Ashby's Law — a regulator must have at least as much variety as the system it regulates. A single agent doing the whole delivery workflow has an unbounded failure space: it can fail at planning, at coding, at testing, at reviewing, and the failure modes blend together so you cannot tell which one happened. No single harness covers all of that at once.

When each kind of judgment gets its own agent with its own small context, the failure space per agent becomes bounded. A test-writer can only fail at writing tests. A reviewer can only fail at reviewing. Each failure mode is small enough to have its own detection mechanism — its own sensor, its own fitness function, its own gate. The union of small, specialized harnesses covers ground no single harness could.

"Constrain the agent" and "decompose the agent" are the same move, just at different scales. Harness engineering constrains one agent. Multi-agent decomposition splits the regulation problem across several. You end up using both, for the same reason.

What a Working Multi-Agent System Looks Like

The setups we have seen work — ours and others' — share four properties.

Single responsibility per agent. Each agent owns one kind of judgment. Planning is a different kind of judgment than coding, which is different from reviewing, which is different from shipping. Collapsing them into one role produces what collapsing them at the code level produces — a god class. If the agent's name contains "and," it is doing too much.

Bounded context per agent. An agent receives only what it needs. The coder writing minimal code to pass a failing test does not need the meeting notes from planning. The reviewer does not need the internal reasoning the coder used. Context is filtered at every handoff. This is the primary quality and cost lever — more than model tier. An agent with a 5,000-token focused context beats an agent with a 50,000-token everything-context on both dimensions.
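Context filtering at a handoff can be as simple as projecting the workflow state down to a per-role allowlist. A minimal sketch; the `NEEDS` mapping and field names are illustrative assumptions, not a prescribed schema:

```python
# Per-role context allowlists. Each agent receives a projection of
# the workflow state, never the whole thing.

NEEDS = {
    "coder":    {"failing_test", "acceptance_criteria"},
    "reviewer": {"diff", "fitness_results", "acceptance_criteria"},
}

def bounded_context(role: str, state: dict) -> dict:
    """Project the full workflow state down to what this role needs."""
    allowed = NEEDS[role]
    return {k: v for k, v in state.items() if k in allowed}

state = {
    "planning_notes": "long meeting-notes-scale history...",
    "failing_test": "test_login()",
    "acceptance_criteria": ["login succeeds"],
    "diff": "+ def login(): ...",
    "fitness_results": {"tests_pass": True},
}

coder_ctx = bounded_context("coder", state)
# The coder never sees the planning notes or the reviewer's inputs.
```

The filter is the cost lever made literal: every key dropped here is tokens the agent never pays for and noise it never attends to.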

Structured contracts at every boundary. What flows between agents is data, not prose. The planner does not hand the coder a narrative — it hands over structured rules, examples, and acceptance criteria. The reviewer does not receive a summary — it receives a diff, a set of fitness-function results, and the specification that was supposed to be satisfied. Prose requires interpretation. Structured data makes intent explicit and failures loud.
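One way to make a boundary structured is to give each handoff a typed schema and validate it on arrival. A sketch in Python; the field names follow the handoffs described above (rules, examples, acceptance criteria; diff, fitness results, spec), but the exact shape is an assumption:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PlanHandoff:
    """What the planner hands the coder: data, not prose."""
    rules: list[str]
    examples: list[dict]
    acceptance_criteria: list[str]

@dataclass(frozen=True)
class ReviewHandoff:
    """What the reviewer receives: a diff, gate results, and the spec."""
    diff: str
    fitness_results: dict[str, bool]
    spec: PlanHandoff

def validate(handoff: PlanHandoff) -> None:
    # A contract makes failures loud: an empty field is an error here,
    # not something the next agent quietly interprets around.
    if not handoff.acceptance_criteria:
        raise ValueError("planner produced no acceptance criteria")

plan = PlanHandoff(
    rules=["write the simplest code that passes the test"],
    examples=[{"input": "valid credentials", "expected": "session"}],
    acceptance_criteria=["all fitness functions pass"],
)
validate(plan)  # raises if the contract is violated
```

A prose handoff with the same gap ("the planner forgot the acceptance criteria") would sail through silently and surface three agents later.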

Stateful orchestrator. The agents themselves are stateless. They read a skill, do their job, return a result. The workflow state — which phase, which gates have passed, what the current scope is — lives outside any individual agent, in a persistent store the orchestrator reads and writes. This is what lets work survive a session boundary, a model change, or a human pause.
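A persistent store can start as small as a JSON file the orchestrator reads before each agent run and writes after. A minimal sketch; the file name and fields are illustrative assumptions:

```python
import json
from pathlib import Path

# Workflow state lives in a file, outside any agent's context.
STATE_FILE = Path("workflow_state.json")

def load_state() -> dict:
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"phase": "plan", "gates_passed": [], "scope": None}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

# One stateless agent run: read state, do one job, write state back.
state = load_state()
state["scope"] = "login feature, slice 1"
save_state(state)

# A brand-new session (or a different model) resumes from the file,
# with no dependency on the conversation that produced it.
resumed = load_state()
```

Nothing about this requires a framework. The property that matters is that the state exists whether or not any agent is executing.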

The difference between a real multi-agent architecture and "I orchestrated an agent in a loop" is whether all four properties hold. Most "multi-agent" frameworks get single responsibility right and miss the other three. That is why they demo well and fall over in production.

An aside on sub-agents: an agent that spawns helpers for parallel work mid-session is not the same thing. Sub-agents inherit the primary's context, share one workflow, and die when the session ends. A multi-agent architecture is persistent — its state exists whether or not any agent is executing. You can use sub-agents inside it, but the persistence is what makes the work survive.

What This Looks Like In Practice

In our implementation, decomposing the delivery workflow ended up producing roughly thirty agents, organized into three categories.

Figure 2: One worked example of a multi-agent topology. An orchestrator routes work to phase specialists (plan, test, code, review, ship), role specialists (bug-fix, mutation-test, dependency-review), and infrastructure specialists. A persistent workflow state sits beneath them, surviving every session boundary.

The phases map to the delivery value stream — plan becomes acceptance test becomes failing test becomes minimal code becomes refactor becomes review becomes ship. Each transition is a gate. Each gate has preconditions. The orchestrator enforces them. No agent skips a phase because it does not know there is a next phase — the orchestrator does.
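Gate enforcement is a few lines once the phases are explicit. A sketch, assuming each gate result is recorded in the workflow state; the precondition check here is deliberately minimal:

```python
# The phase order from the delivery value stream described above.
PHASES = ["plan", "acceptance_test", "failing_test",
          "minimal_code", "refactor", "review", "ship"]

def advance(state: dict) -> dict:
    """Move to the next phase only if the current gate has passed."""
    current = state["phase"]
    if current not in state["gates_passed"]:
        raise RuntimeError(f"gate for {current!r} has not passed")
    idx = PHASES.index(current)
    if idx + 1 >= len(PHASES):
        raise RuntimeError("already at final phase")
    return {**state, "phase": PHASES[idx + 1]}

state = {"phase": "plan", "gates_passed": ["plan"]}
state = advance(state)  # now in acceptance_test
```

No agent here knows the phase list exists. Each one does its job and returns; the orchestrator is the only component that knows there is a next phase.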

The number isn't the point. The pattern is: one kind of judgment per agent, bounded context per agent, structured handoffs, persistent state. Your workflow might decompose into eight agents or fifteen or forty. The count is a consequence of where the "wearing two hats" boundaries fall in your delivery pipeline.

What to Build First

If you are running on one agent today and this resonates, do not rebuild into thirty overnight. Do this sequence.

Split the reviewer first. The highest-leverage split is pulling code review out of the same agent that wrote the code. The context that wrote the code is the worst context to judge it — the agent has already convinced itself the work is good. A dedicated reviewer agent, with its own skill set and no memory of the implementation reasoning, sees what the coder cannot.
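The split can be expressed as a handoff filter: the reviewer's input is constructed from the coder's output, and the coder's reasoning never crosses the boundary. A sketch with stubbed agent calls; `run_coder` stands in for whatever model invocation your stack uses:

```python
def run_coder(spec: str) -> dict:
    # Stub: a real call returns both the diff and the internal
    # reasoning the model used to convince itself the work is good.
    return {
        "diff": "+ def login(): return True",
        "reasoning": "I assumed the session token is always valid...",
    }

def review_input(coder_output: dict, spec: str) -> dict:
    # Filter the handoff: diff and spec in, reasoning dropped.
    # The reviewer judges the work, not the coder's self-justification.
    return {"diff": coder_output["diff"], "spec": spec}

spec = "login succeeds with valid credentials"
handoff = review_input(run_coder(spec), spec)
```

Dropping the reasoning is deliberate, not an optimization: a reviewer that reads the coder's justification inherits the coder's blind spots.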

Make state external. Move whatever decisions the agent is currently keeping in its context into a file the orchestrator reads and writes. A workflow-state file, a status field, a gate flag. If the work can survive the session ending, you have an orchestrator. If it cannot, you do not — no matter what the framework is called.

Then split by phase. Once the reviewer is separate and state is external, split out the planner, then the test writer, then the coder, then the shipper. You do not need thirty. You need enough that no single agent is doing two distinct kinds of judgment at once.

Resist the "one agent, better prompt" temptation. Every attempt we have seen ends at the same walls — context bloat, role contamination, cascading failures. A better prompt does not fix an architecture problem.

The Bottom Line

The model will keep getting better. That improvement lands differently depending on what it lands inside of.

Dropped into a single-agent workflow, a better model runs into the same four walls faster. Dropped into a multi-agent architecture with bounded responsibilities, structured handoffs, and persistent state, the same improvement compounds — because every specialist gets sharper at the one thing it does.

We have not found a way around this. The model is not the limit. The architecture around it is.

Frequently Asked Questions

Why do single-agent AI coding workflows fail at production scale?

Not because the model is too small. A single coding agent hits four architectural limits: context window bloat as unrelated history accumulates, role contamination when the same context holds planning, coding, and review rules that interfere with each other, a lack of error boundaries so silently-wrong intermediate output corrupts everything downstream, and no persistent state so work does not survive a session ending. None of these are prompt-engineering problems. They are architecture problems — and architecture problems require architecture solutions.


What is a multi-agent coding system?

A system that decomposes the delivery workflow across specialized agents, each owning one kind of judgment: planning, coding, reviewing, or shipping. Each agent gets a bounded context, every boundary carries a structured contract instead of prose, and a stateful orchestrator holds workflow state outside any individual agent.

What's the difference between sub-agents and a multi-agent architecture?

Sub-agents are helpers spawned mid-session. They inherit the primary agent's context, share one workflow, and die when the session ends. A multi-agent architecture is persistent: its workflow state exists whether or not any agent is executing, so work survives session boundaries, model changes, and human pauses.

How do you prevent context bloat in a multi-agent coding system?

By bounding context per agent and filtering it at every handoff. Each agent receives only what it needs: the coder does not see the planning notes, and the reviewer does not see the coder's internal reasoning. This is the primary quality and cost lever, more than model tier.

How many agents does a production-grade AI coding system need?

There is no magic number; the count is a consequence of where the role boundaries fall in your delivery pipeline. The implementation described here landed at roughly thirty agents, but yours might decompose into eight, fifteen, or forty. You need enough that no single agent is doing two distinct kinds of judgment at once.

Ready to See This In Practice?

The architecture is the point. The implementation is one example. See what a multi-agent coding system looks like running on your codebase — and where the single-agent approach is quietly costing you.

Talk to Us
