On this page
- TL;DR
- Why multi-agent at all
- The five patterns
- 1. Orchestrator-worker
- 2. Supervisor (router)
- 3. Hierarchical
- 4. Swarm / peer-to-peer
- 5. Pipeline
- A decision tree
- The frameworks (honest takes)
- LangGraph
- CrewAI
- AutoGen
- OpenAI Agents SDK
- Anthropic's reference patterns
- "No framework"
- Picking models for multi-agent
- State management — the actual hard part
- Observability for multi-agent
- Eval for multi-agent
- Common failure modes
- When to NOT use multi-agent
- What's coming next
- The summary
TL;DR
- Multi-agent systems are useful when a problem genuinely decomposes into specialist roles. They are slow, expensive, and over-applied otherwise.
- The five patterns worth knowing: orchestrator-worker, supervisor (router), hierarchical, swarm/peer-to-peer, and pipeline.
- Frameworks: LangGraph is the serious choice for stateful graphs. CrewAI is the fastest path to a working demo. AutoGen is strong for conversational multi-agent setups. OpenAI Agents SDK and Anthropic's reference patterns are the lightweight defaults.
- Most "multi-agent" systems would be better as a single well-instrumented agent with a planning step. Try that first.
- Eval gets harder, not easier, with multiple agents. Plan for it from day one.
Why multi-agent at all
A single agent loop with 12 tools can solve a lot. So why split into multiple agents?
Three legitimate reasons:
- Role specialization improves quality. A "researcher" with web tools and a long-context model, plus a separate "writer" with a fast model and stylistic instructions, beats one agent doing both.
- Tool isolation reduces confusion. An agent with 40 tools is worse than four agents with 10 each. Smaller toolsets, better accuracy.
- Parallelism cuts wall time. Independent subtasks (research five competitors, draft five sections) run concurrently.
Bad reasons people pick multi-agent:
- "It feels more sophisticated." It isn't. It's just more moving parts.
- "The framework demo had it." Demos optimize for impressiveness. Production optimizes for reliability.
- "We want emergent behavior." Emergent behavior in production is a euphemism for unexplained behavior. Don't ship it.
If you haven't read what AI agents are and how they actually work, start there. This post assumes you know the basics.
The five patterns
1. Orchestrator-worker
One "lead" agent decomposes the goal, dispatches subtasks to worker agents, and synthesizes results.
- Good for: research crews, content generation pipelines, data extraction across many sources.
- Failure mode: the orchestrator's plan is wrong and workers execute it faithfully. Validate plans before dispatching.
- Tip: keep workers stateless and idempotent. The orchestrator owns memory.
This is the most useful and most common pattern in 2026.
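The pattern can be sketched in a few lines. This is a minimal illustration, not any framework's API: `plan` and `worker` are hypothetical stand-ins for model calls, workers are stateless and idempotent, and the orchestrator is the only place results accumulate.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(goal: str) -> list[str]:
    # Orchestrator decomposes the goal; a stand-in for a planning model call.
    return [f"{goal} / part {i}" for i in range(3)]

def worker(subtask: str) -> str:
    # Stateless and idempotent: same subtask in, same result out.
    return f"result for: {subtask}"

def orchestrate(goal: str) -> dict:
    subtasks = plan(goal)
    # Dispatch independent subtasks in parallel; the orchestrator owns all shared state.
    with ThreadPoolExecutor(max_workers=3) as pool:
        results = list(pool.map(worker, subtasks))
    # Synthesize (here: just collect in order).
    return {"goal": goal, "results": results}
```

Because workers are idempotent, a failed worker can simply be re-dispatched with the same subtask.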
2. Supervisor (router)
A lightweight router classifies the input and dispatches to one of N specialist agents. No collaboration; just routing.
- Good for: customer support (billing vs. technical vs. sales), code assistants (Python vs. SQL vs. infra), triage workflows.
- Failure mode: ambiguous inputs that genuinely belong to two specialists; the router is forced to pick one.
- Tip: the router can be a small, fast model. Don't waste Opus on classification.
Half of "multi-agent" systems in the wild are really just this. That's fine.
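A supervisor reduces to "classify, then dispatch." The sketch below uses a keyword check where a real system would use a small classifier model; the specialist names and labels are illustrative.

```python
SPECIALISTS = {
    "billing": lambda q: f"[billing] {q}",
    "technical": lambda q: f"[technical] {q}",
    "sales": lambda q: f"[sales] {q}",
}

def classify(query: str) -> str:
    # Stand-in for a small, fast classifier model returning one label.
    for label in SPECIALISTS:
        if label in query.lower():
            return label
    return "technical"  # explicit fallback for ambiguous inputs

def supervise(query: str) -> str:
    # Route once, dispatch once -- no collaboration between specialists.
    return SPECIALISTS[classify(query)](query)
```

The explicit fallback matters: ambiguous inputs should land somewhere deliberate, not wherever the classifier happens to drift.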
3. Hierarchical
Orchestrator-worker, but workers are themselves orchestrators. Trees of agents.
- Good for: very large problems with natural recursion (org-wide research, codebase-wide refactors).
- Failure mode: debugging is brutal. A bug at depth 4 produces a stack trace nobody wants to read.
- Tip: rarely worth it. Cap depth at 2. If you're going deeper, you probably want a workflow engine, not nested agents.
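The depth cap is easy to enforce mechanically. In this sketch (decomposition is a hard-coded stand-in for a model call), any node at the cap executes directly instead of spawning sub-orchestrators:

```python
MAX_DEPTH = 2  # hard cap: beyond this, execute instead of decomposing

def run_node(goal: str, depth: int = 0) -> list[str]:
    if depth >= MAX_DEPTH:
        # Leaf worker: do the task rather than recursing further.
        return [f"done: {goal}"]
    # Stand-in for model-driven decomposition into two subgoals.
    subgoals = [f"{goal}.{i}" for i in range(2)]
    results: list[str] = []
    for sg in subgoals:
        results.extend(run_node(sg, depth + 1))
    return results
```
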
4. Swarm / peer-to-peer
Multiple agents share a workspace (a doc, a queue, a state object) and coordinate without a central boss.
- Good for: research and exploration where the right division of labor isn't known upfront.
- Failure mode: stalls, duplicate work, race conditions on shared state.
- Tip: add a coordination protocol — explicit lock or claim semantics — or you'll spend more time debugging than shipping.
Cool in papers. Hard in production. Use sparingly.
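Claim semantics are the minimum viable coordination protocol. A hypothetical shared board where agents atomically claim tasks before working on them avoids both duplicate work and races:

```python
import threading

class SharedBoard:
    """Shared workspace with explicit claim semantics (illustrative, not a framework API)."""
    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._unclaimed = list(tasks)
        self.done = []

    def claim(self):
        # Atomically claim one task or get None; no two agents receive the same task.
        with self._lock:
            return self._unclaimed.pop() if self._unclaimed else None

    def complete(self, task, result):
        with self._lock:
            self.done.append((task, result))

def agent(board, name):
    # Peer agent: claim, work, record, repeat until the board is empty.
    while (task := board.claim()) is not None:
        board.complete(task, f"{name} did {task}")

board = SharedBoard(["t1", "t2", "t3", "t4"])
threads = [threading.Thread(target=agent, args=(board, f"agent{i}")) for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```

Without the lock around claim/complete, two agents can pop the same task or corrupt the results list, which is exactly the race condition named above.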
5. Pipeline
A → B → C. Each agent has a fixed role; output of one feeds the next.
- Good for: ETL-like workflows (extract → transform → validate → load), content production (draft → edit → fact-check → publish).
- Failure mode: error propagation. Bad input to step 2 ruins everything downstream.
- Tip: add validation between steps. Treat each handoff as a contract.
The most "boring" pattern, and often the most reliable.
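Treating each handoff as a contract looks like this in miniature. The stage functions stand in for agent calls; the validator runs after every stage so bad output fails fast instead of propagating:

```python
def extract(x: str) -> str: return x.strip()
def transform(x: str) -> str: return x.upper()
def load(x: str) -> str: return f"LOADED:{x}"

def validate(stage: str, value: str) -> str:
    # Handoff contract: every stage must yield non-empty text.
    if not value:
        raise ValueError(f"contract violated after stage {stage!r}")
    return value

def run_pipeline(data: str) -> str:
    # Output of each stage is validated before becoming the next stage's input.
    for stage in (extract, transform, load):
        data = validate(stage.__name__, stage(data))
    return data
```

In production the contract is usually a schema check rather than a non-empty check, but the shape is the same: validate at the boundary, fail loudly, name the stage.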
A decision tree
Before you reach for a framework:
- Is the task one model + N tools that finishes in under 20 steps? Single agent. Done.
- Are the steps fixed and the failure modes well known? Workflow, not agent. Use a job runner.
- Does the work split into 2-4 specialist roles with clear handoffs? Orchestrator-worker or pipeline.
- Are inputs heterogeneous and routed to different specialists? Supervisor.
- Is the problem genuinely exploratory with unknown decomposition? Swarm, but be ready for pain.
If after this exercise you still want multi-agent, then pick a framework.
The frameworks (honest takes)
LangGraph
LangChain's stateful graph runtime. Nodes are agents or tools, edges are transitions, state is a typed object that flows through.
- Strengths: explicit state, branching logic, persistence, time-travel debugging, human-in-the-loop nodes.
- Weaknesses: verbose, opinionated, needs you to think in graphs from day one.
- When to use: serious production multi-agent systems where state and control matter more than time-to-demo.
The only framework I'd recommend for an enterprise build today.
CrewAI
Define agents with roles and goals, define tasks, run a "crew" that executes them.
- Strengths: great DX, fastest path to a working demo, strong defaults.
- Weaknesses: abstraction leaks under pressure, harder to debug subtle issues, opinionated about how agents should behave.
- When to use: prototypes, hackathons, internal tools, teams that don't want to build orchestration plumbing.
Excellent for getting moving. Reach for LangGraph when you outgrow it.
AutoGen
Microsoft Research's conversational multi-agent framework. Agents talk to each other in a chat-like loop.
- Strengths: great for conversational patterns, group chat, debate-based problem solving.
- Weaknesses: the chat metaphor doesn't fit every workflow; can produce verbose, expensive runs.
- When to use: when "agents discussing" is genuinely the right model — code review crews, research debates, brainstorming.
Strong but niche. The conversational metaphor is a feature, not a default.
OpenAI Agents SDK
OpenAI's official agents framework. Lightweight handoffs, tracing, and tool support.
- Strengths: simple, well-documented, excellent tracing UX.
- Weaknesses: OpenAI-flavored; tightly coupled to their ecosystem; less powerful than LangGraph for complex flows.
- When to use: OpenAI-only stacks; teams wanting a sane default with great observability.
Anthropic's reference patterns
Anthropic publishes reference implementations rather than a framework. The "Building effective agents" essay is the bible.
- Strengths: opinionated, simple, model-agnostic, no framework lock-in.
- Weaknesses: you build it yourself.
- When to use: when your team is strong enough to wire it up, and you'd rather understand every line than import abstractions.
This is the path most experienced agent-builders eventually take.
"No framework"
A surprising number of production multi-agent systems are 300-800 lines of Python or TypeScript using only the provider SDK and a queue.
- Strengths: total control, easy debugging, no abstraction tax.
- Weaknesses: you reinvent state management, retries, observability.
- When to use: when your patterns are simple (supervisor, pipeline) and your team is senior.
Don't be embarrassed by this option. It often wins.
Picking models for multi-agent
Mix aggressively. The cost difference between Opus and Sonnet across a multi-agent run can be 5-10x.
- Routers and classifiers: Sonnet 4.6, GPT-5 mini, or Gemini 2.5 Flash. Fast and cheap.
- Workers doing routine tasks: Sonnet 4.6 default. Opus only when quality regressions show up.
- Planners and orchestrators: Claude Opus 4.7 or o3. Planning quality compounds across the run.
- Long-context aggregators: Gemini 2.5 Pro for million-token synthesis.
- Code-heavy specialists: Claude Opus 4.7 or Claude Sonnet 4.6.
A good rule: the orchestrator gets the smart model, workers get the cheap one. Inverting this is a common mistake.
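One way to make the rule hard to invert is to centralize role-to-model assignment in a single table. The tier names below are placeholders, not pinned API identifiers:

```python
# Illustrative role -> model tier mapping; substitute your provider's model IDs.
MODEL_BY_ROLE = {
    "router": "small-fast-model",
    "worker": "mid-tier-model",
    "orchestrator": "frontier-model",
    "aggregator": "long-context-model",
}

def pick_model(role: str) -> str:
    # Unknown roles default to the cheap worker tier, never the expensive one.
    return MODEL_BY_ROLE.get(role, MODEL_BY_ROLE["worker"])
```
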
State management — the actual hard part
In a single-agent loop, state is the conversation. In a multi-agent system, state is a graph and you have to design it.
Key decisions:
- What's shared vs. private? The orchestrator's plan is shared. A worker's tool-call history is usually private.
- Where does state live? In-memory (fragile), Redis (good for ephemeral runs), Postgres (durable), a workflow engine like Temporal (when it gets serious).
- How do you handle partial failure? Worker 3 of 5 crashes. Do you retry just that worker? Restart the orchestrator? Fall back to a degraded plan?
- Are runs resumable? Long-running multi-agent runs need checkpointing. Otherwise a network blip throws away 30 minutes of work.
LangGraph and Temporal solve a lot of this for you. If you roll your own, plan for at least a sprint of state-management work.
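Checkpointing for resumability is conceptually simple, as this file-backed sketch shows (the per-task "work" is a stand-in for a worker call; real systems would checkpoint to Redis or Postgres):

```python
import json, os, tempfile

def run_with_checkpoints(subtasks: list[str], checkpoint_path: str) -> dict:
    # Resume from a prior attempt if a checkpoint exists.
    state = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    for task in subtasks:
        if task in state:
            continue  # finished in a previous attempt; skip
        state[task] = f"done:{task}"  # stand-in for a worker call
        with open(checkpoint_path, "w") as f:
            json.dump(state, f)  # checkpoint after every completed subtask
    return state

path = os.path.join(tempfile.mkdtemp(), "run.json")
first = run_with_checkpoints(["a", "b"], path)
# Simulated crash-and-retry: the second call resumes and only executes the new task.
resumed = run_with_checkpoints(["a", "b", "c"], path)
```
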
Observability for multi-agent
A single-agent trace is linear. A multi-agent trace is a tree (or worse, a DAG). You need:
- Per-agent traces. Each agent's prompts, tool calls, outputs.
- Cross-agent correlation. A run ID that ties everything together.
- Step counts and budgets per agent. So you can see who's burning tokens.
- Replay. Take a failed run, change one prompt, re-run from a checkpoint.
LangSmith, Langfuse, and Helicone all support this in 2026. Pick one early. Don't wait until you have 200 failed runs to start instrumenting.
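The core of cross-agent correlation is small: stamp every event with the run ID, and per-agent budgets fall out of the trace. This is an in-memory sketch of the idea, not any vendor's SDK:

```python
import uuid
from collections import defaultdict

TRACE: list[dict] = []  # stand-in for a real tracing backend

def emit(run_id: str, agent: str, event: str, **fields) -> None:
    # Every event carries the run ID so spans from different agents correlate.
    TRACE.append({"run_id": run_id, "agent": agent, "event": event, **fields})

def run(goal: str) -> str:
    run_id = str(uuid.uuid4())
    emit(run_id, "orchestrator", "plan", goal=goal)
    for i in range(2):
        emit(run_id, f"worker-{i}", "result", tokens=100 + i)
    return run_id

def token_spend(run_id: str) -> dict:
    # "Who's burning tokens?" is one pass over the correlated trace.
    spend = defaultdict(int)
    for e in TRACE:
        if e["run_id"] == run_id:
            spend[e["agent"]] += e.get("tokens", 0)
    return dict(spend)
```
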
Eval for multi-agent
Eval gets harder because:
- The output of one agent is the input to another. Did the system fail because of the writer or because the researcher gave it bad data?
- Many runs have multiple "correct" decompositions. Hard to score.
- Cost-quality tradeoffs are now multidimensional — you can spend more on the orchestrator or more on workers.
Practical approach:
- Eval each agent in isolation with synthetic inputs. Catches regressions in individual roles.
- Eval end-to-end with held-out real cases scored by a human-graded rubric or a strong LLM judge.
- Track decomposition quality separately from execution quality. They fail differently.
- Watch cost-per-successful-run as the primary efficiency metric.
If you can't define what "successful run" means, you cannot ship a multi-agent system. Stop and define it.
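Once "successful run" is defined, cost-per-successful-run is a one-liner worth standardizing on, because it penalizes failed runs as pure overhead:

```python
def cost_per_successful_run(runs: list[tuple[float, bool]]) -> float:
    # runs: (cost_usd, succeeded) pairs. Total spend -- including failures --
    # divided by the number of successes.
    total = sum(cost for cost, _ in runs)
    wins = sum(1 for _, ok in runs if ok)
    return float("inf") if wins == 0 else total / wins
```

Note that a system that fails half the time effectively doubles its per-success price, which is exactly what this metric surfaces.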
Common failure modes
- Orchestrator hallucinates a worker. "Dispatch to the LegalAgent" — there is no LegalAgent. Constrain dispatch via tool schema, not free text.
- Workers loop on each other. A and B keep calling each other. Add cycle detection.
- Cost runaway in swarms. No one's accountable for the budget. Set a global cap per run.
- Stale shared state. Worker reads state, takes 60 seconds, writes back stale value. Use optimistic concurrency or locks.
- Silent specialist drift. A specialist's outputs degrade over time because nobody re-evals it. Schedule recurring eval runs.
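The first failure mode above has a mechanical fix: constrain dispatch to an explicit registry so unknown worker names are rejected at the boundary. The registry below is hypothetical; in practice the same constraint lives in a tool schema's enum.

```python
# Hypothetical worker registry; dispatch is constrained to exactly these names.
WORKERS = {
    "researcher": lambda t: f"researched: {t}",
    "writer": lambda t: f"wrote: {t}",
}

def dispatch(worker_name: str, task: str) -> str:
    # Enum-constrained dispatch: the orchestrator cannot invent a "LegalAgent"
    # because unknown names fail here instead of silently doing nothing.
    if worker_name not in WORKERS:
        raise ValueError(f"unknown worker {worker_name!r}; valid: {sorted(WORKERS)}")
    return WORKERS[worker_name](task)
```
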
When to NOT use multi-agent
- Latency-sensitive UX. Multi-agent runs in seconds, sometimes minutes. Users notice.
- Strict cost budgets. A multi-agent run can be 5-20x a single-agent equivalent.
- Regulated workflows. More agents = more surface area to audit.
- Small teams. The operational burden is real. One person maintaining a 6-agent system is one person doing two jobs.
What's coming next
- Tighter MCP integration. Specialists that share tools via MCP instead of duplicating client code.
- Better planners. o3 and successors are noticeably better at decomposition than 2024 models. Orchestrators get smarter.
- Workflow + agent hybrids. Temporal-style durable execution wrapping agent calls. Best of both worlds.
- Cheap reasoning. As reasoning models drop in price, "let the orchestrator think harder" becomes routine.
For the broader picture on where this is headed, see 2026: the year of agentic AI.
The summary
- Multi-agent is the right tool for specialization, isolation, and parallelism. Otherwise prefer a single agent.
- Five patterns: orchestrator-worker, supervisor, hierarchical, swarm, pipeline. Most production wins are orchestrator-worker or pipeline.
- LangGraph for serious builds, CrewAI for fast demos, AutoGen for conversational patterns, no-framework when you can.
- Mix models aggressively. Smart orchestrator, cheap workers.
- Eval and observability matter more, not less. Build them in from day one.
Multi-agent systems are powerful. They are also a great way to ship a beautiful, expensive, broken thing. Build the boring single-agent version first.
Designing a multi-agent system? NovaKit lets you compare Claude Opus 4.7, GPT-5, Gemini 2.5 Pro, and o3 side-by-side so you can pick the right model for each role — without juggling six dashboards.