Multi-agent systems are mostly hype. Here are the patterns we've seen actually deliver value, plus the ones we'd avoid until the tooling is more mature.
The "agents collaborating with agents" framing is one of the trendiest in AI right now. We've tried it on real production use cases. Some patterns deliver; most don't. This is an honest account of where we've found multi-agent systems valuable and where the hype outpaces reality.
"Multi-agent" has two common interpretations:
1. Single LLM, multiple roles in sequence: a "planner" agent decomposes a task; a "worker" agent does it; a "checker" agent verifies. All are the same model called with different prompts.
2. Multiple models / multiple processes: independent agents running concurrently, communicating via a message bus or shared state, each making its own decisions.
In production, almost everything that calls itself multi-agent is interpretation #1. Interpretation #2 is mostly research; outside of niche cases it's hard to justify the complexity over a well-orchestrated single agent.
The patterns we've used in production:
Decompose-then-execute. A long, complex task is split into sub-tasks by one prompt, then each sub-task is handled by a focused prompt. This is helpful when:
For our document processing pipeline: one call extracts a list of sections, then parallel calls process each section. Total wall-clock latency drops 4-5x via parallelism.
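A minimal sketch of that fan-out, assuming Python's asyncio. `call_llm` and `extract_sections` are hypothetical stand-ins (the real versions would call a model), and the concurrency cap of 5 is an illustrative value, not a figure from our pipeline.

```python
import asyncio


async def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call via an SDK client."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"processed: {prompt}"


async def extract_sections(document: str) -> list[str]:
    """Step 1: split the document into sections.

    Assumption: we split on blank lines here so the sketch runs offline;
    in the pattern described above this would be a single model call.
    """
    return [part.strip() for part in document.split("\n\n") if part.strip()]


async def process_document(document: str, max_concurrency: int = 5) -> list[str]:
    """Step 2: process every section concurrently, capped by a semaphore."""
    sections = await extract_sections(document)
    semaphore = asyncio.Semaphore(max_concurrency)

    async def process(section: str) -> str:
        async with semaphore:
            return await call_llm(section)

    # gather preserves input order, so results line up with sections.
    return await asyncio.gather(*(process(s) for s in sections))


results = asyncio.run(
    process_document("Intro text\n\nMethods text\n\nResults text")
)
```

The semaphore matters in practice: without a cap, a document with hundreds of sections would fire hundreds of simultaneous requests and likely hit provider rate limits.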
Plan, then critique. First call generates a plan / answer. Second call critiques it. Third call revises based on critique. This costs 3x the tokens but consistently produces better output for hard reasoning tasks.
We use this for our agentic task runner. The "plan + critique + revise" loop catches errors that the model would miss in a single pass. About 12% of the planned actions get corrected by the critique step.
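The loop can be sketched as three sequential calls to the same model. This is an illustration under assumptions, not our task runner: the prompt wording is made up, and `fake_llm` is a hypothetical stub that stands in for a real SDK call.

```python
from typing import Callable

LLM = Callable[[str], str]  # any function that maps a prompt to a completion


def plan_critique_revise(task: str, llm: LLM) -> dict[str, str]:
    """Run plan, critique, and revise as three separate model calls."""
    plan = llm(f"Write a step-by-step plan for this task:\n{task}")
    critique = llm(
        f"Task:\n{task}\n\nProposed plan:\n{plan}\n\n"
        "List concrete errors, gaps, or risks in this plan."
    )
    final = llm(
        f"Task:\n{task}\n\nOriginal plan:\n{plan}\n\nCritique:\n{critique}\n\n"
        "Rewrite the plan so that it addresses the critique."
    )
    # Keep every intermediate step so the flow can be logged and debugged.
    return {"plan": plan, "critique": critique, "final": final}


# Hypothetical stub model that records the prompts it receives.
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"response {len(calls)}"


result = plan_critique_revise("Migrate the database", fake_llm)
```

Each later call embeds the earlier outputs in its prompt, which is where the roughly 3x token cost comes from.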
Tool-using agent + format converter. The tool-using agent (which can call APIs, do web searches, etc.) produces messy output. A second prompt formats the messy output into the structured response the user expects. Separates concerns; each prompt is shorter and more focused.
This pattern is just "multi-step prompting." Calling it "multi-agent" is partly marketing.
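As a rough sketch of the formatting step: the second prompt asks for JSON, and the caller validates the result before returning it. The key names in `REQUIRED_KEYS` and the prompt text are hypothetical, and `fake_formatter` stands in for a real model call.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"title", "summary", "sources"}  # hypothetical response schema


def format_response(messy_output: str, llm: Callable[[str], str]) -> dict:
    """Ask a second prompt to turn messy tool output into validated JSON."""
    prompt = (
        "Convert the notes below into a JSON object with exactly these keys: "
        f"{sorted(REQUIRED_KEYS)}. Return only JSON.\n\n{messy_output}"
    )
    raw = llm(prompt)
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("formatter output is not a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"formatter output is missing keys: {sorted(missing)}")
    return data


def fake_formatter(prompt: str) -> str:
    """Hypothetical stub that returns a well-formed response."""
    return '{"title": "Q3 report", "summary": "Revenue grew.", "sources": ["crm"]}'


formatted = format_response("q3 rpt... rev up?? src: crm", fake_formatter)
```

Validating in code rather than trusting the model keeps the contract with downstream consumers explicit: a malformed response fails loudly instead of propagating.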
Patterns to avoid:
Free-form negotiation between agents. "Agent A and Agent B discuss until they agree." Costs explode (each "turn" is a call); the conversations meander; quality is unpredictable. We've never seen this beat a focused single-prompt approach in production.
Long-running autonomous agents that run for hours. Failure modes are hard to debug, costs are unpredictable, and the agent can do unexpected things. We enforce hard limits on both time and cost for every agent task.
"Specialist" agents for trivial sub-tasks. A "research agent" that does a web search, a "summarization agent" that summarizes the result. These could be one agent with two tool calls. Splitting them into multiple "agents" adds overhead without benefit.
Recursive agent spawning (an agent decides to spawn child agents which spawn child agents). Cost and complexity explode. Debugging is impossible. We bound depth at 1 — agents can use tools, but agents don't spawn other agents.
The general failure pattern: "agent" framing encourages thinking of the LLM as autonomous and powerful. In production, you usually want the opposite — tightly controlled prompts with bounded scope, explicit handoffs, and human checkpoints for anything consequential.
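The hard limits mentioned above (time, cost, spawn depth) can be enforced with a small guard that every step consults. This is a sketch under assumptions: `TaskBudget` and `BudgetExceeded` are hypothetical names, the dollar figures are illustrative, and the depth convention (the top-level agent runs at depth 1, so a spawned child would be depth 2) is ours for the example.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when an agent task crosses a hard limit."""


@dataclass
class TaskBudget:
    max_seconds: float
    max_cost_usd: float
    max_depth: int = 1  # depth 1 = top-level agent only, no child agents
    spent_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def check(self, depth: int = 1) -> None:
        """Raise if any limit has been crossed; call before each step."""
        if depth > self.max_depth:
            raise BudgetExceeded(f"depth {depth} exceeds limit {self.max_depth}")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("time limit exceeded")
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded("cost limit exceeded")

    def charge(self, cost_usd: float) -> None:
        """Record the cost of a completed call, then re-check the limits."""
        self.spent_usd += cost_usd
        self.check()


budget = TaskBudget(max_seconds=60, max_cost_usd=0.50)
budget.charge(0.20)
budget.charge(0.20)
try:
    budget.charge(0.20)  # pushes spend past the 0.50 limit
    exceeded = False
except BudgetExceeded:
    exceeded = True
```

Raising an exception rather than returning a flag means a step that forgets to inspect the result still stops.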
When multiple LLM calls work together, what passes between them?
The simple answer: structured data. A JSON object with the task, current state, and any artifacts produced.
What we DON'T do: have agents pass natural language to each other. "Agent A says: I think we should... Agent B responds: That's a good idea but..." is expensive and noisy. JSON or other structured formats are cheaper and clearer.
For one project, switching from agent-to-agent natural-language communication to JSON cut the token usage by 70% with no quality loss.
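One way to represent such a handoff, sketched with a dataclass and the standard `json` module. The `Handoff` class and its example field values are hypothetical; only the three top-level parts (task, current state, artifacts) come from the description above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Handoff:
    """Structured payload passed from one prompt to the next."""

    task: str
    state: dict = field(default_factory=dict)
    artifacts: list = field(default_factory=list)

    def to_json(self) -> str:
        # sort_keys gives a stable serialization, which helps when diffing logs
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Handoff":
        data = json.loads(raw)
        return cls(
            task=data["task"],
            state=data.get("state", {}),
            artifacts=data.get("artifacts", []),
        )


handoff = Handoff(
    task="summarize section 3",
    state={"step": "extract", "sections_done": 2},
    artifacts=[{"type": "section_text", "id": "s3"}],
)
wire = handoff.to_json()
restored = Handoff.from_json(wire)
```

The receiving prompt gets only these fields serialized into its input, not the sending prompt's full conversation.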
Multi-step agent systems need to track state:
We use a simple shared state object (basically a Python dict, persisted to Redis) that all "agents" (prompts) read from and write to. Each prompt's input includes the relevant subset of state; each prompt's output updates the state.
This is much cleaner than "agents passing each other their full context." The state is the truth; prompts read what they need.
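A simplified sketch of that shared state object. To keep it runnable without a server it uses `InMemoryStore`, a hypothetical test double; a redis-py client could be passed instead since it also exposes `get` and `set`, though that wiring is an assumption here and not our production code. The key prefix and field names are illustrative.

```python
import json
from typing import Optional, Protocol


class KeyValueStore(Protocol):
    """Minimal interface the state object needs from its backing store."""

    def get(self, key: str) -> Optional[bytes]: ...
    def set(self, key: str, value: str) -> object: ...


class InMemoryStore:
    """Hypothetical test double standing in for a Redis client."""

    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> object:
        self._data[key] = value.encode("utf-8")
        return True


class SharedState:
    """Single source of truth that every prompt reads from and writes to."""

    def __init__(self, store: KeyValueStore, task_id: str) -> None:
        self.store = store
        self.key = f"agent-state:{task_id}"  # illustrative key naming

    def load(self) -> dict:
        raw = self.store.get(self.key)
        return json.loads(raw) if raw else {}

    def view(self, fields: list[str]) -> dict:
        """Return only the subset of state a given prompt needs."""
        state = self.load()
        return {name: state[name] for name in fields if name in state}

    def update(self, changes: dict) -> dict:
        """Merge a prompt's output into the persisted state."""
        state = self.load()
        state.update(changes)
        self.store.set(self.key, json.dumps(state))
        return state


state = SharedState(InMemoryStore(), task_id="doc-42")
state.update({"sections": ["intro", "methods"], "status": "extracted"})
state.update({"status": "processing"})
planner_input = state.view(["status"])
```

Note that `update` is a plain read-modify-write, so concurrent writers could overwrite each other; a real deployment with parallel steps would need some form of locking or per-field keys.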
A multi-step agent flow costs more than a single prompt. The breakdown for our document processing:
The 50% cost increase pays back via:
For tasks where latency or quality matters, the cost trade is acceptable. For high-volume cheap tasks, single-prompt is fine.
The plan-and-critique pattern is more expensive: ~3x the tokens. Worth it only for hard reasoning tasks where quality matters more than cost.
We've tried LangChain (and its agent abstractions), CrewAI, and AutoGen. Honest assessment:
LangChain agents are flexible, but the abstraction has historically been heavy. Recent versions, built around the LangChain Expression Language (LCEL), are cleaner, but the learning curve is real. We use LangChain for some things and build directly with the OpenAI/Anthropic SDKs for most.
CrewAI is appealing if you want the "team of agents" framing. Worked OK in demos; we found it hard to debug and not noticeably better than building with primitives.
AutoGen has nice patterns for multi-LLM conversations. Best fit if you really want interpretation #2 (multiple models/processes). Most of our use cases didn't need that.
In our experience, building with primitives (OpenAI/Anthropic SDKs plus your own state management) works best for production agents. The frameworks accelerate prototyping, but their abstractions don't always fit production needs.
A short list of cases where we've found multi-step / multi-prompt patterns clearly worth it:
In each case, the pattern is simple: a sequence of focused prompts, each doing one thing. Calling the sequence "multi-agent" is mostly framing.
The "we'll have agents do everything" framing tempts teams to skip:
The right model is agents for the messy edges where deterministic logic fails, not agents for the bulk of the work.
For agent systems we've put in production:
Without these guardrails, agent systems are too risky. With them, they're manageable.
Start with a single well-engineered prompt. Multi-step adds complexity; only add it when you've proven the single prompt isn't enough.
Multi-step is just sequenced prompts. Don't overcomplicate it. "Multi-agent" framing often dresses up basic LLM workflows.
Hard limits on time, cost, recursion. Without them, agents will surprise you.
Structured state, not natural-language conversation. JSON between prompts is cheaper and clearer.
Human-in-the-loop for consequential actions. Don't let agents auto-execute. Propose, confirm, execute.
Log everything. Debugging multi-step agent flows requires seeing every step.
Build with primitives unless a framework fits. Frameworks help for prototyping; in production, the abstractions sometimes get in the way.
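The "propose, confirm, execute" and "log everything" points can be combined in one small wrapper. This is a sketch with hypothetical names (`ProposedAction`, `run_action`, the `agent.actions` logger); in a real system `confirm` would block on a human decision instead of a lambda.

```python
import json
import logging
from dataclasses import dataclass
from typing import Callable

logger = logging.getLogger("agent.actions")  # hypothetical logger name


@dataclass
class ProposedAction:
    name: str
    args: dict
    consequential: bool  # True if the action changes real state


def run_action(
    action: ProposedAction,
    execute: Callable[[ProposedAction], object],
    confirm: Callable[[ProposedAction], bool],
) -> dict:
    """Log the proposal, ask for confirmation if needed, then execute."""
    logger.info("proposed %s %s", action.name, json.dumps(action.args))
    if action.consequential and not confirm(action):
        logger.info("rejected %s", action.name)
        return {"status": "rejected", "result": None}
    result = execute(action)
    logger.info("executed %s", action.name)
    return {"status": "executed", "result": result}


executed: list[str] = []


def do_it(action: ProposedAction) -> str:
    executed.append(action.name)
    return "ok"


# A read-only action runs without confirmation; a destructive one is gated.
read = run_action(ProposedAction("search_docs", {"q": "pricing"}, False),
                  do_it, confirm=lambda a: False)
delete = run_action(ProposedAction("delete_record", {"id": 7}, True),
                    do_it, confirm=lambda a: False)
```

Because the proposal is logged before the confirmation check, rejected actions still leave a trace for debugging.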
Multi-agent systems are real and useful for specific patterns. They're also the most-hyped corner of LLM tooling, full of demos that don't survive production constraints. The discipline is in resisting the marketing and using multi-step patterns where they actually help — which is fewer places than the demos suggest.