May 23, 2026 · 12 min read

Multi-Agent Systems Are Messier Than You Think

Compound reliability decay in sequential agent chains

Do the math before you build.

0.95^10 = 0.599. At 95% per-step reliability (which is optimistic), a 10-step sequential pipeline succeeds 60% of the time, assuming step failures are independent. At 20 steps: 36%. You started with agents that fail 1 in 20 times. At 20 steps you ended with a system that fails nearly two-thirds of the time.
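The arithmetic above is easy to reproduce. This is a minimal sketch that assumes step failures are independent; `chain_success` and `max_steps_for` are illustrative helper names, not part of any library.

```python
import math


def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n sequential steps succeed (independent failures assumed)."""
    return p_step ** n_steps


def max_steps_for(p_step: float, target: float) -> int:
    """Longest chain whose end-to-end success rate stays at or above `target`."""
    return math.floor(math.log(target) / math.log(p_step))


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, round(chain_success(0.95, n), 3))
```

Run in the other direction, the same formula gives a step budget: with 95% steps and an 80% end-to-end target, `max_steps_for(0.95, 0.80)` allows 4 steps.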

Google DeepMind research (December 2025): "unstructured multi-agent networks amplify errors up to 17.2x compared to single-agent alternatives." Errors don't cancel — they cascade. Gartner projects 40% of agentic AI projects canceled by end of 2027 due to reliability issues. State management is the named cause.

Tool call loops and the cost spiral

Three root causes cover 90% of infinite loops: missing max_turns, a termination function that never returns True, a system prompt with no clear "done" signal.

The cost: each loop step accumulates context. Turn 10 has all the history of Turns 1–9 in its context window, so input tokens grow with every turn and the cumulative bill grows roughly quadratically with turn count. An unconstrained agent solving a software engineering issue costs $5–8 per task (Stevens Institute). Reflexion loops can consume 50x the tokens of a single linear pass.
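A rough model of that accumulation, as a sketch only: it assumes every turn adds a fixed number of tokens and that the full history is resent each turn, which real agents only approximate. `cumulative_input_tokens` is a hypothetical helper.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens across a run when turn k resends turns 1..k.

    Assumption: each turn contributes `tokens_per_turn` tokens and nothing
    is summarized or truncated.
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


if __name__ == "__main__":
    for turns in (5, 10, 20):
        print(turns, cumulative_input_tokens(turns, 1_000))
```

Under these assumptions, doubling the turn count from 10 to 20 raises total input tokens from 55,000 to 210,000, which is why a hard turn cap matters more than per-call token limits.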

Production guardrails that work:

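from collections import deque

# `agent` and `is_converged` are supplied by your application or framework.
def run_agent(task: str, max_turns: int = 20) -> str:
    recent_states = deque(maxlen=3)  # sliding window for convergence detection
    turns_used = 0
    for turn in range(max_turns):
        result = agent.step(task)
        turns_used = turn + 1
        if result.done:
            return result.output
        recent_states.append(result.output)
        # convergence detection: >85% similarity across 3 consecutive states = stuck
        if len(recent_states) == 3 and is_converged(recent_states):
            break
    # cap reached or stuck: strip tools, force synthesis
    return agent.synthesize(f"Give your best answer now. Steps used: {turns_used}")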
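The loop above leaves `is_converged` undefined. One possible implementation, offered as a sketch: it assumes states are plain strings and uses the standard library's `difflib.SequenceMatcher` as the similarity measure. The source only specifies the 85% threshold and the window of 3; the choice of metric is an assumption.

```python
from difflib import SequenceMatcher
from typing import Sequence


def is_converged(states: Sequence[str], threshold: float = 0.85, window: int = 3) -> bool:
    """True when the last `window` states are all pairwise-adjacent similar.

    Similarity is SequenceMatcher.ratio(), a character-level score in [0, 1].
    """
    recent = list(states)[-window:]
    if len(recent) < window:
        return False  # not enough history to judge
    return all(
        SequenceMatcher(None, a, b).ratio() > threshold
        for a, b in zip(recent, recent[1:])
    )
```

Character-level similarity is cheap but crude. An embedding-based comparison would catch paraphrased loops that this version misses.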

Hard cap at 15–25 steps. At the limit: strip tools from the next call and force synthesis. A 300-second wall-clock timeout covers hanging tools and pathologically long generations.
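The wall-clock limit can be layered on with `asyncio.wait_for`. This is a sketch assuming the agent run is exposed as a coroutine; `run_with_deadline`, `make_coro`, and the fallback string are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_deadline(
    make_coro: Callable[[], Awaitable[str]],
    timeout_s: float = 300.0,
    fallback: str = "TIMED_OUT",
) -> str:
    """Run an agent coroutine, returning `fallback` if it exceeds the deadline."""
    try:
        # wait_for cancels the inner task when the timeout expires
        return await asyncio.wait_for(make_coro(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback
```

Cancellation only takes effect at an `await` point. A tool that blocks the thread (a synchronous HTTP call, a subprocess) needs its own timeout or has to run in a separate thread or process.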

Orchestration patterns

Five patterns worth knowing:

Sequential: deterministic ordering, easy to reason about, dies fast in long chains. Keep under 5 steps; add verification checkpoints between each.
Fan-out / Fan-in: one orchestrator dispatches to N workers in parallel, aggregator merges (vote, weighted merge, or LLM synthesis). Best for time-sensitive multi-perspective analysis.
Group chat: agents participate in a shared conversation thread; good auditability since everything is in one place. Agents typically read-only — no tool execution.
Handoff: one agent active at a time, hands off to a specialist. Classic customer support routing.
Hierarchical supervisor: root → domain leads → workers. Best domain isolation. Can be nested. LangGraph's multi-agent docs call this the canonical production pattern.
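As an illustration of the fan-out / fan-in pattern, here is a minimal sketch using `asyncio.gather` with a majority vote as the aggregator. Workers are assumed to be async callables that return a string; `fan_out_fan_in` is a hypothetical name, and a real aggregator might use a weighted merge or LLM synthesis instead.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, Sequence

Worker = Callable[[str], Awaitable[str]]


async def fan_out_fan_in(task: str, workers: Sequence[Worker]) -> str:
    """Dispatch one task to all workers in parallel and return the majority answer."""
    results = await asyncio.gather(
        *(worker(task) for worker in workers),
        return_exceptions=True,  # one failed worker must not sink the batch
    )
    answers = [r for r in results if not isinstance(r, BaseException)]
    if not answers:
        raise RuntimeError("all workers failed")
    # most_common(1) returns [(answer, count)] for the top-voted answer
    return Counter(answers).most_common(1)[0][0]
```

Dropping failed workers before voting keeps the aggregate available when a minority of workers error out, at the cost of voting over fewer opinions.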

AdaptOrch (Feb 2026) benchmark: adaptive topology selection achieved 22.9% improvement over any single fixed topology. The router chose: 62% hybrid, 24% parallel, 14% hierarchical. Pick the pattern for the task, not because it's your default.

LangGraph vs AutoGen vs CrewAI — honest

| | LangGraph | AutoGen | CrewAI |
|---|---|---|---|
| Maturity | v1.0 GA Oct 2025 | Maintenance mode | Active, pre-v2 |
| Debugging | LangSmith: node-by-node state diffs, time-travel, replay | Conversational logs only, no graph viz | Painful for 5+ agent pipelines |
| Cost control | 40–60% savings from model mixing | Conversation summarization | Task-scoped context |
| Complex task completion | 62% (8+ step tasks) | 58% | 54% |
| Production reference | Klarna, Uber, LinkedIn | Microsoft internal | PwC |
| Fatal gap | Single-process; no distributed task queue | In maintenance | No durable execution mid-ReAct |

AutoGen is in maintenance mode. Microsoft's own statement. The replacement is Microsoft Agent Framework (MAF), public preview October 2025. AG2 is the community fork. Don't start new production systems on AutoGen.

LangGraph checkpointing ≠ durable execution. This is the gap Diagrid identified:

Checkpointing saves state. But if the process crashes, no watchdog notices — the workflow is dead until something external triggers recovery. At hundreds of concurrent workflows, manual recovery requires custom infrastructure. True durable execution (Temporal/Inngest-style) automatically detects failure, resumes indefinitely, and restores local variables through replay — with zero recovery code in application logic.

LangGraph is still the right choice for most production agent systems. But when you need guaranteed completion semantics, combine it with a durable execution layer or Bedrock AgentCore Runtime.
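To make the gap concrete, this is the kind of external watchdog that checkpointing alone leaves you to build yourself. It is a sketch under stated assumptions: workflows are assumed to write a heartbeat timestamp alongside each checkpoint, and `find_stalled`, the heartbeat mapping, and the 60-second threshold are all hypothetical. None of this is LangGraph, Temporal, or Inngest API.

```python
from typing import Dict, List


def find_stalled(
    heartbeats: Dict[str, float],
    now: float,
    stale_after_s: float = 60.0,
) -> List[str]:
    """Return IDs of workflows whose last heartbeat is older than `stale_after_s`.

    `heartbeats` maps workflow ID -> Unix timestamp of the last checkpoint write.
    """
    return sorted(
        workflow_id
        for workflow_id, last_seen in heartbeats.items()
        if now - last_seen > stale_after_s
    )


if __name__ == "__main__":
    import time

    now = time.time()
    print(find_stalled({"wf-1": now - 5, "wf-2": now - 600}, now))
```

Detection is the easy half. Resuming each stalled workflow from its last checkpoint, safely and exactly once, is the part a durable execution engine handles for you.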

Prompt injection propagates

In a test of 17 LLMs, 82.4% executed malicious tool calls when the request came from a "peer agent", versus 41.2% for direct user injection. In a multi-agent system, one agent's output containing the injected payload becomes the next agent's trusted input. One injection can propagate to 48% of co-running agents.

EchoLeak (CVE-2025-32711, CVSS 9.3, June 2025): zero-click prompt injection in Microsoft 365 Copilot via a malicious email parsed by the agent's RAG step. The payload didn't need to trick a human — only the retrieval system.

# wrong: trust inter-agent messages
response = downstream_agent.call(upstream_agent.output)

# right: sanitize at every boundary
sanitized = strip_injection_patterns(upstream_agent.output)
response = downstream_agent.call(sanitized)

Principle of least agency (OWASP Agentic Applications Top 10, Dec 2025): autonomy bounded by default, not granted. Require human-in-the-loop approval for high-impact actions: file writes, payments, external API calls with side effects.
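An approval gate for high-impact actions can be very small. The sketch below is illustrative only: the tool names in `HIGH_IMPACT`, the `approve` callback, and `execute_tool` are assumptions rather than any framework's API.

```python
from typing import Any, Callable, Dict

# Hypothetical tool names; classify your own tools explicitly.
HIGH_IMPACT = frozenset({"file_write", "payment", "external_api_call"})


class ApprovalDenied(Exception):
    """Raised when a human reviewer rejects a high-impact action."""


def execute_tool(
    name: str,
    args: Dict[str, Any],
    tools: Dict[str, Callable[..., Any]],
    approve: Callable[[str, Dict[str, Any]], bool],
) -> Any:
    """Run a tool, asking `approve` first if the tool is high-impact."""
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")  # deny by default
    if name in HIGH_IMPACT and not approve(name, args):
        raise ApprovalDenied(f"{name} rejected by reviewer")
    return tools[name](**args)
```

In production `approve` would block on a ticket, chat prompt, or review queue. Here it is just a callable so the gate stays testable.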

What to trace

Distributed tracing cuts MTTR by 70% compared with log-based approaches.

Per-node minimum:

Token counts, latency, and state before and after each checkpoint.
Tool call inputs and outputs (not just "tool was called").
Agent-to-agent message payloads (this is the injection surface).
Context window utilization per agent, for starvation detection when one agent consumes 80% of available context.
Retry counts and which nodes failed.
Correlation IDs in every message for cross-agent trace linking.
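The per-node fields above map naturally onto a small record type. This is a sketch, not a tracing SDK: `NodeTrace`, its field names, and `context_hogs` are hypothetical, and a real system would emit these as span attributes through its tracing library.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class NodeTrace:
    node: str
    correlation_id: str            # shared by every message in one request
    input_tokens: int
    output_tokens: int
    latency_ms: float
    context_limit: int             # model context window, in tokens
    retries: int = 0
    tool_calls: List[Dict[str, Any]] = field(default_factory=list)  # inputs and outputs

    @property
    def context_utilization(self) -> float:
        """Fraction of the context window this node consumed."""
        return (self.input_tokens + self.output_tokens) / self.context_limit


def context_hogs(traces: List[NodeTrace], threshold: float = 0.8) -> List[str]:
    """Names of nodes at or above the utilization threshold (starvation risk)."""
    return [t.node for t in traces if t.context_utilization >= threshold]


if __name__ == "__main__":
    cid = str(uuid.uuid4())
    traces = [
        NodeTrace("planner", cid, 2_000, 500, 820.0, 128_000),
        NodeTrace("researcher", cid, 100_000, 6_000, 9_400.0, 128_000),
    ]
    print(context_hogs(traces))
```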

Langfuse (MIT since June 2025)  → multi-framework, OTel-native, low vendor lock-in
LangSmith                       → deepest LangGraph integration; 10x more expensive for 1-year retention
Arize Phoenix (self-hosted)     → ML-grade rigor, data residency requirements

The AWS event-driven stack

EventBridge (event routing)
  → Lambda (stateless agents, auto-scale, pay per invocation)
  → Fargate (stateful agents, long-running, persistent compute)
  ↓
Step Functions (multi-step pipelines, built-in retry + exponential backoff)
DynamoDB (LangGraph checkpoint backend)
SQS FIFO (exactly-once semantics between agents)
AgentCore Runtime (Bedrock managed execution environment)
  ↓
X-Ray / Langfuse (distributed tracing across agent boundaries)
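For intuition about the retry behavior in the stack above, the following sketch computes an exponential backoff schedule. The parameter names mirror the `IntervalSeconds`, `MaxAttempts`, and `BackoffRate` fields of a Step Functions `Retry` block, but the function itself is a hypothetical illustration of the usual formula, not AWS code, and it ignores delay caps and jitter.

```python
from typing import List


def retry_delays(
    interval_s: float = 1.0,
    max_attempts: int = 3,
    backoff_rate: float = 2.0,
) -> List[float]:
    """Delay before each retry attempt: interval * rate^attempt_index."""
    return [interval_s * backoff_rate ** attempt for attempt in range(max_attempts)]


if __name__ == "__main__":
    print(retry_delays())            # three retries, doubling each time
    print(sum(retry_delays(2.0, 5))) # total time spent waiting across five retries
```

Summing the schedule is a quick way to check that the total retry budget fits inside the wall-clock timeout of the calling agent.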

Stateless fan-out: EventBridge → SNS → N Lambda functions (one per agent), results aggregated in DynamoDB. Costs effectively zero at idle.

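For exactly-once processing of agent-to-agent messages, use SQS FIFO queues. They deduplicate sends within a 5-minute window and preserve order within a message group, which removes most of the duplicate execution and state corruption that race conditions introduce as agent count grows (N(N-1)/2 potential interaction pairs at N agents). A consumer that lets the visibility timeout lapse can still receive a message twice, so keep handlers idempotent.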
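Two small pieces make this concrete: the pair count, and a consumer-side guard that is commonly paired with queue-level deduplication. This is a sketch with hypothetical names (`interaction_pairs`, `IdempotentHandler`). The in-memory set stands in for a durable store such as a database table, and no AWS SDK calls are shown.

```python
from typing import Callable, Set


def interaction_pairs(n_agents: int) -> int:
    """Number of distinct agent pairs: N(N-1)/2."""
    return n_agents * (n_agents - 1) // 2


class IdempotentHandler:
    """Wraps a message handler so each message ID is processed at most once."""

    def __init__(self, handle: Callable[[str], None]) -> None:
        self._handle = handle
        self._seen: Set[str] = set()  # assumption: replace with durable storage

    def __call__(self, message_id: str, body: str) -> bool:
        if message_id in self._seen:
            return False              # duplicate delivery, skip
        self._handle(body)
        self._seen.add(message_id)    # record only after successful handling
        return True


if __name__ == "__main__":
    for n in (3, 10, 25):
        print(n, interaction_pairs(n))
```

Recording the ID only after the handler succeeds means a crash mid-handling leads to a retry rather than a silently dropped message.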