Three Governance Gaps Nobody Instruments in Multi-Agent Systems
Three independent teams arrived at the same conclusion this week: multi-agent systems fail silently because nobody instruments delegation, escalation, or reputation. Here are the practical instrumentation points.

Three independent teams published this week about the same problem. Different frameworks, different use cases, different communities. Identical conclusions.
Waxell published a delegation governance framework for autonomous agent networks. Marques and Asqav released findings on escalation failures in production multi-agent deployments. And levelsofself published “The Nervous System,” a design pattern for behavioral reputation across agent clusters.
None of them cited each other. They all arrived at the same three blind spots.
Multi-agent systems fail silently — not because individual agents are broken, but because nobody instruments the governance layer between them. The agent runtime gets monitored. The tool calls get logged. The LLM responses get evaluated. But the decisions about who delegates to whom, when to escalate, and which agents have earned trust happen in a void. No logs. No policies. No history.
These are not edge cases. These are the load-bearing walls of any system where agents coordinate. And in most production deployments, they are invisible.
Gap 1: Delegation Without Audit
When Agent A delegates a task to Agent B, something consequential just happened. A scope of authority was transferred. A chain of responsibility was extended. And in most multi-agent systems, there is no record that it occurred.
This is the finding Waxell’s framework addresses directly. In their analysis, delegation is the single most common governance action in multi-agent workflows — and the one with the least instrumentation. The typical pattern: an orchestrator agent decomposes a task, assigns subtasks to worker agents, and the system proceeds. If a worker agent makes a bad decision three hops down the chain, there is no trail back to the delegation that authorized it.
The failure mode is not hypothetical. Consider a research agent that delegates summarization to a downstream agent, which in turn delegates fact-checking to a third. The fact-checker hallucinates a citation. The summarizer passes it through. The research agent publishes it. Post-mortem, you can see the bad output. You cannot see who delegated what to whom, under what constraints, with what authority.
The instrumentation point: Every delegation event needs a durable, timestamped record. Minimum fields:
- Delegating agent identity — not just a name, a verifiable identifier
- Receiving agent identity — same standard
- Scope of delegation — what the receiving agent is authorized to do
- Constraints — what the receiving agent is explicitly not authorized to do
- Parent delegation ID — linking to the chain that authorized this delegation
- Expiration — when the delegation authority expires if not explicitly revoked
This is not a nice-to-have observability feature. This is the difference between a system you can debug and one you cannot. When something goes wrong three delegation hops deep, you need to reconstruct the chain. Without these records, reconstruction is impossible. You are left staring at agent outputs and guessing.
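The record above can be sketched concretely. This is a minimal illustration, not any of the cited frameworks' schemas; names like `DelegationRecord` and `reconstruct_chain` are mine:

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass(frozen=True)
class DelegationRecord:
    """Durable, timestamped record of a single delegation event."""
    delegating_agent: str              # a verifiable identifier, not a display name
    receiving_agent: str               # same standard
    scope: list[str]                   # what the receiver is authorized to do
    constraints: list[str]             # what the receiver is explicitly NOT authorized to do
    parent_id: Optional[str]           # links to the delegation that authorized this one
    expires_at: datetime               # authority lapses here if not explicitly revoked
    delegation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

def reconstruct_chain(log: dict[str, DelegationRecord],
                      leaf_id: str) -> list[DelegationRecord]:
    """Walk parent links from a leaf delegation back to the root.

    This is the post-mortem operation: given the delegation three hops
    deep that produced a bad output, recover who authorized it and
    under what constraints.
    """
    chain: list[DelegationRecord] = []
    current = log.get(leaf_id)
    while current is not None:
        chain.append(current)
        current = log.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))  # root delegation first
```

The parent ID is what makes the chain reconstructable; without it you have a pile of disconnected events instead of an audit trail.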
Gap 2: Escalation Without Policy
When an agent encounters a situation beyond its defined scope, what happens next defines the reliability of your entire system. In most deployments, the answer is: nothing good.
Marques and Asqav documented this pattern across multiple production environments. An agent hits an ambiguous situation — conflicting instructions, insufficient context, a tool call that returns unexpected results. It has three options: fail silently, hallucinate a best guess, or escalate. The first two are obvious failure modes. But the third fails almost as badly, because escalation without policy is just noise.
An agent that escalates to a human operator with a raw “I don’t know what to do” message — no context, no decision history, no summary of what was attempted — has not escalated. It has abdicated. The human operator now has to reconstruct the full context from scratch, which typically takes longer than doing the task manually would have.
The gap is not the absence of an escalation mechanism. Most frameworks have some way for an agent to call out to a human or a supervisor agent. The gap is the absence of policy: structured rules about when to escalate, to whom, with what context, and what happens to the in-flight work while the escalation is pending.
The instrumentation point: Explicit escalation policies per agent role. Each policy defines:
- Trigger conditions — specific situations that require escalation (confidence below threshold, tool errors, conflicting instructions, scope boundaries)
- Escalation target — who receives the escalation, with fallback chain if the primary target is unavailable
- Context package — a structured handoff message that includes the original task, all actions taken so far, the specific point of ambiguity, and the agent’s assessment of options
- Work state preservation — what happens to in-flight tasks during escalation (pause, continue with degraded capability, hand off completely)
- Timeout behavior — what the agent does if escalation receives no response within a defined window
The critical detail is the context package. An escalation without context is a ticket with no description. It creates work instead of resolving it. The context package must be generated automatically from the agent’s action history — not composed by the agent as a free-text explanation, which introduces the same hallucination risk you are trying to escalate away from.
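A policy of this shape can be sketched in a few lines. The thresholds, target names, and field layout here are illustrative assumptions, not a standard; the one deliberate design choice is that `build_context_package` assembles the handoff mechanically from the logged history rather than asking the agent to explain itself in free text:

```python
from dataclasses import dataclass, field

@dataclass
class EscalationPolicy:
    """Per-role escalation policy: triggers, targets, timeout behavior."""
    confidence_threshold: float = 0.6
    targets: list[str] = field(
        default_factory=lambda: ["supervisor_agent", "human_operator"]  # primary, then fallback
    )
    timeout_seconds: int = 300
    on_timeout: str = "pause"          # pause | degrade | handoff

def should_escalate(policy: EscalationPolicy, *, confidence: float,
                    tool_error: bool = False,
                    conflicting_instructions: bool = False,
                    out_of_scope: bool = False) -> bool:
    """Any defined trigger condition forces an escalation."""
    return (confidence < policy.confidence_threshold
            or tool_error or conflicting_instructions or out_of_scope)

def build_context_package(task: str, action_log: list[dict],
                          ambiguity: str, options: list[str]) -> dict:
    """Assemble the structured handoff from the recorded action history --
    never from a free-text explanation composed by the agent itself."""
    return {
        "original_task": task,
        "actions_taken": list(action_log),     # verbatim from the action log
        "point_of_ambiguity": ambiguity,
        "assessed_options": options,
    }
```

An operator receiving that package gets the task, the history, and the specific decision point in one message, instead of reconstructing all three from scratch.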
Gap 3: Reputation Without History
A fresh agent and a battle-tested agent walk into the same system. They get the same trust level. The same permissions. The same delegation authority.
This is the problem levelsofself’s “Nervous System” pattern addresses. In their model, agents have no behavioral track record. Static credentials determine access — an API key, a role label, a capability declaration. Nothing in the system reflects whether this agent has completed 500 tasks with a 99% success rate or was deployed five minutes ago and has never executed a single action.
The consequence is that trust decisions are binary: the agent either has the credential or it does not. There is no gradient. A high-performing agent that has reliably completed financial analysis for six months gets the same authority as a new agent that just joined the cluster. A previously reliable agent that has started producing degraded outputs — maybe a model regression, maybe prompt drift — continues to receive the same delegations because nothing is tracking its behavioral trajectory.
The instrumentation point: Behavioral scoring based on observable action history. The minimum viable reputation system tracks:
- Completion rate — percentage of delegated tasks completed successfully
- Error rate — frequency and severity of failures, categorized by type
- Escalation rate — how often the agent escalates relative to task volume (high escalation is not necessarily bad; it depends on the role)
- Latency profile — task completion time distribution, with drift detection
- Downstream impact — when this agent’s output is consumed by other agents, how often it causes downstream failures
These scores should be observable, not just logged. Other agents — and orchestrators making delegation decisions — should be able to query an agent’s reputation before assigning work. An orchestrator that delegates a critical task to an agent with a declining completion rate is making a governance decision. It should be an informed one.
Reputation scoring is not about punishing agents. It is about making delegation decisions visible and defensible. When a post-mortem asks “why was this critical task assigned to an unreliable agent,” the answer should be data, not a shrug.
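A minimum viable version of this is small. The sketch below assumes a fixed-size rolling window and a three-way outcome label; `AgentReputation` and `pick_agent` are illustrative names, not part of the "Nervous System" pattern or any framework API:

```python
from collections import deque

class AgentReputation:
    """Rolling-window behavioral score, queryable at delegation time."""

    def __init__(self, window: int = 100):
        # Rolling window, not a lifetime aggregate: old behavior ages out.
        self._outcomes: deque[str] = deque(maxlen=window)

    def record(self, outcome: str) -> None:
        """Record one task outcome: 'success' | 'error' | 'escalated'."""
        self._outcomes.append(outcome)

    def _rate(self, kind: str) -> float:
        if not self._outcomes:
            return 0.0
        return sum(1 for o in self._outcomes if o == kind) / len(self._outcomes)

    @property
    def completion_rate(self) -> float:
        return self._rate("success")

    @property
    def escalation_rate(self) -> float:
        return self._rate("escalated")

def pick_agent(candidates: dict[str, AgentReputation]) -> str:
    """A delegation decision informed by behavioral history, not just
    credentials: prefer the candidate with the best completion rate."""
    return max(candidates, key=lambda name: candidates[name].completion_rate)
```

This is the gradient the static-credential model lacks: the orchestrator can now distinguish the agent with 99 successes from the one deployed five minutes ago.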
Why These Gaps Exist
The tooling was designed for single-agent architectures.
MCP defines a protocol for tool calling — how an agent discovers and invokes tools. It does this well. A2A defines agent-to-agent transport — how agents discover each other and exchange messages. It does this adequately. Neither protocol defines governance primitives for delegation chains, escalation policies, or behavioral reputation.
This is not a criticism. Protocols should be narrow and composable. But the result is that teams building multi-agent systems assemble MCP for tool access and A2A for communication and then discover that the governance layer — the part that determines whether the system is trustworthy — is entirely their problem. It is not covered by any protocol. It is not provided by any framework. It is infrastructure that every team must build from scratch, and most teams defer it because it is not on the critical path to a demo.
It is, however, on the critical path to production. Systems without delegation audit cannot be debugged. Systems without escalation policy generate noise instead of resolving ambiguity. Systems without reputation tracking cannot make informed trust decisions. These are not features. They are prerequisites.
What To Instrument
If you are building or operating a multi-agent system, here is the practical checklist.
Delegation audit:
- Log every delegation event with delegating agent, receiving agent, scope, constraints, and timestamp
- Assign a unique delegation ID and link it to the parent chain
- Make delegation logs searchable by agent, by time range, and by delegation chain
- Set expiration on all delegations — no open-ended authority transfers
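The checklist above implies specific query paths. A minimal in-memory sketch (a production system would back this with a real store; record fields and class names here are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

class DelegationLog:
    """Append-only delegation log, searchable by agent and by time range,
    with expiration enforced as a query-time filter."""

    def __init__(self):
        self._records: list[dict] = []                 # ordered by append time
        self._by_agent: dict[str, list[dict]] = defaultdict(list)

    def append(self, rec: dict) -> None:
        self._records.append(rec)
        # Index both sides of the delegation so either party is searchable.
        self._by_agent[rec["delegating_agent"]].append(rec)
        self._by_agent[rec["receiving_agent"]].append(rec)

    def by_agent(self, agent: str) -> list[dict]:
        return list(self._by_agent[agent])

    def by_time_range(self, start: datetime, end: datetime) -> list[dict]:
        return [r for r in self._records if start <= r["created_at"] <= end]

    def active(self, now: datetime) -> list[dict]:
        """Delegations whose authority has not yet expired -- the
        'no open-ended authority transfers' rule made checkable."""
        return [r for r in self._records if r["expires_at"] > now]
```

Because every record carries an expiration, "which authority transfers are live right now" becomes a one-line query instead of an archaeology project.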
Escalation policy:
- Define escalation triggers per agent role before deployment
- Require structured context packages on every escalation — no free-text-only handoffs
- Implement fallback chains for escalation targets
- Define timeout behavior explicitly — “wait forever” is not a policy
Reputation tracking:
- Track completion rate, error rate, and escalation rate per agent
- Compute behavioral scores on a rolling window, not lifetime aggregate
- Expose reputation data to orchestrators at delegation decision time
- Implement drift detection — flag agents whose performance is declining before they fail
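The drift-detection item can be as simple as comparing a recent window against an earlier baseline. The window sizes and threshold below are illustrative defaults, not recommendations from any of the cited work:

```python
def detect_drift(outcomes: list[bool],
                 baseline_n: int = 50,
                 recent_n: int = 20,
                 max_drop: float = 0.15) -> bool:
    """Flag a declining agent before it fails outright.

    Compares the success rate of the most recent `recent_n` tasks against
    the `baseline_n` tasks before them; a drop larger than `max_drop`
    is flagged as drift.
    """
    if len(outcomes) < baseline_n + recent_n:
        return False  # not enough history to judge
    baseline = outcomes[-(baseline_n + recent_n):-recent_n]
    recent = outcomes[-recent_n:]
    base_rate = sum(baseline) / len(baseline)
    recent_rate = sum(recent) / len(recent)
    return base_rate - recent_rate > max_drop
```

This catches the previously reliable agent whose outputs have started degrading, which a lifetime aggregate would average away.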
Cross-cutting:
- Make all governance data searchable from a single interface
- Retain governance logs longer than operational logs — governance data is your audit trail
- Test governance instrumentation the same way you test business logic — if the delegation log is empty, something is broken
None of this requires a new protocol. All of it requires treating governance as infrastructure rather than an afterthought. Three independent teams arrived at this conclusion this week. The pattern is clear enough. The question is whether the rest of the ecosystem catches up before the first major multi-agent governance failure makes the decision for them.