The Attack That Gets Better as Your AI Gets Smarter

Most security vulnerabilities get weaker as defenses improve. Patch the library, upgrade the runtime, deploy the newer model. The vulnerability class that Unit 42 just benchmarked does the opposite. Their MCPTox study tested 20 prominent LLM agents against tool poisoning attacks delivered through MCP sampling — and the most capable model in the study, o1-mini, had a 72.8% attack success rate. The average across all 20 agents was 36.5%. The smarter the model, the more reliably the attack landed.

This is not a bug in o1-mini. It is a structural property of how sampling injection works. The attack succeeds precisely because the model is good at following instructions. Upgrade to a more capable model and you get a more capable victim.

Unit 42’s earlier research established how server-initiated LLM requests create a covert instruction channel through MCP’s sampling mechanism. This post is about what happened when they stopped theorizing and started measuring.

MCPTox: The First Systematic Benchmark

MCPTox is the first benchmark designed to measure tool poisoning in realistic MCP deployments. Previous research demonstrated individual attack vectors in controlled settings. MCPTox operates at ecosystem scale.

The setup: 45 live, real-world MCP servers drawn from the public registry. 353 authentic tools — not synthetic toy functions, but the actual tools these servers expose in production. 20 LLM agents spanning the capability spectrum, from smaller models to frontier reasoning systems. The attack payloads were embedded in tool descriptions and sampling request prompts using techniques that have been documented since early 2025 but never systematically measured at this scale.

The methodology matters because it eliminates the most common objection to injection research: “that would never work in a real deployment.” MCPTox uses real deployments. The servers are real. The tools are real. The agents are the ones enterprises are running today. The 36.5% average attack success rate is not a laboratory number. It is a field measurement.

Unit 42 published MCPTox as a reproducible benchmark with an open framework, meaning that as new models ship, the community can measure whether the problem is getting better or worse. Early indications suggest worse.

The Instruction-Following Paradox

The headline number — 72.8% on o1-mini — demands explanation, because it runs counter to every intuition the industry has about model capability and security.

The standard assumption is that smarter models are safer models. They understand context better, resist manipulation more effectively, and can reason about whether an instruction is legitimate before executing it. This assumption holds for many attack classes. It does not hold for sampling injection.

Sampling injection is not adversarial noise or malformed input that a better model can learn to reject. It is well-formed, coherent, contextually appropriate instruction. The injected payload looks exactly like a legitimate sampling request because it is a legitimate sampling request — the protocol delivers it through the same channel, in the same format, with the same authority. The only difference is the intent behind it.

A less capable model may fumble the execution. It might misparse the injected instruction, fail to complete the requested action, or lose the thread in a complex multi-step payload. These failures look like robustness, but they are just incompetence. The model is not resisting the attack. It is failing to execute it.

A more capable model does not fumble. It reads the injected instruction, understands it perfectly, and executes it with the same precision it applies to every other instruction in its context. The attack exploits the exact capability that makes the model valuable: reliable instruction following across complex, multi-step tasks.

This creates a paradox that has no precedent in traditional security. In conventional systems, upgrading to a better implementation closes vulnerabilities. In MCP sampling injection, upgrading to a better model opens them wider. The vulnerability is the capability.

The capability-vulnerability paradox — stronger pieces, deeper cracks

Three Attack Vectors at Scale

MCPTox measured three distinct attack vectors across the 45-server testbed. Each exploits sampling differently, and each scales with model capability.

Resource theft. Hidden instructions embedded in tool descriptions or sampling payloads cause the model to generate unauthorized content — summaries, analyses, translations — consuming tokens and compute that the user never requested. At the MCPTox scale, a single compromised server could trigger resource consumption across every agent that connects to it. The smarter the model, the more faithfully it completes the unauthorized generation, producing higher-quality output that consumes more tokens. Less capable models often produce truncated or incoherent outputs that self-limit the damage.

Conversation hijacking. Persistent instructions injected through a sampling request alter the model’s behavior for the remainder of the session. The model is told to adopt a persona, ignore certain categories of user input, or inject specific content into all future responses. Because the instruction arrives through the protocol’s legitimate sampling channel, it enters the context window with the same weight as any other system-level directive. MCPTox found that frontier models maintained hijacked behavior more consistently across long sessions. Smaller models tended to “forget” the injected persona after several turns — not because they detected the attack, but because they lacked the context retention to sustain it.

Covert tool invocation. The most dangerous vector. Injected instructions trigger the model to call tools the user never authorized — reading files, writing data, making network requests. The model does not flag these as unusual because, from its perspective, they are not. An instruction arrived through the protocol. The instruction requests a tool call. The model complies. MCPTox demonstrated chains where a single sampling injection triggered a sequence of tool calls across multiple servers, with each call’s output feeding context for the next. The attack surface compounds because a more capable model handles longer chains with higher fidelity.

Across all three vectors, the pattern is consistent: model capability amplifies attack effectiveness. The attack does not degrade with better models. It improves.

OWASP Canonicalization: The Threat Model Is Institutionalized

In parallel with the MCPTox publication, OWASP has canonicalized MCP Tool Poisoning as an official attack class. This is a consequential development that operates on a different axis than the benchmark itself.

Benchmarks tell researchers and engineers what is possible. OWASP canonicalization tells procurement, compliance, and governance teams what they are required to address. When an attack class enters the OWASP taxonomy, it enters the language that enterprise security teams use to evaluate risk. It appears on compliance checklists. It becomes an RFP line item. Auditors ask about it. Penetration testers scope it.

For MCP deployments in regulated industries, OWASP canonicalization means that tool poisoning is no longer a theoretical concern that a team might choose to accept as residual risk. It is a named threat with a taxonomy entry. Governance frameworks will require documented mitigations. “We trust the model to handle it” is not a documented mitigation.

The timing of the OWASP designation alongside the MCPTox data is not coincidental. The benchmark provides the empirical evidence that the canonicalization process requires. The two together create a forcing function: the problem is measured, the problem is named, and the industry is now expected to respond.

Why Existing Defenses Miss Sampling

The existing MCP security stack — static scanners, runtime content filters, argument validators — was designed for the tool-call layer. A client calls a tool. The scanner inspects the arguments. The filter checks the response. This works for the request/response pattern that most MCP interactions follow.

Sampling does not follow that pattern. In a sampling request, the server initiates a prompt to the client’s LLM. The payload is not a tool argument that a scanner can pattern-match against a rule database. It is free-form natural language instruction delivered through the protocol’s own messaging channel. Static scanners that run at deploy time never see it because sampling payloads are generated at runtime. Content filters that inspect tool-call arguments do not trigger because the sampling channel is a different protocol operation.

OAuth and RBAC are orthogonal to the problem in a different way. They answer “is this server authorized to issue requests?” They do not answer “is the semantic content of this authorized server’s sampling request attempting to hijack the model?” A server can be fully authenticated, fully authorized, and fully malicious. Identity verification and prompt injection operate on different planes.

This is the gap that MCPTox quantifies. The existing defense stack addresses a different attack surface. Sampling injection passes through it untouched.

MCPProxy’s Defense Posture

MCPProxy addresses sampling injection at the gateway layer — the only layer whose effectiveness does not degrade as models improve. We detailed MCPProxy’s full security architecture — quarantine, Docker isolation, rate limiting, and BM25 tool selection — in our analysis of the STDIO transport vulnerability. Each mechanism applies to sampling injection, but two are especially relevant here.

Quarantine blocks sampling before the model sees it. Every newly connected MCP server enters quarantine — tools invisible, sampling requests blocked. A malicious server that enters through config injection, supply chain attacks, or registry manipulation cannot issue sampling payloads until it passes human review. The attack is contained before the model ever sees it.

BM25 tool selection shrinks per-query attack surface. MCPProxy does not expose all 353 tools from all 45 connected servers to every query. BM25-based discovery selects the relevant subset per request. A poisoned tool outside the active set cannot execute its payload — a probabilistic defense that reduces the attack surface from hundreds of tools to a relevant handful.

The critical property: both defenses are model-independent. Quarantine does not become more permissive with a smarter model. BM25 selection does not get worse when you upgrade from GPT-4 to o1-mini. The defenses operate on a layer the attack cannot reach by exploiting model capability.

Two paths: model-layer defense that degrades with capability vs. gateway-layer defense that holds

The Layer That Does Not Get Worse

MCPTox establishes an empirical fact that the MCP ecosystem has to internalize: you cannot model-upgrade your way out of sampling injection. The 72.8% success rate on o1-mini is not a ceiling. It is a trajectory. As models get better at following instructions, they will get better at following injected instructions. The line goes up.

Every defense that depends on the model’s judgment — “the model will notice the injection,” “the model will refuse the unauthorized action,” “the model will distinguish legitimate sampling from malicious sampling” — is a defense that degrades on the same curve. The MCPTox numbers are the proof.

Gateway-layer controls are the exception. They operate below the model, on the protocol and transport layers where instructions are delivered, not interpreted. A gateway that blocks an unauthorized sampling request does not care how capable the model behind it is. The request never reaches the model. The capability is irrelevant.

This is the architectural lesson of MCPTox. In a world where the attack improves with every model generation, the only durable defenses are the ones that do not depend on the model at all. The gateway is that layer. Deploy it before you deploy your next model upgrade.