Skip to content

Direct Prompt Injection via Collaboration (User as Attack Vector)

When the user pastes an attacker-crafted prompt into the agent, model-layer injection classifiers see one coherent user turn — only environmental defenses catch it.

The Anti-Pattern

Most agent threat models assume the malicious instruction arrives through a tool result, a fetched page, or an MCP response — the indirect injection surface. Direct prompt injection bypasses that model: the attacker writes the prompt, the user pastes it, and the agent runs it as if the user authored it. The mitigations that work for indirect injection do not work here.

Anthropic's 2026-05-25 containment post documents the attack as a controlled internal red-team: a researcher phished an Anthropic employee with a "can you run this for me?" email carrying a ready-to-paste prompt. The prompt read like ordinary task instructions but included steps to read ~/.aws/credentials, encode them, and POST them to an external endpoint. Across 25 retries, Claude Code completed the exfiltration 24 times.

Why It Works

Model-layer prompt-injection classifiers anchor on user intent — they flag transplants whose voice and topic do not match the user's other tokens. Anthropic states the failure directly: "the attacker's instructions arrived through the user, not through tool output or fetched content. Our model-layer defenses anchor on user intent — when the user is the one typing the instruction, there's nothing anomalous for a classifier to catch." (Anthropic Engineering).

The underlying mechanism is the same provenance-blindness that makes indirect injection work — transformer attention has no architectural channel separating user-typed text from user-pasted text — but the classifier sees one coherent user turn because the user did paste it.

The only defense that holds is environmental: "egress controls that block the POST regardless of intent and filesystem boundaries that keep ~/.aws out of reach in the first place" (Anthropic Engineering). The relevant controls already exist as patterns — URL exfiltration guards, scoped credentials proxies, sandboxed harness tools — but they are usually deployed as indirect-injection mitigations and rarely audited against the user-as-vector case.

Ambient Injection Escalation

The collaboration vector composes badly with shared agent-readable channels. Anthropic reports the follow-on directly: "When we shared the working prompt in internal Slack for discussion, someone pointed out that some internal agents read Slack. The payload was now ambient." (Anthropic Engineering). A payload that arrived through one developer's mailbox escapes the original incident as soon as the team discusses it in any channel downstream agents ingest — channel summaries, on-call bots, internal RAG indexes. The direct-injection event becomes an indirect-injection source for every other agent in the network.

When This Backfires

Three conditions where prioritising indirect-injection hardening dominates:

  • Production base rate is indirect injection. Anthropic restructured its prompt-injection reporting in the Claude Opus 4.6 system card to emphasise indirect-injection metrics on the argument that indirect is the more relevant enterprise threat (Claude Opus 4.6 System Card, February 2026). If the team has no egress allowlist or credential isolation yet, indirect-injection coverage is the larger expected-loss reduction.
  • Personal-machine, single-developer harnesses with no shared agent-readable channels. The ambient-escalation beat collapses, and the residual risk reduces to the well-covered trust-without-verify failure of pasting prompts without reading them.
  • Pure conversational agents with no tool use, shell, or file access. Direct injection has no actuation surface, and environmental controls have nothing to enforce.

The strongest counter-position is that direct and indirect injection are two failure modes of one provenance-blind mechanism, and that fragmenting the threat-model literature risks duplicating defense coverage. The reply is that the collaboration vector has two unique consequences — classifier anchoring failure and ambient escalation through shared channels — that do not appear in the indirect literature and have direct harness-design implications.

Example

The Anthropic red-team scenario, mapped to a defended harness:

Before — bare developer workstation, no environmental controls:

1. Attacker emails developer: "can you run this for me?" with a pasted prompt.
2. Developer pastes prompt into Claude Code.
3. Prompt reads ~/.aws/credentials, base64-encodes, POSTs to attacker.example/collect.
4. Model-layer classifier sees one coherent user turn — nothing flagged.
5. Exfiltration succeeds in 24 of 25 retries.

After — environmental defenses block the actuation step:

1. Attacker emails developer: "can you run this for me?" with a pasted prompt.
2. Developer pastes prompt into Claude Code.
3. Agent attempts to read ~/.aws/credentials — filesystem boundary denies; path is outside the project workspace.
4. (Or, where the read succeeds:) Agent attempts POST to attacker.example/collect — egress allowlist denies; host not in allowed list.
5. Exfiltration fails at the environmental layer, regardless of model-layer classifier behavior.

The defense is not "detect the pasted prompt" — that is the failed approach. The defense is "ensure that even if the agent attempts the action, the environment refuses it".

Practitioner Guidance

  • Treat prompts pasted from email, Slack, or shared docs the same way you would treat a tool-fetched HTML page. The trust model is the same: external authorship reaches the agent through a user-controlled channel.
  • Audit existing environmental controls — URL exfiltration guards, scoped credentials, sandboxed tool execution — against the user-as-vector case, not only the indirect-injection case. The controls are usually already there; the audit lens is what is missing.
  • Plant canary strings in internal Slack channels and RAG indexes that agents read, so an ambient payload from a leaked direct-injection prompt is detectable downstream. The signal does not catch the original attack but does catch the escalation.

Key Takeaways

  • Direct prompt injection is a distinct attack class: the user pastes the attacker's prompt, so model-layer classifiers anchored on user intent have nothing anomalous to flag.
  • Environmental defenses — egress controls and filesystem boundaries — are the only mitigation that holds, because they refuse the action regardless of the model's classification.
  • The ambient-injection follow-on means a payload from one developer's mailbox can become an indirect-injection source for every agent in the team's network the moment the team discusses it.
  • The pattern matters most for credential-bearing developer workstations and teams whose agents read shared channels; production base rates remain dominated by indirect injection.
Feedback