Foresight-Guided Defense Against Infectious Jailbreaks in Multi-Agent Systems

Per-agent foresight simulation detects the diversity collapse that signals an infectious jailbreak, then surgically removes the contaminated retrieval entry without homogenizing healthy agent behavior.

The threat model

Infectious jailbreak is a propagation attack on multi-agent systems with shared multimodal retrieval. Gu et al. (ICML 2024) showed that a single adversarial image inserted into one agent's memory spreads exponentially through randomized pair-wise chat. Up to one million LLaVA-1.5 agents were compromised without further attacker action. The contagion channel is retrieval: neighbours pull poisoned entries during routine inter-agent communication.
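To make the exponential claim concrete, here is a minimal toy simulation of randomized pairwise chat. It is not Gu et al.'s experimental setup: the function name `simulate_spread`, the assumption that infection always transmits in both directions within a pair, and the agent count are all simplifications chosen for illustration.

```python
import random


def simulate_spread(num_agents=1024, rounds=30, seed=0):
    """Return the number of infected agents after each chat round.

    Toy model (assumption): agents are paired uniformly at random each
    round, and if either partner is infected, both are infected afterwards.
    """
    rng = random.Random(seed)
    infected = [False] * num_agents
    infected[0] = True  # a single poisoned agent at round 0
    history = [1]
    for _ in range(rounds):
        order = list(range(num_agents))
        rng.shuffle(order)
        # Pair neighbours in the shuffled order: (0,1), (2,3), ...
        for a, b in zip(order[0::2], order[1::2]):
            if infected[a] or infected[b]:
                infected[a] = infected[b] = True
        history.append(sum(infected))
    return history


if __name__ == "__main__":
    for round_no, count in enumerate(simulate_spread()):
        print(round_no, count)
```

Under these assumptions the infected count can at most double per round, so the early rounds show the exponential phase and the later rounds saturate as uninfected partners become rare.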

The attack needs three preconditions:

Agents share a retrieval pool that crosses agent boundaries
Inter-agent communication includes content drawn from that pool
The pool accepts multimodal input

Production coding-agent topologies with isolated sub-agent context usually do not meet these preconditions. The defense described here targets retrieval-coupled multimodal MAS, not generic agent fleets.

Why global cure-factor defenses fail

First-generation defenses, for example Cowpox, train a more contagious counter-stimulus — a "cure factor" — that wins the retrieval race against the virus adversarial example (VirAE). Ma, Yang, and Zhang (2026) identify the structural cost: a globally-shared cure homogenizes responses across agents and destroys the diversity that justified multi-agent deployment.

The foresight-guided local purification mechanism

FLP (Ma et al., 2026) is training-free and works per-agent. Each agent simulates its near-future behavior across several personas. It then treats a collapse in the expected diversity of those simulated outputs as an infection signal.

graph TD
    A[Incoming chat round] --> B[Multi-persona simulation]
    B --> C{Diversity check<br/>retrieval + semantic}
    C -->|Diverse outputs| D[Healthy: proceed]
    C -->|Collapsed outputs| E{Recent entry?}
    E -->|Yes| F[Album rollback]
    E -->|No| G[Recursive Binary Diagnosis]
    F --> D
    G --> D

Multi-persona simulation. Before responding, the agent runs the upcoming interaction under several distinct personas. Each persona produces a candidate retrieval and response (Ma et al., 2026). In a healthy MAS, these candidates differ in expected ways.

Diversity inconsistency as the signal. A VirAE collapses persona-driven diversity: every persona retrieves the same poisoned entry and converges on the same target. You can measure the collapse at the retrieval-result level (which entries get pulled) and the semantic level (what responses say). Ma et al. (2026) report that this diagnostic holds across LLaVA-1.5-7B, InternVL2-8B, InstructBLIP-7B, and Qwen2-VL-8B.
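A minimal sketch of the two-level check, assuming one retrieved entry ID and one response string per persona. The function names, the token-set Jaccard distance (a crude stand-in for an embedding-based distance), the 0.15 threshold, and the rule that both metrics must fall below it are illustrative assumptions, not the paper's definitions.

```python
import math
from collections import Counter
from itertools import combinations


def retrieval_entropy(retrieved_ids):
    """Normalized Shannon entropy of the retrieved entry IDs, in [0, 1]."""
    n = len(retrieved_ids)
    if n < 2:
        return 0.0
    counts = Counter(retrieved_ids)
    entropy = sum((c / n) * math.log(n / c) for c in counts.values())
    return entropy / math.log(n)


def semantic_divergence(responses):
    """Mean pairwise Jaccard distance between response token sets."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0

    def distance(a, b):
        tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
        union = tokens_a | tokens_b
        return 1.0 - len(tokens_a & tokens_b) / len(union) if union else 0.0

    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def is_collapsed(retrieved_ids, responses, threshold=0.15):
    """Flag infection only when both diversity levels fall below threshold."""
    return (retrieval_entropy(retrieved_ids) < threshold
            and semantic_divergence(responses) < threshold)
```

Four personas that all retrieve the same entry and emit the same text score zero on both metrics and are flagged. Four personas that retrieve four different entries score a normalized entropy of 1 and pass.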

Two remediation paths follow:

Album rollback — remove the most recent FIFO entry. Cheap and surgical when the infection just landed.
Recursive Binary Diagnosis (RBD) — partition the album in half, apply the diversity test to each half, recurse on the failing half until subset size ≤3, then remove. Complexity is O(log m · T_sim) for m images.
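The two paths above can be sketched as follows. This assumes the album is a list ordered oldest to newest and that a caller supplies `is_collapsed`, a hypothetical callable that runs the multi-persona diversity test against a subset of the album. Removing the whole current range when neither half reproduces the collapse is a conservative choice made for this sketch, not something the paper specifies.

```python
from typing import Callable, List, Sequence


def album_rollback(album: Sequence[str]) -> List[str]:
    """Drop the most recently added entry (album is ordered oldest to newest)."""
    return list(album[:-1])


def recursive_binary_diagnosis(
    album: Sequence[str],
    is_collapsed: Callable[[Sequence[str]], bool],
    min_subset: int = 3,
) -> List[str]:
    """Bisect the album toward the failing half, then remove that subset."""
    if not is_collapsed(album):
        return list(album)  # healthy album: nothing to remove
    lo, hi = 0, len(album)
    while hi - lo > min_subset:
        mid = (lo + hi) // 2
        if is_collapsed(album[lo:mid]):
            hi = mid          # the failing half is the older one
        elif is_collapsed(album[mid:hi]):
            lo = mid          # the failing half is the newer one
        else:
            break             # neither half fails alone; stop narrowing
    return list(album[:lo]) + list(album[hi:])


if __name__ == "__main__":
    album = [f"img{i}" for i in range(16)]
    cleaned = recursive_binary_diagnosis(album, lambda s: "img11" in s)
    print(cleaned)
```

Each level runs at most two diversity tests and halves the suspect range, which is where the logarithmic factor in the complexity comes from.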

Reported effectiveness

The table below covers the Agent Smith attack (arXiv:2402.08567) under border perturbations (widths h=6,8,10,12) and pixel perturbations (ε=4/255 to 32/255):

Metric and setting	No defense	FLP
Cumulative infection, border h=6	~100%	5.47%
Cumulative infection, most other settings	~100%	0.00%
Current infection rate	up to 100%	0–3.12%

Source: Ma et al., 2026. Retrieval and semantic metrics under FLP "closely match benign baselines", so the diversity preservation that motivated the design holds empirically.

When this pattern applies

Apply FLP-style local purification when all three conditions hold:

Shared retrieval pool across agents — without a propagation channel there is nothing to defend against
Multimodal or otherwise opaque inputs — text-only retrieval is out of scope. The paper restricts evaluation to multimodal MAS and excludes "purely textual interactions or different task types".
Diversity is worth preserving — if a homogenizing defense is acceptable, simpler global filters cost less than per-round simulation

When simpler defenses suffice

Coding-agent fleets with isolated sub-agent contexts already break the contagion channel, because there is no shared pool to poison. Standard isolation and sandboxing contain the threat without per-round simulation overhead, as covered in Blast Radius Containment and Defense-in-Depth Agent Safety. Anthropic's context engineering guidance treats sub-agent isolation as a primary tool for cross-agent failure modes.

Limitations

Inference overhead — per-round multi-persona simulation adds cost. The paper flags this for "large-scale MASs or long interaction processes" (Ma et al., 2026).
Adversary-controlled diagnostic — the simulation runs on the same model class as the agents, so prompt injection against the simulation step can suppress the diversity signal. The paper does not address this.
No provable containment — Gu et al. (2024) state that designing a defense to provably restrain spread "remains an open question". FLP shows empirical reduction, not formal guarantees.
Modality scope — evaluation is multimodal-RAG-specific. Generalizing to tool-use chains or code pipelines requires re-deriving the diversity-signal premise.
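To put the inference-overhead limitation in rough numbers, the sketch below counts simulated model calls. It assumes one model call per persona per diversity test and at most two tests per halving level during Recursive Binary Diagnosis. Both are assumptions drawn from the description above, not figures measured in the paper, and the function names are hypothetical.

```python
import math


def rbd_test_bound(album_size, min_subset=3):
    """Upper bound on diversity tests during bisection (two per level)."""
    levels = 0
    size = album_size
    while size > min_subset:
        size = math.ceil(size / 2)  # worst case: the larger half fails
        levels += 1
    return 2 * levels


def simulation_calls(personas, album_size, min_subset=3):
    """Return (calls per healthy round, worst-case extra calls for RBD)."""
    per_round = personas
    rbd_worst_case = personas * rbd_test_bound(album_size, min_subset)
    return per_round, rbd_worst_case


if __name__ == "__main__":
    print(simulation_calls(personas=4, album_size=16))
```

Under these assumptions the steady-state cost scales with the persona count on every round, while the bisection cost is paid only when a collapse is detected and rollback does not clear it.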

Example

A multimodal customer-support MAS has five agents that share a CLIP-indexed image album. One user upload delivers a poisoned product photo. Without defense, every agent in the fleet returns a malicious response by chat round 24. With FLP wired into each agent:

# per-agent defense config
foresight:
  personas: 4                     # four diverse simulation personas
  diversity_metrics:
    - retrieval_entropy           # which album entries get pulled
    - semantic_divergence         # what responses say
  collapse_threshold: 0.15        # below this, treat as infection signal
remediation:
  recent_window: 1                # most-recent entry → rollback
  fallback: recursive_binary_diagnosis
  rbd_min_subset: 3               # stop recursing at 3 entries

When the user-uploaded VirAE lands in agent A's album, A's next chat round triggers the diagnostic: all four personas retrieve the same entry and converge on the same harmful response. The agent detects the diversity collapse, rolls back the most recent FIFO entry, and answers the next round benignly, so in this scenario the contagion never reaches agents B–E. For scale, Ma et al. (2026) report cumulative infection falling from roughly 100% without defense to at most 5.47% with FLP.
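One way the config above could drive a per-round defense step is sketched below. The `ForesightConfig` dataclass mirrors the illustrative YAML keys, and `diversity_score` is a hypothetical callable that runs the multi-persona simulation against an album subset and returns a score in [0, 1]. Trying rollback first and re-testing before falling back to bisection is one reading of the flow, stated here as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class ForesightConfig:
    personas: int = 4                # consumed by the simulation behind diversity_score
    collapse_threshold: float = 0.15
    recent_window: int = 1
    rbd_min_subset: int = 3


def defend_round(
    album: Sequence[str],
    diversity_score: Callable[[Sequence[str]], float],
    cfg: Optional[ForesightConfig] = None,
) -> Tuple[List[str], str]:
    """Return the (possibly cleaned) album and the action that was taken."""
    cfg = cfg or ForesightConfig()

    def collapsed(subset: Sequence[str]) -> bool:
        return diversity_score(subset) < cfg.collapse_threshold

    if not collapsed(album):
        return list(album), "healthy"

    # Cheap path: drop the most recent entries and re-test.
    rolled = list(album[:len(album) - cfg.recent_window])
    if not collapsed(rolled):
        return rolled, "rollback"

    # Fallback: bisect toward the failing half, then remove that range.
    lo, hi = 0, len(album)
    while hi - lo > cfg.rbd_min_subset:
        mid = (lo + hi) // 2
        if collapsed(album[lo:mid]):
            hi = mid
        elif collapsed(album[mid:hi]):
            lo = mid
        else:
            break
    return list(album[:lo]) + list(album[hi:]), "rbd"
```

In the scenario above, agent A's album ends with the poisoned upload, so the rollback branch fires and the bisection never runs.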

Key Takeaways

Infectious jailbreak is a propagation attack specific to multi-agent systems with shared multimodal retrieval — not a general MAS threat
Global "cure factor" defenses suppress infection by homogenizing responses, destroying the diversity that motivated multi-agent deployment
Local foresight simulation detects infection through persona-driven diversity collapse, preserving healthy heterogeneity
Album rollback handles fresh infections; Recursive Binary Diagnosis localises older ones via O(log m) bisection
For coding-agent topologies with isolated sub-agent contexts, sub-agent isolation already breaks the contagion channel — FLP-grade defense is overkill