Constraint Drift: Why Safety Must Be Maintained, Not Asserted

Prompt-encoded safety constraints drift across memory, delegation, communication, tool use, audit, and optimization; treat them as runtime state that stays fresh, inherited, enforceable, and auditable.

The drift problem

A multi-agent system can produce a compliant final answer while leaking private information through an internal message, delegating authority beyond scope, calling a tool with sensitive context, or losing the evidence needed to reconstruct why an action was allowed (Li et al., 2026). The output passes review; the trajectory does not.

Constraints sit in the same medium as every other prompt token — natural language — so they face the same degradation pressures: positional decay, paraphrasing during inter-agent forwarding, summarization during compaction, and reward pressure during optimization. The signal weakens at the rate of ordinary context, but its semantic load is much higher. One weakened clause changes which actions are permitted (Anthropic: effective context engineering).

Six drift surfaces

Li et al. (2026) enumerate six runtime dimensions along which constraints drift:

Surface | Drift mechanism | Concrete failure
--- | --- | ---
Memory | Positional decay in long histories; summarization during compaction | Initial spending limit gets buried as the conversation grows; the agent later quotes a higher cap
Delegation | Subordinate agent receives the task but not the constraint scope | Orchestrator enforces a deny-list; a worker spawned without it calls the denied tool
Communication | Constraints encoded in prose get paraphrased across handoffs | Reviewer's "do not approve PRs touching /auth" becomes "be careful with auth PRs" downstream
Tool use | Tool gateway operates outside the agent's constraint model | Code-exec tool runs the script the agent generated under a constraint the tool never saw
Audit | Log lacks the constraint state at decision time | Post-hoc review cannot reconstruct why an action was permitted
Optimization | Reward signal pulls behavior toward task completion at the cost of constraint adherence | Fine-tuned model trades a small safety margin for measurable utility gains
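
The communication row suggests one mitigation: forward the constraint as structured data with a digest, so that a paraphrase fails verification instead of passing silently. The following is a minimal sketch under that assumption. `Constraint`, `seal`, and `verify` are hypothetical names introduced for illustration and do not come from Li et al.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Constraint:
    # Hypothetical structured form of a rule such as
    # "do not approve PRs touching /auth".
    action: str
    effect: str          # "deny" or "allow"
    path_prefix: str


def seal(constraint: Constraint) -> dict:
    """Serialize the constraint canonically and attach a SHA-256 digest."""
    payload = json.dumps(asdict(constraint), sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"payload": payload, "digest": digest}


def verify(envelope: dict) -> Constraint:
    """Rebuild the constraint, rejecting any envelope whose payload changed."""
    expected = hashlib.sha256(envelope["payload"].encode("utf-8")).hexdigest()
    if expected != envelope["digest"]:
        raise ValueError("constraint payload changed in transit")
    return Constraint(**json.loads(envelope["payload"]))


original = Constraint(action="approve_pr", effect="deny", path_prefix="/auth")
envelope = seal(original)
assert verify(envelope) == original

# A downstream hop that "softens" the rule no longer matches the digest.
tampered = dict(envelope, payload=envelope["payload"].replace("deny", "allow"))
try:
    verify(tampered)
    drift_detected = False
except ValueError:
    drift_detected = True
```

A plain digest only catches accidental rewriting. An agent that recomputes the digest after editing the payload would pass, so an adversarial setting would need a keyed MAC or a signature held outside the agents.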

This taxonomy maps cleanly onto the four-mode audit-record divergence invariant and its controls-mapping view (Metere, 2026): F1 gate-bypass surfaces as tool-use and delegation drift, F2 audit-forgery as audit drift, F3 partial failure as memory drift, F4 wrong-target as delegation drift in inheritance chains.

Four invariant properties

A constraint that survives the trajectory satisfies four properties at once (Li et al., 2026 §3):

  • Fresh — re-validated at each decision point against the current state, not read once at the start.
  • Inherited — propagates through delegation and sub-agent spawning. The child cannot exceed the parent's scope.
  • Enforceable — built into a deterministic runtime channel such as a gateway, hook, or sandbox, not left to model adherence to prose.
  • Auditable — the constraint state at the moment of each action is recoverable from the log.

A constraint that fails any one of these has effectively drifted, even if the natural-language statement is still present in context. The four properties are jointly necessary: satisfying three of them still leaves the constraint drifted.

graph LR
    A[Constraint declared] --> B{Fresh?}
    B -->|no| X[Drifted]
    B -->|yes| C{Inherited?}
    C -->|no| X
    C -->|yes| D{Enforceable?}
    D -->|no| X
    D -->|yes| E{Auditable?}
    E -->|no| X
    E -->|yes| F[Operative]
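
The decision chain in the diagram can be sketched as a small gateway that checks each property before a tool call. This is an illustrative sketch only: `ConstraintState`, `Gateway`, and the 60-second freshness window are assumptions made for the example, not structures defined by Li et al.

```python
import time
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ConstraintState:
    allowed_tools: frozenset
    version: int
    validated_at: float          # when the state was last re-validated

    def derive_child(self, requested_tools):
        # Inherited: a child receives at most the parent's scope.
        return ConstraintState(
            allowed_tools=self.allowed_tools & frozenset(requested_tools),
            version=self.version,
            validated_at=self.validated_at,
        )


@dataclass
class Gateway:
    max_age_s: float = 60.0      # hypothetical freshness window
    audit_log: list = field(default_factory=list)

    def call(self, state, tool, now=None):
        now = time.time() if now is None else now
        # Fresh: stale state fails closed instead of being trusted.
        fresh = (now - state.validated_at) <= self.max_age_s
        permitted = fresh and tool in state.allowed_tools
        # Auditable: record the constraint state alongside every decision.
        self.audit_log.append({
            "tool": tool,
            "permitted": permitted,
            "fresh": fresh,
            "allowed_tools": sorted(state.allowed_tools),
            "version": state.version,
            "at": now,
        })
        # Enforceable: the gateway, not the model, makes the decision.
        return permitted


parent = ConstraintState(frozenset({"read_file", "search"}), version=1, validated_at=1000.0)
child = parent.derive_child({"read_file", "send_email"})
gw = Gateway()

results = [
    gw.call(child, "read_file", now=1010.0),   # in scope and fresh
    gw.call(child, "send_email", now=1010.0),  # never in the parent's scope
    gw.call(child, "read_file", now=2000.0),   # stale state
]
```

Only the first call is permitted. The second fails on inheritance and the third on freshness, and all three leave a log entry that carries the constraint state in force at decision time.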

When constraint state governance is worth it

The overhead of maintaining the four-property invariant grows with system complexity, so it is not free to adopt everywhere. It is warranted when three conditions combine:

  1. Deep delegation chains. Orchestrator-worker fan-out where subordinate agents make consequential decisions (agent handoff protocols).
  2. Persistent memory across sessions. State that carries between runs creates a trojan-hippo drift surface.
  3. Wide tool surface with consequential actions. Any tool that writes, sends, pays, or shares is a drift target.

Below these thresholds, well-placed component checks suffice. A short-horizon, single-agent linter with one tool surface and stateless invocation has no drift surface: its constraints live in the tool gateway, and adding a constraint state object duplicates enforcement without preventing any additional failure mode. The Lifecycle-Integrated Security Architecture provides the complementary layered-defense view (Lin et al., 2026).
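
The three conditions above can be expressed as a simple applicability check. The sketch below is hedged: the source gives no numeric thresholds, so `min_depth` and `min_tools` are placeholder values, and `SystemProfile` is a hypothetical structure used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    delegation_depth: int          # 0 = single agent, no sub-agents
    persistent_memory: bool        # state carried across sessions
    consequential_tools: int       # tools that write, send, pay, or share


def needs_constraint_state(profile, min_depth=2, min_tools=2):
    """True only when all three conditions combine.

    The default thresholds are placeholders, not values from the source.
    """
    return (
        profile.delegation_depth >= min_depth
        and profile.persistent_memory
        and profile.consequential_tools >= min_tools
    )


linter = SystemProfile(delegation_depth=0, persistent_memory=False, consequential_tools=1)
orchestrator = SystemProfile(delegation_depth=3, persistent_memory=True, consequential_tools=5)

linter_needs_it = needs_constraint_state(linter)
orchestrator_needs_it = needs_constraint_state(orchestrator)
```

Under these assumed thresholds the stateless linter falls below the bar and the orchestrator-worker system clears it. A real deployment would set the thresholds from its own risk assessment.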

Mapping to existing controls

Each invariant property maps to controls already established on the site:

Property | Realized by
--- | ---
Fresh | Fail-closed remote settings enforcement, provenance-aware decision auditing
Inherited | Task scope as security boundary, scoped credentials via proxy, permission-gated commands
Enforceable | Action-selector pattern, CaMeL control/data flow, MCP runtime control plane
Auditable | Cryptographic governance audit trail, audit-record divergence invariant

The contribution of the constraint-drift framing is not new mechanisms but a coverage check: a system that lacks any one row has a drift surface a determined attacker — or a long-running trajectory — will reach.
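
The coverage check can be sketched as a lookup over the deployed controls. This is a minimal illustration: `missing_properties` and the `deployed` inventory are hypothetical, with control names borrowed from the mapping in this section.

```python
PROPERTIES = ("fresh", "inherited", "enforceable", "auditable")


def missing_properties(controls):
    """Return the properties for which no control is deployed.

    `controls` maps a property name to the list of controls realizing it.
    """
    return [p for p in PROPERTIES if not controls.get(p)]


# Hypothetical inventory for one system.
deployed = {
    "fresh": ["fail-closed remote settings enforcement"],
    "inherited": ["scoped credentials via proxy"],
    "enforceable": ["action-selector pattern"],
    "auditable": [],   # no audit control deployed yet
}

gaps = missing_properties(deployed)
```

Here `gaps` contains only `"auditable"`, which marks the audit row as the open drift surface for this hypothetical system.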

Key Takeaways

  • Constraints encoded only in natural-language prompts drift at the rate of ordinary context decay; the four-property invariant moves them out of the lossy channel into deterministic runtime state.
  • Six surfaces (memory, delegation, communication, tool use, audit, optimization) are the runtime dimensions Li et al. (2026) enumerate for constraint drift.
  • The four properties (fresh, inherited, enforceable, auditable) are necessary together; one failing leaves an open drift surface even if the prose is intact.
  • Apply the framework when deep delegation, persistent memory, and a wide tool surface combine. Below that threshold, a typed tool gateway plus an audit log is sufficient.