Defense-in-Depth Agent Safety¶

Layer multiple independent safety mechanisms so no single failure point can compromise an autonomous agent's behavior.

Learn it hands-on: No Single Layer Holds — guided lesson with quizzes.

Why layers matter¶

Any single safety mechanism can fail: prompt injection bypasses prompt guardrails, runtime checks miss edge cases, and approval gates invite fatigue-driven rubber-stamping. Defense-in-depth assumes every layer will eventually fail and arranges the layers so that each catches what the others miss. Perplexity's response to NIST's AI-agent security RFI reaches the same conclusion: "No single layer is sufficient on its own; the non-deterministic nature of LLM reasoning ensures that any individual defense can be circumvented under sufficiently adaptive attack strategies" (Li et al., 2026).

The OPENDEV agent runs five independent safety layers, each operating at a different level of the stack (Bui, 2026 §2.1):

Prompt guardrails — safety instructions in the system prompt
Schema restrictions — subagents see only tools in their allowlist
Runtime approvals — user confirmation before dangerous operations
Tool validation — the agent validates inputs before execution
Lifecycle hooks — pre-tool hooks can block execution with an explanation

Each layer is independent. If one fails, the others still hold (Bui, 2026 §2.1).

Schema-level tool filtering¶

The strongest form of tool restriction stops the model from even knowing a tool exists. When a subagent's schema excludes write operations, the model cannot hallucinate calls to tools it has never seen (Bui, 2026 §3.3).

This is stronger than runtime rejection. A runtime check denies a forbidden call after the fact. Schema filtering stops the model from ever forming the intent. The attack surface shrinks before inference. See Subagent Schema-Level Tool Filtering for implementation details.
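A minimal sketch of the idea, assuming a harness that keeps a registry of tool definitions and builds each subagent's schema from an allowlist. The names `ToolSchema`, `ALL_TOOLS`, `visible_tools`, and `dispatch` are hypothetical and do not come from OPENDEV or any specific framework:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSchema:
    """Hypothetical stand-in for a tool definition sent to the model."""
    name: str
    description: str


# Hypothetical registry; a real harness would load this from its tool layer.
ALL_TOOLS = [
    ToolSchema("read_file", "Read a file from the workspace"),
    ToolSchema("write_file", "Write a file to the workspace"),
    ToolSchema("run_command", "Run a shell command"),
]


def visible_tools(allowlist: set[str]) -> list[ToolSchema]:
    """Return only the schemas the subagent may see; the rest are omitted."""
    return [tool for tool in ALL_TOOLS if tool.name in allowlist]


def dispatch(tool_name: str, allowlist: set[str]) -> str:
    """Runtime backstop: refuse names outside the allowlist if one is emitted anyway."""
    if tool_name not in allowlist:
        return f"rejected: {tool_name} is not available to this subagent"
    return f"dispatched: {tool_name}"


# A read-only reviewer subagent never receives the write or shell schemas.
reviewer_allowlist = {"read_file"}
reviewer_schema = visible_tools(reviewer_allowlist)
```

The sketch keeps a runtime check in `dispatch` alongside the filtered schema, so a call to an unlisted tool is still refused if the model produces one from prior knowledge.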

Three-level approval system¶

Runtime approvals use three levels (Bui, 2026 §2.4.1):

Manual — every tool call needs explicit user approval
Semi-auto — only dangerous commands need approval; safe patterns run freely
Auto — all tool calls are approved without user interaction

Approval persistence prevents fatigue: users grant blanket permission for safe patterns, and the agent remembers these grants across turns (Bui, 2026 §3.3). Pattern-based rules match command prefixes, danger patterns, and command types. Without persistence, repeated approval prompts train users to rubber-stamp everything, which undermines the safety layer entirely.
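The three levels and persistent grants might be sketched as follows. This is an illustration under stated assumptions, not the OPENDEV implementation: the `ApprovalGate` class, the `DANGER_PATTERNS` list, and the rule that a remembered prefix grant suppresses the prompt in semi-auto mode are all hypothetical choices made for the example.

```python
from enum import Enum


class ApprovalLevel(Enum):
    MANUAL = "manual"
    SEMI_AUTO = "semi-auto"
    AUTO = "auto"


# Hypothetical danger patterns; a real list would be longer and reviewed.
DANGER_PATTERNS = ("rm -rf", "sudo ", "git push")


class ApprovalGate:
    def __init__(self, level: ApprovalLevel) -> None:
        self.level = level
        # Grants live on the gate object, so they persist across turns.
        self.granted_prefixes: set[str] = set()

    def grant(self, prefix: str) -> None:
        """Remember a blanket approval for commands starting with this prefix."""
        self.granted_prefixes.add(prefix)

    def needs_prompt(self, command: str) -> bool:
        """Return True when the user must confirm this command."""
        if self.level is ApprovalLevel.AUTO:
            return False
        if self.level is ApprovalLevel.MANUAL:
            return True
        # SEMI_AUTO: remembered grants and non-dangerous commands run freely.
        if any(command.startswith(prefix) for prefix in self.granted_prefixes):
            return False
        return any(pattern in command for pattern in DANGER_PATTERNS)


gate = ApprovalGate(ApprovalLevel.SEMI_AUTO)
first_ask = gate.needs_prompt("git push origin staging")   # prompts once
gate.grant("git push origin staging")
second_ask = gate.needs_prompt("git push origin staging")  # remembered
```

Scoping the grant to a narrow prefix matters here: approving `git push origin staging` does not silence the prompt for `git push origin main`.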

Designing for approximate outputs¶

Agents produce approximate outputs. Safety-conscious harness design accounts for this rather than treating it as a bug (Bui, 2026 §3.4):

Promote server commands to background tasks when the LLM misformats long-running commands
Install missing dependencies when the agent produces incomplete execution plans

These compensations reduce friction without weakening safety boundaries.
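As an illustration of the first compensation, a harness could detect commands that look like long-running servers and mark them for background execution. The patterns and the `plan_execution` helper below are hypothetical, chosen only to show the shape of the check:

```python
import re

# Hypothetical patterns for commands that start a long-running server.
SERVER_PATTERNS = [
    re.compile(r"^npm (run )?(start|dev)\b"),
    re.compile(r"^python3? -m http\.server\b"),
    re.compile(r"^flask run\b"),
]


def plan_execution(command: str) -> dict:
    """Decide how to run a command; promote likely servers to background tasks."""
    stripped = command.strip()
    already_background = stripped.endswith("&")
    looks_like_server = any(p.search(stripped) for p in SERVER_PATTERNS)
    return {
        # Drop a trailing "&" so the harness controls backgrounding itself.
        "command": stripped.rstrip("&").strip(),
        "background": already_background or looks_like_server,
    }


plan = plan_execution("npm run dev")
```

The rewrite only changes how the command is scheduled. In this sketch the resulting command would still pass through validation, hooks, and approvals like any other call, which is what keeps the compensation from weakening a safety boundary.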

Layer interactions¶

The layers reinforce each other:

Schema filtering reduces the surface area that runtime approvals must cover
Lifecycle hooks catch what prompt guardrails miss
Tool validation catches what schema filtering does not address (valid tool, invalid inputs)
Approval gates provide human oversight for operations that pass all automated checks

No single layer is enough. Together, the layers provide safety properties that no one mechanism achieves by itself.
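One way to picture these interactions is a chain of independent checks, where each layer either passes a tool call on or blocks it with a reason. The sketch below is hypothetical throughout: the layer functions, tool names, and rules are invented for illustration, and the prompt guardrail is absent because it lives inside the model rather than in harness code.

```python
from typing import Callable, Optional

# Each layer returns None to pass the call on, or a string explaining the block.
Layer = Callable[[str, dict], Optional[str]]


def schema_layer(tool: str, args: dict) -> Optional[str]:
    allowed = {"read_file", "run_command"}  # hypothetical subagent allowlist
    return None if tool in allowed else f"schema: {tool} is not exposed"


def validation_layer(tool: str, args: dict) -> Optional[str]:
    # Valid tool, invalid input: the case schema filtering cannot address.
    if tool == "read_file" and ".." in args.get("path", ""):
        return "validation: path escapes the workspace"
    return None


def hook_layer(tool: str, args: dict) -> Optional[str]:
    if tool == "run_command" and "prod" in args.get("command", "").split():
        return "hook: command targets production"
    return None


def approval_layer(tool: str, args: dict) -> Optional[str]:
    # Stand-in for a human prompt: hold destructive commands for confirmation.
    if tool == "run_command" and args.get("command", "").startswith("rm "):
        return "approval: user confirmation required"
    return None


LAYERS: list[Layer] = [schema_layer, validation_layer, hook_layer, approval_layer]


def check(tool: str, args: dict) -> Optional[str]:
    """Run every layer in order; the first objection blocks the call."""
    for layer in LAYERS:
        reason = layer(tool, args)
        if reason is not None:
            return reason
    return None
```

Because no layer reads another layer's state, removing or misconfiguring one leaves the rest working, which is the independence property the list above relies on.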

Example¶

The following Claude Code agent definition shows three of the five safety layers applied at the agent configuration level: prompt guardrails, schema-level tool restrictions, and a lifecycle hook reference.

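---
name: deployment-agent
description: Deploys application builds to staging, never to production
# Field names below follow Claude Code's subagent frontmatter as best understood;
# verify them against the version you run.
tools: Bash, Read
# Write and Edit are left out of the allowlist, so the model never sees a
# file-editing tool. Bash can still change files or run git, which is why the
# hook below inspects every Bash command.
disallowedTools: Write, Edit
hooks:
  PreToolUse:
    - matcher: "Bash"
      hooks:
        - type: command
          command: .claude/hooks/block-prod-commands.sh
---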

You are a deployment agent for the staging environment only.
Never run commands that target production (prod, prd, live).
If a command targets production, refuse and explain why.

The corresponding block-prod-commands.sh hook implements the lifecycle-hook layer as a deterministic check that runs outside the model, independent of the prompt guardrail:

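#!/usr/bin/env bash
# Pre-tool hook: block any Bash command containing a production environment identifier.
# Assumes Claude Code's hook contract: the tool call arrives as JSON on stdin,
# and exit code 2 blocks the call and returns stderr to the model.
# Requires jq; if jq is missing, COMMAND is empty and the hook lets the call through.
COMMAND="$(jq -r '.tool_input.command // empty')"
if printf '%s\n' "$COMMAND" | grep -qE '\b(prod|prd|live)\b'; then
  echo "BLOCKED: command targets a production environment" >&2
  exit 2
fi
exit 0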

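Even if injection bypasses the prompt guardrail, the hook still blocks commands that name a production environment. Schema filtering removes the file-editing tools from the model's view, but it does not make the agent read-only: Bash stays in the allowlist, so file changes and git operations remain possible through the shell, and the hook and approval layers have to cover that path. Each layer catches what the others miss.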

When this backfires¶

Each layer adds configuration, testing, and maintenance cost. Misconfigured layers can block legitimate work, or create false confidence while staying ineffective.

Approval fatigue compounds across layers. If every layer raises its own prompts, users approve everything to keep moving, which turns the stack into security theater. The three-level system helps only when you classify safe patterns correctly upfront.
Schema filtering limits legitimate capability. Narrow subagent schemas cannot adapt outside their defined scope. In exploratory or general-purpose work, strict schema restrictions force constant operator intervention, or fan-out into specialized agents where one broader agent would do.
Hooks and validation add latency. In streaming, high-frequency, or real-time pipelines, per-call lifecycle hooks add up. A single well-tuned approval gate may beat five independent layers that each add inspection overhead.

Apply the full five-layer stack to production agents with write access, external integrations, or multi-agent pipelines. For short-lived, read-only, or sandboxed internal tools, one or two targeted layers (schema restrictions plus lifecycle hooks) often give enough protection at lower cost.

Key Takeaways¶

Five independent safety layers: prompt guardrails, schema restrictions, runtime approvals, tool validation, lifecycle hooks
Schema-level filtering is stronger than runtime rejection — the model cannot call tools it cannot see
Approval persistence prevents fatigue-driven rubber-stamping
Design for approximate outputs — compensate for LLM imprecision at the harness level
Each layer assumes the others will fail; the architecture tolerates individual layer failures

Single-Layer Prompt Injection Defence — the anti-pattern this addresses
Subagent Schema-Level Tool Filtering — schema layer in depth
Human-in-the-Loop Confirmation Gates — approval layer in depth
Hooks for Enforcement vs Prompts for Guidance — lifecycle-hook layer vs prompt layer
Deterministic Guardrails — runtime validation layer
Blast Radius Containment: Least Privilege for AI Agents — scopes what any single layer failure can damage
Lethal Trifecta Threat Model — the threat model layers are defending against
Enterprise Agent Hardening — applying these layers at organizational scale