Skip to content

Trusting Model-Level Privilege Restraint at Tool Selection

Agents pick higher-privilege tools at selection time when lower-privilege variants suffice — 32-65% on open models — and transient failures push it higher.

The anti-pattern

Provision one tool per capability and trust the model to use it sensibly, and you leave privilege selection to a prior the model was never trained to maintain. ToolPrivBench measures the Over-Privileged Tool Use Rate (OPUR) across 544 scenarios in eight domains, where each task is solvable by either a standard or a higher-privilege variant. Qwen3-8B picks the higher-privilege tool 64.9% of the time; LLaMA-3.1-8B 55.9%; DeepSeek-v3.2 31.8%. Only Claude 4.6 Sonnet drops to about 2.6%, and that is the outlier, not the default behavior for mid-tier or open-weight stacks (Yang et al., 2026).

The failure repeats across five recurring shapes — Authority Escalation, Safety Bypass, Scope Expansion, Data Over-Exposure, and Temporal Persistence — each accounting for 16 to 26% of observations (Yang et al., 2026). GrantBox corroborates this independently. It runs agents against real-world tool integrations and measures an 84.80% attack success rate for privilege misuse under prompt-injection conditions (GrantBox, 2026). Agents do not exercise restraint on production tools even when low-privilege paths exist.

Why it works

The mechanism is a selection prior under reasoning-tax pressure. Tools that advertise broader capability score higher on solving the request, because their descriptions cover a larger envelope. The model picks whichever description anchors most strongly. General safety RLHF tunes refusal of unsafe requests, not selection between two legal tools with different blast radii (Yang et al., 2026).

Transient failures amplify the rate. After a tool returns an error, the agent enters corrective-reasoning mode and reaches for a variant that cannot fail — the highest-privilege one. This is the same dynamic that makes error frames a trusted authority channel. Privilege-aware SFT+RL post-training pulls Qwen3-8B from 64.9% down to 27.0%, and Qwen3-4B-Think to 18.9%, but does not eliminate it (Yang et al., 2026). This is a learnable selection bias, not a missing capability.

Example

Before — one tool per capability, the model picks freely:

# Tool catalog: capability covered once, broadest variant
tools:
  - name: query_db                   # holds read+write+admin
    permissions: [select, insert, update, delete, grant]

A read-only intent like "show last week's orders" goes to query_db. On a transient timeout the agent retries the same tool, holding admin scope for the entire trajectory. The harness logs one query_db call, and nothing on the trace says the actual operation was a SELECT.

After — tiered variants and an explicit escalation gate:

tools:
  - name: query_db_read              # SELECT only
    permissions: [select]
  - name: query_db_write             # add INSERT/UPDATE
    permissions: [select, insert, update]
  - name: query_db_admin             # full set; requires escalation token
    permissions: [select, insert, update, delete, grant]
    requires_escalation: true

# Harness rule: escalation_token only minted after explicit user confirmation
# OR a deterministic classifier ruling the lower tiers insufficient.

The model still chooses, but the catalog forces a sufficient low-privilege option to exist for read intents. Something outside the model's selection prior gates the admin tier. This is permission-framework-over-model applied at the tool-catalog layer.

When this backfires

The corrective discipline — tier every capability, gate every escalation — is over-engineering in four cases:

  • Frontier-tier model on a low-blast-radius path. Claude 4.6 Sonnet's roughly 2.6% OPUR plus a sandboxed runner makes tiered variants almost pure maintenance overhead (Yang et al., 2026).
  • Tools without a meaningful low-privilege twin. An inherently admin-only API gains nothing from a synthetic "try low first" preamble. Every call routes to the high tier anyway, with one guaranteed failure prepended.
  • Ephemeral, credential-free runners. A throwaway container destroyed after the task already bounds the blast radius by environment. Layering tiered tools duplicates blast-radius-containment without adding signal.
  • No transient-failure surface. Idempotent tools that hard-fail with no recoverable error stream remove the post-failure escalation amplifier. The dominant lift the paper measures vanishes.

The pattern is load-bearing when the deployment uses a mid-tier or open-weight model, the tool catalog spans tiers with real blast-radius differences, and the trajectory includes recoverable errors.

Key Takeaways

  • Across mainstream open models, agents select higher-privilege tools 32-65% of the time when a sufficient lower-privilege alternative exists; only frontier-tier RLHF reliably pulls the rate near zero (Yang et al., 2026).
  • The fix is structural: tiered tool variants per capability plus an explicit escalation gate outside the model's selection prior — not "tell the agent to prefer low privilege" in the system prompt (GrantBox, 2026).
  • Privilege-aware SFT+RL reduces but does not eliminate the rate; treat it as a complement to harness-layer gating, not a replacement.
Feedback