Blind Tool Deference: Agents Parroting Callable Tools
Agents adopt a callable tool's output wholesale instead of judging it, and stronger backbones defer more, not less.
The Anti-Pattern
The assumption is that an LLM agent exercises judgment over a callable tool: picking the right one, weighing the answer against context, overriding it when other signals disagree. When a frozen GNN is exposed as a tool to a ReAct-style agent on node classification (ogbn-arxiv, replicated on WikiCS), agreement with the raw tool output sits at 97.6-99.2% across 5 seeds. The agent collapses into a "GNN parrot" that bypasses its own reasoning (Wang & Vemuri, 2026).
The capability sweep breaks intuition. On Qwen2.5 from 1.5B to 7B, agreement rises from 0.60 to 0.98 — stronger backbones defer more, not less (Wang & Vemuri, 2026). "Use a bigger model to get better tool judgment" fails here.
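The agreement figure above is straightforward to measure in your own harness. The following is a minimal sketch, assuming you have logged the agent's final label and the raw tool label for each item and seed; the function names and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean


def agreement_rate(agent_labels, tool_labels):
    """Fraction of items where the agent's final answer equals the raw tool output."""
    if len(agent_labels) != len(tool_labels):
        raise ValueError("label lists must be the same length")
    if not agent_labels:
        raise ValueError("need at least one prediction")
    matches = sum(a == t for a, t in zip(agent_labels, tool_labels))
    return matches / len(agent_labels)


def agreement_by_seed(runs):
    """runs maps seed -> (agent_labels, tool_labels); returns seed -> agreement."""
    return {seed: agreement_rate(a, t) for seed, (a, t) in runs.items()}


# Toy data, invented for illustration only (not the paper's results).
runs = {
    0: (["cs.LG", "cs.CV", "cs.CL", "cs.LG"], ["cs.LG", "cs.CV", "cs.CL", "cs.LG"]),
    1: (["cs.LG", "cs.CV", "cs.CL", "cs.AI"], ["cs.LG", "cs.CV", "cs.CL", "cs.LG"]),
}
per_seed = agreement_by_seed(runs)
print(per_seed, mean(per_seed.values()))
```

An agreement rate close to 1.0 on its own is not a defect; it becomes one when the agent is expected to catch the tool's errors and never departs from the tool on the cases where the tool is wrong.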
The shape generalises beyond GNNs. Any deterministic sub-model an agent calls — a linter, type checker, SAST scanner, semantic-search index, classifier sub-agent — is a candidate for the same wholesale adoption when nothing in the planner's context could contradict the tool.
Why It Works
The mechanism is automation bias / complacency from human-factors research: a confident-looking automated output reallocates attention away from cross-checking (Parasuraman & Manzey, 2010). With no orthogonal signal about the GNN's per-node confidence, the cheapest policy is to narrate the tool's output; stronger backbones reach that minimum-loss policy more reliably, which is why deference rises with capability (Wang & Vemuri, 2026).
The latent capacity is there but unused: tool necessity is linearly decodable from pre-generation representations at AUROC 0.89-0.96, materially above the model's verbalized reasoning (Hung et al., 2026). Nothing in the standard tool-call interface surfaces it.
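To make "linearly decodable at AUROC 0.89-0.96" concrete, here is a toy sketch of how a linear probe is scored. It assumes you already have one hidden-state vector per query and a binary "tool was needed" label; the three-dimensional vectors, the weights, and the helper names are invented for illustration and do not reproduce the cited setup.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def probe_score(hidden, weights, bias=0.0):
    """Linear probe: a dot product over a hidden-state vector plus a bias."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


# Toy 3-d "hidden states" and probe weights, invented for illustration.
hidden_states = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.7, 0.3, 0.2], [0.1, 0.9, 0.4]]
tool_needed = [1, 0, 1, 0]
weights = [1.0, -1.0, 0.0]

scores = [probe_score(h, weights) for h in hidden_states]
print(auroc(scores, tool_needed))
```

In a real probe the weights would be fitted on held-out activations rather than written by hand; the point of the sketch is only that the score is a single linear readout, which the standard tool-call interface never exposes.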
graph LR
    A["Agent reasoning"] -->|"call(tool, input)"| B["Callable tool<br/>(GNN / linter / classifier)"]
    B -->|"confident output"| C["Agent context"]
    C -->|"no orthogonal signal"| D["Wholesale adoption<br/>97.6-99.2% agreement"]
    style D fill:#b60205,color:#fff
The author-stated limit closes the loop: "reliable selective invocation looks limited by available information, not merely router design" (Wang & Vemuri, 2026). The agent cannot judge what it has no second source for.
Example
Before — single-source pipeline dressed as a two-stage one:
# Agent narrates whatever the tool returns; no cross-check
def classify_node(node_id: str) -> str:
    label = gnn_tool.predict(node_id)  # deterministic tool
    return agent.respond(
        f"The GNN predicts label={label}. Final answer: {label}."
    )
The agent prompt asks the model to "use its judgment" over label, but no orthogonal signal is in scope. Agreement with gnn_tool.predict is effectively 1.0; the agent layer is decorative.
After — wire an external check the agent can actually use:
def classify_node(node_id: str) -> str:
    label = gnn_tool.predict(node_id)
    neighbours = graph.neighbors(node_id)
    neighbour_labels = [gnn_tool.predict(n) for n in neighbours]
    return agent.respond(
        f"GNN predicts label={label}. "
        f"Neighbour labels: {neighbour_labels}. "
        "If the neighbourhood disagrees with the predicted label, "
        "flag low confidence and return label + a verification request."
    )
The second prompt gives the agent a signal it can act on. The fix is not "distrust the tool" — it is "give the agent something the tool's output can be wrong against."
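The neighbourhood comparison can also be computed in the harness, so the agent receives an explicit disagreement flag instead of having to infer one from a raw list. This is a minimal sketch under stated assumptions: `neighbourhood_check` and the `min_support` threshold of 0.5 are hypothetical and would need tuning on your own data.

```python
from collections import Counter


def neighbourhood_check(label, neighbour_labels, min_support=0.5):
    """Compare a predicted label with the labels predicted for neighbouring nodes.

    Returns (support, flagged): support is the share of neighbours carrying the
    same label, flagged is True when support falls below min_support.
    An isolated node has no evidence either way, so it is flagged.
    """
    if not neighbour_labels:
        return 0.0, True
    counts = Counter(neighbour_labels)
    support = counts[label] / len(neighbour_labels)
    return support, support < min_support


# Toy labels, invented for illustration.
print(neighbourhood_check("cs.LG", ["cs.LG", "cs.LG", "cs.CV"]))
print(neighbourhood_check("cs.LG", ["cs.CV", "cs.CV", "cs.LG", "cs.CL"]))
```

Because the neighbour labels in the example come from the same tool, they are only a partial second source. Where ground-truth labels exist for some neighbours, passing those in instead is likely to give a more independent check.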
When This Backfires
Treating every callable as suspect is its own anti-pattern. The "blind deference" framing is over-broad in four cases:
- Tool more accurate than the agent on the modal case. For a verified deterministic tool (compiler, type checker, formatter) in a domain where the agent's own reasoning is weaker, high agreement is the target behaviour, and second-guessing burns tokens for nothing.
- No disambiguating signal in scope. If nothing could contradict the tool — no second tool, test, or spec — "add verification" is a slogan; Wang & Vemuri's "limited by available information" caveat applies (Wang & Vemuri, 2026).
- Cheap downstream gate. When the tool feeds CI, code review, or a test suite, re-judging every call duplicates a check that already exists and adds cost without adding safety.
- Calibrated tool with confidence bands. A classifier returning (label, p) lets the harness route low-confidence cases without involving the agent; classifier-gated routing is the better lever.
The pattern is a problem specifically when (a) the tool has known unreliability bands, (b) no orthogonal signal is in scope, and (c) the harness pretends the agent is the second-opinion layer.
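For the calibrated-tool case, the routing decision can live entirely in the harness. The sketch below assumes the classifier returns a calibrated probability alongside its label; `Prediction`, `route`, and the 0.8 threshold are hypothetical names and values chosen for illustration.

```python
from typing import NamedTuple


class Prediction(NamedTuple):
    label: str
    p: float  # calibrated probability for the label


def route(pred, threshold=0.8):
    """Return 'accept' for confident predictions, 'escalate' otherwise."""
    if not 0.0 <= pred.p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return "accept" if pred.p >= threshold else "escalate"


# Toy predictions, invented for illustration.
print(route(Prediction("cs.LG", 0.95)))
print(route(Prediction("cs.LG", 0.55)))
```

Only the escalated cases need a second source (a human, another tool, or an agent with extra evidence), which keeps the agent out of the path where it would merely repeat the tool.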
Key Takeaways
- Wrapping a deterministic tool in an LLM agent does not add a judgment layer. Empirical agreement on a GNN-tool setup is 97.6-99.2%; treat the agent as a narrator, not a reviewer (Wang & Vemuri, 2026).
- Capability scaling makes deference worse, not better — agreement climbs from 0.60 to 0.98 across 1.5B → 7B. A bigger backbone is not the fix (Wang & Vemuri, 2026).
- Verification needs an orthogonal signal. If nothing in scope could contradict the tool, add a second source or a calibrated confidence stream — prompting the agent to "be critical" of a tool it has no way to disagree with is performance, not safety.
- The same shape applies to linters, type checkers, SAST scanners, and classifier sub-agents — anywhere a confident structured return reaches an agent without a second source.
Related
- Trust Without Verify — operator-level analogue: humans accepting polished agent output without independent verification.
- Trusting Tool Error Messages as Implicit Authority — adjacent failure on the error stream rather than the success path.
- The Yes-Man Agent — sycophancy toward user requests; this page is sycophancy toward tool returns.
- Incremental Verification — the constructive counterpart: build the orthogonal signal an agent can actually verify against.
- Inference-Time Tool Call Reviewer — pattern that surfaces a separate review pass over tool calls before they propagate.