Blind Tool Deference: Agents Parroting Callable Tools

Agents adopt a callable tool's output wholesale instead of judging it, and stronger backbones defer more, not less.

The Anti-Pattern

The assumption is that an LLM agent exercises judgment over a callable tool: picking the right one, weighing the answer against context, and overriding it when other signals disagree. When a frozen GNN is exposed as a tool to a ReAct-style agent on node classification (ogbn-arxiv, replicated on WikiCS), the agent's answers agree with the raw tool output 97.6-99.2% of the time across 5 seeds. The agent collapses into a "GNN parrot" that bypasses its own reasoning (Wang & Vemuri, 2026).

The capability sweep breaks intuition. On Qwen2.5 from 1.5B to 7B, agreement rises from 0.60 to 0.98 — stronger backbones defer more, not less (Wang & Vemuri, 2026). "Use a bigger model to get better tool judgment" fails here.
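
The agreement metric itself is easy to compute on your own harness. Below is a minimal sketch, assuming you have logged the tool's raw label and the agent's final label for each example; the function name `agreement_rate` and the sample labels are made up for illustration and are not taken from the paper.

```python
from typing import Sequence


def agreement_rate(tool_labels: Sequence[str], agent_labels: Sequence[str]) -> float:
    """Fraction of examples where the agent's final answer equals the raw tool output."""
    if len(tool_labels) != len(agent_labels):
        raise ValueError("tool_labels and agent_labels must have the same length")
    if not tool_labels:
        raise ValueError("need at least one example")
    matches = sum(t == a for t, a in zip(tool_labels, agent_labels))
    return matches / len(tool_labels)


# Hypothetical logged labels for five nodes (illustrative data only).
tool = ["cs.LG", "cs.CV", "cs.CL", "cs.LG", "cs.DB"]
agent = ["cs.LG", "cs.CV", "cs.CL", "cs.LG", "cs.IR"]
rate = agreement_rate(tool, agent)  # the agent overrides the tool on one node
```

A rate close to 1.0 does not by itself say the agent is wrong; it says the agent layer is contributing almost no independent decisions, which is the condition this page describes.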

The shape generalises beyond GNNs. Any deterministic sub-model an agent calls — a linter, type checker, SAST scanner, semantic-search index, classifier sub-agent — is a candidate for the same wholesale adoption when nothing in the planner's context could contradict the tool.

Why It Works

The mechanism is automation bias / complacency from human-factors research: a confident-looking automated output reallocates attention away from cross-checking (Parasuraman & Manzey, 2010). With no orthogonal signal about the GNN's per-node confidence, the cheapest policy is to narrate the tool's output; stronger backbones reach that minimum-loss policy more reliably, which is why deference rises with capability (Wang & Vemuri, 2026).

The latent capacity is there but unused: tool necessity is linearly decodable from pre-generation representations at AUROC 0.89-0.96, materially above the model's verbalized reasoning (Hung et al., 2026). Nothing in the standard tool-call interface surfaces it.

graph LR
    A["Agent reasoning"] -->|"call(tool, input)"| B["Callable tool<br/>(GNN / linter / classifier)"]
    B -->|"confident output"| C["Agent context"]
    C -->|"no orthogonal signal"| D["Wholesale adoption<br/>97.6-99.2% agreement"]

    style D fill:#b60205,color:#fff

The author-stated limit closes the loop: "reliable selective invocation looks limited by available information, not merely router design" (Wang & Vemuri, 2026). The agent cannot judge what it has no second source for.

Example

Before — single-source pipeline dressed as a two-stage one:

# Agent narrates whatever the tool returns; no cross-check
def classify_node(node_id: str) -> str:
    label = gnn_tool.predict(node_id)          # deterministic tool
    return agent.respond(
        f"The GNN predicts label={label}. Final answer: {label}."
    )

The agent prompt asks the model to "use its judgment" over label, but no orthogonal signal is in scope. Agreement with gnn_tool.predict is effectively 1.0; the agent layer is decorative.

After — wire an external check the agent can actually use:

def classify_node(node_id: str) -> str:
    label = gnn_tool.predict(node_id)
    neighbours = graph.neighbors(node_id)
    neighbour_labels = [gnn_tool.predict(n) for n in neighbours]
    return agent.respond(
        f"GNN predicts label={label}. "
        f"Neighbour labels: {neighbour_labels}. "
        "If the neighbourhood disagrees with the predicted label, "
        "flag low confidence and return label + a verification request."
    )

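The second prompt gives the agent a signal it can act on. The neighbour labels here still come from the same GNN, so they are a consistency check rather than a fully independent source; known neighbour labels or a second model would be more orthogonal. The fix is not "distrust the tool"; it is "give the agent something the tool's output can be wrong against."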
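
The neighbourhood signal can also be computed in the harness, so the agent receives an explicit number instead of having to infer disagreement from a raw list. This is a minimal sketch, assuming labels are plain strings; `neighbour_disagreement`, `needs_verification`, and the 0.5 threshold are hypothetical choices for illustration, not part of the cited setup.

```python
from collections import Counter
from typing import Sequence


def neighbour_disagreement(label: str, neighbour_labels: Sequence[str]) -> float:
    """Fraction of neighbours whose label differs from the predicted label.

    Returns 0.0 for an isolated node: with no neighbours there is no
    evidence against the prediction (and none for it either).
    """
    if not neighbour_labels:
        return 0.0
    counts = Counter(neighbour_labels)
    # Counter returns 0 for labels that never appear among the neighbours.
    return 1.0 - counts[label] / len(neighbour_labels)


def needs_verification(
    label: str, neighbour_labels: Sequence[str], threshold: float = 0.5
) -> bool:
    """Flag the prediction when more than `threshold` of neighbours disagree."""
    return neighbour_disagreement(label, neighbour_labels) > threshold
```

The threshold is an assumption and would need tuning per graph; on graphs where neighbouring nodes often carry different labels, a high disagreement score is weak evidence.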

When This Backfires

Treating every callable as suspect is its own anti-pattern. The "blind deference" framing is over-broad in four cases:

  • Tool more accurate than the agent on the modal case. For a verified deterministic tool (compiler, type checker, formatter) where the agent's domain reasoning is weaker, high agreement is the target behaviour, and second-guessing burns tokens for nothing.
  • No disambiguating signal in scope. If nothing could contradict the tool — no second tool, test, or spec — "add verification" is a slogan; Wang & Vemuri's "limited by available information" caveat applies (Wang & Vemuri, 2026).
  • Cheap downstream gate. When the tool feeds CI, code review, or a test suite, re-judging every call is belt-and-braces without adding safety.
  • Calibrated tool with confidence bands. A classifier returning (label, p) lets the harness route low-confidence cases without involving the agent — classifier-gated routing is the better lever.

The pattern is a problem specifically when (a) the tool has known unreliability bands, (b) no orthogonal signal is in scope, and (c) the harness pretends the agent is the second-opinion layer.
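
The calibrated-tool case can be handled in the harness before the agent sees anything. Here is a minimal sketch of classifier-gated routing, assuming the tool returns a `(label, p)` pair where `p` is a calibrated probability; `Route`, `route_prediction`, and the 0.9 threshold are illustrative names and values, not an API from any cited system.

```python
from enum import Enum
from typing import Tuple


class Route(Enum):
    ACCEPT = "accept"   # pass the tool's label through unchanged
    VERIFY = "verify"   # send to a second source, test, or human


def route_prediction(prediction: Tuple[str, float], threshold: float = 0.9) -> Route:
    """Route on the tool's own confidence; the agent is not consulted."""
    _label, p = prediction
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {p}")
    return Route.ACCEPT if p >= threshold else Route.VERIFY


# Hypothetical predictions (illustrative data only).
high = route_prediction(("cs.LG", 0.97))
low = route_prediction(("cs.LG", 0.55))
```

This only helps if `p` is actually calibrated; an overconfident classifier would send nearly everything down the accept path, reproducing the same deference one layer down.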

Key Takeaways

  • Wrapping a deterministic tool in an LLM agent does not add a judgment layer. Empirical agreement on a GNN-tool setup is 97.6-99.2%; treat the agent as a narrator, not a reviewer (Wang & Vemuri, 2026).
  • Capability scaling makes deference worse, not better: agreement climbs from 0.60 to 0.98 as Qwen2.5 scales from 1.5B to 7B. A bigger backbone is not the fix (Wang & Vemuri, 2026).
  • Verification needs an orthogonal signal. If nothing in scope could contradict the tool, add a second source or a calibrated confidence stream — prompting the agent to "be critical" of a tool it has no way to disagree with is performance, not safety.
  • The same shape applies to linters, type checkers, SAST scanners, and classifier sub-agents — anywhere a confident structured return reaches an agent without a second source.