Dynamic Tool Fetching Breaks KV Cache

Loading tool definitions dynamically per step seems like good context management but destroys the single most impactful cost optimization available: prompt caching.

The intuition trap

Fewer tools means fewer tokens, so fetching only the needed tools per step seems best. It is not. The savings from removing tools are far smaller than the cost of breaking prompt cache continuity.

Why it fails

Tool definitions sit at the top of the cache hierarchy. The model computes the prefix in order: tools → system → messages. Any change to tool definitions invalidates every level below it.

graph LR
    A["tools (top)"] --> B["system"] --> C["messages"]
    style A fill:#d32f2f,color:#fff
    style B fill:#f57c00,color:#fff
    style C fill:#fbc02d,color:#000
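The invalidation rule can be illustrated with a toy model. This is a minimal sketch, not how any provider implements its cache: it treats each level as one opaque segment, whereas a real prefix cache can also reuse the unchanged beginning of the message history. The function name `reusable_levels` and the sample contents are made up for illustration.

```python
from typing import List, Tuple

# Each request prefix is modelled as ordered (level, content) segments.
Segment = Tuple[str, str]


def reusable_levels(previous: List[Segment], current: List[Segment]) -> List[str]:
    """Return the levels that could still be served from a prefix cache.

    A level is reusable only if it and every level before it are identical
    to the previous request.
    """
    reusable = []
    for (level, old), (_, new) in zip(previous, current):
        if old != new:
            break  # the first mismatch invalidates this level and all later ones
        reusable.append(level)
    return reusable


turn_1 = [
    ("tools", "search,read,write"),
    ("system", "You are an agent."),
    ("messages", "user: hi"),
]
# Same tools and system prompt, one more message appended.
turn_2_stable = [
    ("tools", "search,read,write"),
    ("system", "You are an agent."),
    ("messages", "user: hi\nassistant: hello"),
]
# Tool list swapped for this step; system prompt and history unchanged.
turn_2_dynamic = [
    ("tools", "search,deploy"),
    ("system", "You are an agent."),
    ("messages", "user: hi"),
]

print(reusable_levels(turn_1, turn_2_stable))   # ['tools', 'system']
print(reusable_levels(turn_1, turn_2_dynamic))  # []
```

In the dynamic case the system prompt and history did not change at all, yet nothing is reusable because the mismatch occurs at the first level.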

Cached tokens cost 10 times less than uncached. Claude Sonnet 4's cache-read rate is $0.30/MTok against a $3/MTok base input (Anthropic prompt caching). A single cache break per turn erases all savings from fewer tools.
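A rough back-of-the-envelope comparison makes the trade-off concrete. The two rates come from the figures quoted above; the token counts (500 tokens per tool, 40,000 tokens of history) are hypothetical, and the sketch deliberately ignores cache-write surcharges and the uncached tokens of the newest turn.

```python
BASE_INPUT_PER_MTOK = 3.00  # USD per million uncached input tokens (quoted above)
CACHE_READ_PER_MTOK = 0.30  # USD per million cache-read tokens (quoted above)


def input_cost(prefix_tokens: int, cache_hit: bool) -> float:
    """Cost in USD of processing the prefix for one turn."""
    rate = CACHE_READ_PER_MTOK if cache_hit else BASE_INPUT_PER_MTOK
    return prefix_tokens / 1_000_000 * rate


# Hypothetical workload, for illustration only.
TOKENS_PER_TOOL = 500
history_tokens = 40_000

# 30 stable tools, prefix served from cache.
stable = input_cost(30 * TOKENS_PER_TOOL + history_tokens, cache_hit=True)
# Only 10 tools this step, but the changed tool list forces a full re-read.
dynamic = input_cost(10 * TOKENS_PER_TOOL + history_tokens, cache_hit=False)

print(f"stable:  ${stable:.4f} per turn")
print(f"dynamic: ${dynamic:.4f} per turn")
```

Under these assumptions the dynamic approach sends 10,000 fewer tokens per turn and still costs roughly eight times more, because every remaining token is billed at the uncached rate.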

| Approach | Tools in context | Cache hit rate | Effective cost |
| Stable tool set (30 tools) | 30 every turn | High | Low |
| Dynamic RAG per step | 5-15, varying | Near zero | High |
| Deferred loading (stable prefix) | 8-10 core + search | High | Lowest |

The subtle variant: non-deterministic serialization

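Some languages, Swift and Go among them, do not guarantee dictionary iteration order, so JSON serialization that does not sort keys can emit the same tools with a different key order on each call. The cache then sees a different byte sequence even when the tools are identical, which triggers the same anti-pattern by accident.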

Fix: sort the keys deterministically before serialization.
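One way to audit for this is to fingerprint the serialized tool list and compare it across turns. The sketch below is an assumption-laden illustration: `tools_fingerprint` and the sample tool are hypothetical, and it only normalizes key order inside each tool, not the order of the tools in the list.

```python
import hashlib
import json


def tools_fingerprint(tools: list) -> str:
    """Hash a canonical serialization: sorted keys, fixed separators."""
    canonical = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# The same tool, built with keys in a different order.
tool_a = {
    "name": "read_file",
    "description": "Read a file",
    "input_schema": {"type": "object", "properties": {}},
}
tool_b = {
    "input_schema": {"properties": {}, "type": "object"},
    "description": "Read a file",
    "name": "read_file",
}

# Naive serialization follows insertion order, so the bytes differ.
print(json.dumps(tool_a) == json.dumps(tool_b))                  # False
# Canonical serialization yields identical bytes and an identical hash.
print(tools_fingerprint([tool_a]) == tools_fingerprint([tool_b]))  # True
```

Logging the fingerprint on every request and alerting when it changes mid-session is a cheap way to catch accidental prefix drift.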

The correct alternative: deferred tool loading

Anthropic's Tool Search Tool reaches the same goal without breaking the cache prefix. Tools marked defer_loading: true stay out of the prompt, and the agent discovers them on demand.

Anthropic's evaluations:

| Metric | All tools loaded | Deferred + search |
| Token usage | ~55K | ~8.7K |
| Accuracy (Opus) | 49% | 74% |

The cache prefix stays identical across turns. Deferred tools load into message history, so they invalidate nothing.

Anthropic's advanced tool use guidance recommends grouping tools by how often you use them:

| Level | Contents | Cache impact |
| Core tools (3–5) | Most-used, always loaded | Cached prefix, never changes |
| General utilities | bash, code execution | Part of stable prefix |
| Specialized tools | Domain-specific, MCP servers | Deferred; loaded via search on demand |
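The tiering can be sketched as a small helper that assembles one tool list and marks only the specialized tier as deferred. This is an illustrative sketch: `build_tool_list` and the sample tool names are hypothetical, the `defer_loading` field is the one named above, and the search tool entry and API call that a real request would also need are not shown, so check Anthropic's documentation for those.

```python
from copy import deepcopy
from typing import List


def build_tool_list(
    core: List[dict], utilities: List[dict], specialized: List[dict]
) -> List[dict]:
    """Return stable tools first, then specialized tools marked as deferred."""
    # Core tools and general utilities form the stable, cacheable prefix.
    stable = [deepcopy(tool) for tool in core + utilities]

    deferred = []
    for tool in specialized:
        marked = deepcopy(tool)          # do not mutate the caller's definitions
        marked["defer_loading"] = True   # kept out of the prompt until discovered
        deferred.append(marked)

    return stable + deferred


# Hypothetical tool definitions, trimmed to names for brevity.
tools = build_tool_list(
    core=[{"name": "read_file"}, {"name": "edit_file"}],
    utilities=[{"name": "bash"}],
    specialized=[{"name": "jira_create_issue"}, {"name": "grafana_query"}],
)

for tool in tools:
    print(tool["name"], tool.get("defer_loading", False))
```

Building the list once at startup and reusing the same object on every call keeps the stable portion identical from turn to turn.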

When this backfires

Deferred loading adds a tool search round-trip for each undiscovered tool. It gives no benefit when:

  • Tool library is small (under 10 tools): upfront loading costs less than the repeated search overhead.
  • All tools are needed every request: deferring tools you always use forces a search penalty with no savings.
  • Latency is the main constraint: real-time pipelines may not tolerate extra inference passes for tool discovery.
  • Tool search accuracy is low: when search fails to surface the right tools, the hit to task completion outweighs what cache breaks would have cost.
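The conditions above can be folded into a simple pre-flight check. This is only a heuristic sketch: the `ToolProfile` fields and the function name are hypothetical, the threshold of 10 tools is the one quoted in the list, and the other inputs are judgments you would have to make for your own workload.

```python
from dataclasses import dataclass


@dataclass
class ToolProfile:
    tool_count: int
    all_tools_used_every_request: bool
    latency_critical: bool
    search_is_reliable: bool


def should_defer_tools(profile: ToolProfile) -> bool:
    """Return True only when none of the backfire conditions apply."""
    if profile.tool_count < 10:
        return False  # small library: upfront loading is cheaper than searching
    if profile.all_tools_used_every_request:
        return False  # nothing to save by deferring tools that are always needed
    if profile.latency_critical:
        return False  # extra discovery passes may not be tolerable
    if not profile.search_is_reliable:
        return False  # missed tools hurt more than the cache savings help
    return True


large_agent = ToolProfile(120, False, False, True)
small_agent = ToolProfile(6, False, False, True)

print(should_defer_tools(large_agent))  # True
print(should_defer_tools(small_agent))  # False
```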

When this does not apply

Stable tool sets are the right default for multi-turn agents. In a few cases, dynamic selection is fine:

  • Single-turn, cold-start requests: when every call is a fresh session with no prior cache, there is no accumulated prefix to protect. Cache continuity only pays off across turns.
  • Local inference without a shared KV cache: some self-hosted setups, for example llama.cpp or Ollama depending on configuration, do not reuse the KV cache across requests. Without cache reuse, the 10-times cost gap disappears.
  • Very small tool sets (under 5 tools, under 500 tokens total): when tool definitions are tiny next to the message history, the savings from cache hits may not justify a deferred-loading architecture.

In all other cases — multi-turn agents, API-hosted models, or any setup with repeated context — the cost gap dominates and dynamic per-step fetching works against you.

Key Takeaways

  • Any change to tool definitions invalidates the entire KV cache — continuity matters more than minimizing tool count.
  • Prefer deferred loading with a stable core set over dynamic RAG on tool definitions.
  • Audit JSON serialization for non-deterministic key ordering — an accidental cache-breaker.

Example

Anti-pattern — tool definitions change each turn, breaking the cache:

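# BAD: tool list rebuilt per step; cache prefix changes every call
import anthropic

client = anthropic.Anthropic()                  # reads ANTHROPIC_API_KEY from the environment

# plan, history, and fetch_tools_for_step are placeholders for your agent loop
for step in plan:
    tools = fetch_tools_for_step(step)          # different subset each time
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,                         # required by the Messages API
        tools=tools,                             # cache invalidated every turn
        messages=history,
    )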

Fix — stable core tools, deferred discovery via Tool Search:

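# GOOD: stable prefix; agent discovers specialized tools on demand
# client is an anthropic.Anthropic() instance; load_core_tools and history are placeholders
CORE_TOOLS = load_core_tools()                  # same every call

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,                            # required by the Messages API
    tools=CORE_TOOLS,                           # never changes → cache hits
    messages=history,
)
# Caching also needs a cache_control breakpoint on the prefix (omitted here);
# see the prompt caching docs.
# Specialized tools are fetched inside message history via Tool Search Tool,
# invalidating nothing above the messages layer.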

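Sorting tool keys deterministically also prevents accidental cache breaks when tool definitions are assembled with keys in varying order, or in languages with non-deterministic dict ordering:

import json

def stable_tool_schema(tool: dict) -> dict:
    # Round-trip through sorted JSON so every nested dict is rebuilt in sorted key order
    return json.loads(json.dumps(tool, sort_keys=True))

CORE_TOOLS = [stable_tool_schema(t) for t in load_core_tools()]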