The Think Tool¶

The think tool is a mid-stream reasoning checkpoint between tool calls, giving agents space to reflect on tool output before deciding the next action.

Related lesson: Reasoning Budget — The Sandwich covers this concept in a hands-on lesson with quizzes.

What the think tool does¶

The think tool fires between tool calls — after the agent receives a tool's output, before it decides what to do next. It differs from extended thinking, which reasons before the first generation token. Extended thinking is pre-action. The think tool is mid-stream: it fires after the agent has observed new information from the environment.

Anthropic's think tool post reports a 54% relative improvement over baseline on the τ-Bench airline domain benchmark, from adding the think tool plus tuned prompting. That is a large effect for a structural change that adds no new capabilities.

When it helps¶

The think tool adds value in sequential workflows where each step depends on the output of the previous one:

After receiving tool output that may change the plan, for example when a file does not exist or a test fails for an unexpected reason
Before a branching decision where different tool outputs require different next steps
When policy compliance needs explicit checking against what the tool returned
When the agent must reconcile multiple constraints before acting

It does not help when tool calls are independent and parallel, because there is nothing to reflect on between independent calls. See the anti-pattern on reasoning overuse.

How it works¶

The agent invokes the think tool as a regular tool call. The model writes a thought, keeps it in context, and draws on that reasoning to formulate the next action. The user does not see the thought.

The tool only fires when the model chooses to use it. If the task is simple or the next step is obvious, the model skips it, so token overhead scales with how often reflection is actually needed.

Why it works¶

Separating observation from action-selection forces implicit state into the context as text. A model that must immediately emit the next tool call carries unverified interpretations of the previous output in the residual stream. The think call turns those interpretations into tokens, so the model can check policy constraints and weigh candidate next steps before committing. This is the same mechanism as chain-of-thought prompting (Wei et al., 2022), applied at the inter-tool boundary. That is why τ-Bench airline tasks gained 54% while its retail domain — with lighter constraints — gained only 3.7%.

Token budget¶

The cost is the tokens each thought consumes. The practical tactic is to make the tool available but not mandatory, so the model self-selects when to use it. On tasks that need frequent reflection, the accuracy gain usually justifies the cost. On tasks where reflection is rarely needed, the overhead stays low.

System prompt requirements¶

The tool alone is not enough. Anthropic's post reports that a generic instruction yields modest gains, while a system prompt with explicit examples of good mid-stream reasoning in the target domain produces the largest gains. Monitor what the model writes and refine the prompt based on quality gaps.

Prefer extended thinking on modern Claude models¶

Anthropic now recommends extended thinking over a dedicated think tool in most cases. On Claude Sonnet and Opus 4.x, adaptive thinking scales reasoning depth to the difficulty of each step and further supersedes the pattern. Reach for the think tool when extended thinking is unavailable — older model versions, API tiers without access, or deployments where mid-stream checkpoints must be inspectable as discrete tool calls rather than hidden reasoning tokens.

When this backfires¶

The think tool adds cost without benefit in several conditions:

Modern Claude models with native reasoning: extended thinking and adaptive thinking subsume the think tool, so a custom implementation on these models is redundant
Parallel or independent tool calls: with no accumulated context to reconcile, a think call spends tokens without changing the decision
Low-constraint sequential tasks: the 54% gain is specific to high-constraint, multi-branch domains, and on τ-Bench's retail domain the gain was only 3.7%
Well-defined decision trees: when the system prompt already encodes the path, a think step can prompt the model to re-examine resolved choices and add unnecessary caveats
No domain-specific prompting: without concrete examples of good mid-stream reasoning in your domain, think output is verbose but empty
High-frequency loops: per-step token overhead accumulates fast and can outweigh accuracy gains on latency- or cost-sensitive pipelines

Example¶

The following tool definition adds the think tool to a Claude API request alongside a domain-specific Bash tool. The system prompt instructs the agent to use it at the right moment.

tools = [
    {
        "name": "think",
        "description": (
            "Use this tool to reason about what you just observed before deciding "
            "your next action. Call it after receiving unexpected tool output, "
            "before choosing between multiple possible next steps, or when you need "
            "to check whether a policy constraint applies to the current situation."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "thought": {
                    "type": "string",
                    "description": "Your reasoning about the current situation."
                }
            },
            "required": ["thought"]
        }
    },
    {
        "name": "bash",
        "description": "Run a shell command and return stdout/stderr.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"]
        }
    }
]

The system prompt pairs with the tool to guide when reflection is valuable:

After each bash result, call the think tool if:
- the output differs from what you expected
- you need to choose between two or more possible next commands
- you must verify a constraint before proceeding (e.g., confirming no destructive operation)

Without this prompt guidance, the model may invoke think too rarely on novel outputs. The tool definition and the system prompt together reproduce the conditions under which Anthropic observed the 54% benchmark improvement.

Key Takeaways¶

The think tool is mid-stream reasoning after tool output — distinct from extended thinking (pre-generation)
Adding the think tool plus domain-specific prompting produced a 54% relative improvement on τ-Bench airline tasks; the mechanism is explicit state materialization between observation and decision
Anthropic now recommends extended thinking (and adaptive thinking on Claude 4.x) over the dedicated think tool in most cases; the custom tool is most useful when native reasoning is unavailable
The tool is only invoked when the model judges reflection is needed; token cost scales with actual usage
It adds value in sequential workflows with interdependent steps; not in independent parallel tool calls
Domain-specific examples in the system prompt are required to realize the full performance gain
Avoid it on simple sequential tasks, well-defined decision trees, or sub-second latency pipelines where the cost outweighs the benefit