
Reasoning Budget Allocation: The Reasoning Sandwich

Allocate maximum reasoning compute to the planning and verification phases and reduced compute to execution, rather than using a fixed level throughout.


The pattern

Not all steps in an agent workflow need the same depth of reasoning. Planning and verification are high-stakes. Execution is largely mechanical.

LangChain's deep agent experiments tested a "reasoning sandwich" — extra-high at planning, high at execution, extra-high at verification (xhigh-high-xhigh). It scored highest on Terminal Bench 2.0 (66.5%), beating both continuous maximum reasoning (53.9%, penalized by timeouts) and uniform high reasoning (63.6%).

graph LR
    A[Planning<br/>Extra-high compute] --> B[Execution<br/>High compute]
    B --> C[Verification<br/>Extra-high compute]
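
The gaps between the three configurations can be worked out directly from the reported scores. This is a minimal arithmetic sketch; the dictionary keys are labels chosen here for illustration, and only the three percentages come from the text above.

```python
# Completion rates on Terminal Bench 2.0 as reported above (percent).
scores = {
    "sandwich (xhigh-high-xhigh)": 66.5,
    "uniform high": 63.6,
    "continuous maximum": 53.9,
}

# Configuration with the highest completion rate.
best = max(scores, key=scores.get)

# Distance of every configuration from the best one, in percentage points.
gaps = {name: round(scores[best] - value, 1) for name, value in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {scores[name]}% ({gap} points behind the best)")
```

The sandwich leads uniform high by 2.9 percentage points and continuous maximum by 12.6.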

Phase breakdown

Planning gets extra-high compute. Map the problem space: requirements, approach, and risks. Errors here propagate through every later step.

Execution gets high compute. Follow the plan: write code and run commands. Reduced compute handles mechanical steps while lowering per-step cost.

Verification gets extra-high compute. Check output against requirements and run tests. A missed failure produces false completion.

Dual-mode operation

The OPENDEV paper builds the sandwich into the architecture through two modes, plan and normal (Bui, 2026 §2.2.2).

Mode switches are triggered by an explicit command (/plan) or by planning-intent heuristics. This maps to the sandwich: plan mode (extra-high compute), then normal-mode execution (high), then verification (extra-high).
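
A mode selector along these lines might look like the sketch below. It is an illustration only: the cue words, the `select_mode` name, and the word-matching rule are assumptions made here, not the heuristics the paper uses.

```python
import re

PLAN_COMMAND = "/plan"
# Hypothetical intent cues; the paper's actual heuristics are not reproduced here.
PLANNING_CUES = frozenset({"plan", "design", "architecture", "approach", "strategy"})


def select_mode(user_input: str) -> str:
    """Return "plan" or "normal" for a single user message."""
    text = user_input.strip().lower()
    # An explicit command always wins over the heuristic.
    if text == PLAN_COMMAND or text.startswith(PLAN_COMMAND + " "):
        return "plan"
    # Match whole words so that "explain" does not count as "plan".
    words = set(re.findall(r"[a-z]+", text))
    return "plan" if words & PLANNING_CUES else "normal"
```

A keyword heuristic like this misclassifies easily, so a real harness would likely keep the explicit command as the reliable path and treat the heuristic as a convenience.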

An optional thinking phase adds a separate inference call using a dedicated Thinking model before action selection (Bui, 2026 §2.2.6). It amplifies any phase that needs deeper reasoning.

Extended thinking budget triggers

Extended thinking allocates a dedicated reasoning budget before the model generates its response — distinct from the think tool, which reasons mid-stream between tool calls.

In Claude Code, adding "ultrathink" to a skill's content turns on extended thinking, which allocates maximum thinking tokens for that skill.

Maximum thinking as a cost-performance tradeoff

A community analysis treats maximum thinking on a balanced model as an alternative to model-tier upgrades: exhausting the thinking budget on a cheaper model can cost less than switching tiers. Evaluate this tradeoff before you move to a higher-cost tier for reasoning-heavy tasks.
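
One way to evaluate the tradeoff is to compare per-call cost under both options. The sketch below uses placeholder prices and token counts that are not real rates, and it assumes thinking tokens are billed at the output-token rate; substitute your provider's numbers before drawing conclusions.

```python
def call_cost(input_tokens: int, output_tokens: int, thinking_tokens: int,
              input_price: float, output_price: float) -> float:
    """Cost of one call in dollars, with prices given per million tokens.

    Assumption: thinking tokens are billed at the output rate.
    """
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000


# Placeholder scenario: same prompt and answer length, different tier and thinking budget.
balanced_max_thinking = call_cost(20_000, 2_000, 30_000, input_price=3.0, output_price=15.0)
higher_tier_light_thinking = call_cost(20_000, 2_000, 4_000, input_price=15.0, output_price=75.0)

print(f"balanced model, max thinking: ${balanced_max_thinking:.2f}")
print(f"higher tier, light thinking:  ${higher_tier_light_thinking:.2f}")
```

With these placeholder inputs the balanced model is cheaper, but the result flips if the thinking budget grows large enough or the price gap between tiers is small, which is why the comparison is worth running per workload.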

This stacks with other techniques:

  • Extended thinking — maximum reasoning tokens via a trigger keyword
  • Plan mode — structured planning before execution
  • Iterative critique — systematic self-review cycles to catch edge cases

Each layer adds cost. Combine them when the task warrants the investment.

Applying budget triggers

  • Claude Code skills: include "ultrathink" in SKILL.md content to turn on extended thinking
  • Claude API: set the thinking budget (budget_tokens) per call, high for planning and verification and standard for execution
  • Any tool with model routing: route planning and verification to a capable model, execution to a cheaper one
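
For the model-routing option, a routing table can be as small as the sketch below. The model names are placeholders rather than real identifiers, and the budgets are illustrative values; both are assumptions to replace with whatever your tool exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    thinking_budget: int  # reasoning tokens allowed for the call


# Hypothetical model names and budgets, for illustration only.
ROUTES = {
    "planning": Route("capable-model", 10_000),
    "execution": Route("cheaper-model", 2_000),
    "verification": Route("capable-model", 10_000),
}


def route_for(phase: str) -> Route:
    """Look up the route for a phase, failing loudly on an unknown phase."""
    try:
        return ROUTES[phase]
    except KeyError:
        raise ValueError(f"unknown phase: {phase!r}") from None
```

Failing on an unknown phase, rather than falling back to a default route, makes misclassified steps visible instead of silently running them at the wrong tier.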

For tools without per-call configuration, approximate through prompt structure: more reasoning guidance in planning prompts, less in execution.
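
The prompt-structure approximation could be sketched as follows. The guidance strings and the `build_prompt` helper are hypothetical wording written for this example, not text from any tool.

```python
# Hypothetical guidance text: detailed for planning and verification, terse for execution.
GUIDANCE = {
    "planning": (
        "Before answering, list the requirements, compare at least two approaches, "
        "and name the main risks of the approach you choose."
    ),
    "execution": "Follow the plan step by step. Do not revisit decisions made in the plan.",
    "verification": (
        "Check each requirement one by one against the output, "
        "and state the evidence for every pass or fail."
    ),
}


def build_prompt(phase: str, task: str) -> str:
    """Prefix the task with the reasoning guidance for the given phase."""
    return f"{GUIDANCE[phase]}\n\nTask: {task}"


print(build_prompt("planning", "Add retry logic to the upload client"))
```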

Why it works

Different phases make structurally different cognitive demands (Bui, 2026 §2.2.5). Planning explores the possibility space and accounts for requirements, edge cases, and risks; errors here propagate downstream. Execution follows a decided plan and is largely mechanical. Verification compares output against requirements precisely, where a missed failure produces false completion. Uniform maximum compute on execution wastes budget on mechanical steps. As the LangChain benchmark showed, it also causes timeouts that lower completion rates. Concentrating compute where ambiguity is highest balances cost against quality.

When to apply

The sandwich pays off when:

  • Tasks have discrete planning, execution, and verification phases
  • Planning mistakes are costly to repair after implementation begins
  • Verification failures would be falsely reported as success

Single-step tasks and independent parallel tool calls see no benefit from added reasoning overhead.

When this backfires

The 2.9-percentage-point gap between the sandwich (66.5%) and uniform high (63.6%) does not always justify the added harness complexity. The sandwich is worse than uniform compute when:

  • Phases are not cleanly separable. Exploratory debugging and interleaved planning and execution force misclassified routing, so the sandwich degrades to noisy uniform compute with routing overhead.
  • Mode switching adds more bugs than it prevents. Teams without the budget for reliable planner, executor, and verifier routing do better with a single tier at uniform high reasoning.
  • Verification is cheap relative to planning. When tests or types check correctness, extra-high model-based verification duplicates what the test harness already does.
  • Execution dominates the trajectory. Bulk refactors and migrations spend most tokens in execution; reducing compute there saves little, while planning and verification contribute a small share of cost.

Key Takeaways

  • Planning and verification warrant extra-high reasoning compute; execution warrants high.
  • The sandwich achieved the highest completion rate (66.5%) in LangChain benchmarks, outperforming continuous maximum reasoning (53.9%, penalized by timeouts) and uniform high reasoning (63.6%).
  • Extended thinking triggers (e.g., "ultrathink" in Claude Code skills) front-load reasoning before generation — distinct from mid-stream think tool reasoning.
  • Maximum-thinking on a balanced model may cost less than a tier upgrade for reasoning-heavy tasks — evaluate before switching tiers.
  • Stack extended thinking with plan mode and iterative critique for tasks that warrant the added cost.
  • Dual-mode operation (plan/normal) enforces the sandwich architecturally by restricting tool access per phase.

Example

A Claude API implementation routing by phase:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-5"


def text_of(message) -> str:
    # With thinking enabled, thinking blocks come before the text blocks,
    # so select blocks by type instead of by index.
    return "".join(block.text for block in message.content if block.type == "text")


def ask(prompt: str, budget_tokens: int) -> str:
    message = client.messages.create(
        model=MODEL,
        # max_tokens is required and must exceed the thinking budget.
        max_tokens=budget_tokens + 4000,
        thinking={"type": "enabled", "budget_tokens": budget_tokens},
        messages=[{"role": "user", "content": prompt}],
    )
    return text_of(message)


def run_sandwich(task: str) -> str:
    # Planning: extra-high thinking budget
    plan = ask(f"Plan: {task}", budget_tokens=10000)

    # Execution: standard thinking budget
    result = ask(f"Execute plan:\n{plan}\nTask: {task}", budget_tokens=2000)

    # Verification: extra-high thinking budget, checked against the original task
    return ask(
        f"Verify result meets requirements:\nTask: {task}\nResult:\n{result}",
        budget_tokens=10000,
    )

In Claude Code skills, add ultrathink to the skill content for planning and verification skills, and omit it for execution skills.
