Cloud-Agent Tiered Model Routing

Route a cloud-agent session to the cheap tier only when the task is scoped, the team has per-tier quality telemetry, and rework cost is bounded — the multiplier savings vanish the moment a Haiku-tier session is re-dispatched at Sonnet, and the documented router has no escalation path.

Cloud-agent tiered model routing assigns each session-scope task to a capability tier — frontier, standard, or fast/cheap — at dispatch time, before the agent claims the issue. GitHub's Copilot cloud agent ships this as a per-session model picker after the 2026-05-18 changelog added Claude Haiku 4.5 and GPT-5.4 mini at a 0.33x multiplier (GitHub Changelog 2026-05-18). Billing is session-scope, one premium request per session at the model's multiplier (GitHub Docs: Copilot requests) — the tier decision is per-task economics, not per-turn.

The cloud-agent variant is narrower than three nearby axes: Utility-Model Split splits within one user turn, Auto Model Selection is vendor-side per-request brokering, and Cost-Aware Agent Design tiers per-task across an in-process harness. Here, the operator picks the tier once, the session runs end-to-end on that tier, and no in-session escalation is documented.

Four Conditions for the Cheap Tier

All four must hold, or the cheap default is a net loss:

  • Bounded task scope. Cheap-tier sessions fit dependency bumps, changelog wording, small refactors, and single-issue fixes — and exclude security-critical work, architectural decisions, and large migrations (Igor's Lab, 2026-05-19).
  • Per-tier quality telemetry. Without PR acceptance rate, retry rate, and reviewer-rejection rate broken down by model_id, regressions hide behind the savings — the "silent quality degradation" failure (Tianpan: LLM Routing).
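  • Bounded rework cost. A cheap session that is re-dispatched at Sonnet costs 0.297 + 0.9 = 1.197 premium requests vs 0.9 for pinning Sonnet up front. On premium requests alone, the expected cost 0.297 + 0.9p only passes 0.9 at a failure rate p of about 67%; the ~25% ceiling used here is stricter, which leaves room for the reviewer time each failed session also consumes. Above that ceiling, treat the cheap default as more expensive than the safe one.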
  • Picker exposed at the entrypoint. Model selection is "only supported when assigning an issue to Copilot on GitHub.com, when mentioning @copilot in a pull request comment on GitHub.com, or when starting a task from the agents tab, agents panel, GitHub Mobile or the Raycast launcher. Where a model picker is not available, Auto will be used automatically" (GitHub Docs: Changing the AI model).
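The bounded-rework condition can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: the two cost constants come from the multiplier table in this article, while `review_overhead` is a hypothetical parameter (reviewer time per failed session, expressed in premium-request equivalents) that the billing docs do not define.

```python
# Per-session costs in premium requests, as quoted in this article
# (Auto-discounted values).
CHEAP_COST = 0.297  # Haiku 4.5 / GPT-5.4 mini
SAFE_COST = 0.9     # Sonnet-class


def expected_cheap_first_cost(failure_rate: float, review_overhead: float = 0.0) -> float:
    """Expected cost when a failed cheap session is re-dispatched once at the safe tier.

    review_overhead is an assumed, team-specific cost of the wasted review
    round, in premium-request equivalents. It is not a GitHub billing concept.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    return CHEAP_COST + failure_rate * (SAFE_COST + review_overhead)


def break_even_failure_rate(review_overhead: float = 0.0) -> float:
    """Failure rate at which cheap-first costs the same as pinning the safe tier."""
    return (SAFE_COST - CHEAP_COST) / (SAFE_COST + review_overhead)


if __name__ == "__main__":
    for overhead in (0.0, 0.5, 1.5):
        rate = break_even_failure_rate(overhead)
        print(f"review overhead {overhead}: break-even failure rate {rate:.0%}")
```

With `review_overhead` at zero the break-even is roughly 67%; the threshold falls toward the ~25% ceiling as the assumed review cost per failure grows.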

Tiers and Multiplier Math

The cloud agent currently exposes Auto, Sonnet 4.5, Opus 4.7, Haiku 4.5, GPT-5.2-Codex, and GPT-5.4 mini (GitHub Docs: Changing the AI model).

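| Model                   | Multiplier | Per session under Auto (−10%) |
|-------------------------|------------|-------------------------------|
| Claude Haiku 4.5        | 0.33       | 0.297                         |
| GPT-5.4 mini            | 0.33       | 0.297                         |
| Claude Sonnet 4.5 / 4.6 | 1          | 0.9                           |
| GPT-5.2-Codex / GPT-5.4 | 1          | 0.9                           |
| Claude Opus 4.7         | 15         | 13.5                          |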

Source: GitHub Docs: Copilot requests. Each @copilot steering comment also bills at the session's tier, so a Haiku session with five rounds (5 × 0.33 = 1.65) costs more than a clean Sonnet session (1 × 1 = 1.0).
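The steering-comment arithmetic above generalises to a simple round budget. This is a minimal sketch that assumes, following the article's 5 × 0.33 example, that every billed round costs exactly one multiplier; the function names are illustrative and not part of any GitHub API.

```python
import math


def session_cost(multiplier: float, billed_rounds: int) -> float:
    """Premium requests for a session in which each billed round costs one multiplier."""
    if billed_rounds < 1:
        raise ValueError("a session has at least one billed round")
    return multiplier * billed_rounds


def max_cheaper_rounds(cheap_multiplier: float, safe_multiplier: float, safe_rounds: int = 1) -> int:
    """Largest number of cheap-tier rounds that still costs strictly less than the safe session."""
    budget = session_cost(safe_multiplier, safe_rounds)
    # ceil(...) - 1 keeps the comparison strict when the division is exact.
    return max(math.ceil(budget / cheap_multiplier) - 1, 0)


if __name__ == "__main__":
    print(session_cost(0.33, 5))          # five Haiku rounds
    print(session_cost(1.0, 1))           # one clean Sonnet round
    print(max_cheaper_rounds(0.33, 1.0))  # round budget for the cheap tier
```

Under these assumptions a 0.33x session stays cheaper than a single 1x round for at most three billed rounds, which gives a concrete cut-off for abandoning a cheap session that keeps needing steering.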

Routing Signals Before Dispatch

The cloud agent ships no automatic task-complexity classifier — the task-optimised Auto variant is "generally available in Copilot Chat in VS Code" only (GitHub Docs: Auto Model Selection). For cloud-agent sessions, the operator is the classifier. File count and reviewer history dominate: single-file edits and dependency bumps map to the cheap tier; multi-file refactors and cross-module renames do not. If the team's last 10 Haiku PRs landed in one round each, the next one likely will too; if three needed rework, raise the tier. When in doubt, default up — misrouting up wastes inference, misrouting down wastes review time.

graph TD
    I[Issue assigned] --> S{Bounded scope?}
    S -->|No| F[Pin Sonnet or Opus]
    S -->|Yes| T{Quality telemetry?}
    T -->|No| F
    T -->|Yes| R{Rework rate<br>under 25%?}
    R -->|No| F
    R -->|Yes| C[Pick Haiku 4.5<br>or GPT-5.4 mini]
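
The flowchart above can be written down as a small decision function for a runbook or a pre-dispatch checklist. This is a sketch under stated assumptions: `TaskSignals`, `pick_tier`, and the returned labels are hypothetical names, and the result is only advice to the human operator, since the article notes that the cloud agent exposes no programmatic model pinning.

```python
from dataclasses import dataclass
from typing import Optional

CHEAP = "cheap (Haiku 4.5 or GPT-5.4 mini)"
SAFE = "pin Sonnet or Opus"


@dataclass(frozen=True)
class TaskSignals:
    """Inputs the operator gathers before dispatch (hypothetical structure)."""
    bounded_scope: bool
    # Observed cheap-tier rework rate for this task class, or None when the
    # team has no per-tier quality telemetry yet.
    rework_rate: Optional[float]


def pick_tier(signals: TaskSignals, max_rework_rate: float = 0.25) -> str:
    """Mirror the flowchart: any failed check falls through to the safe tier."""
    if not signals.bounded_scope:
        return SAFE
    if signals.rework_rate is None:
        return SAFE
    if signals.rework_rate >= max_rework_rate:
        return SAFE
    return CHEAP


if __name__ == "__main__":
    print(pick_tier(TaskSignals(bounded_scope=True, rework_rate=0.10)))
    print(pick_tier(TaskSignals(bounded_scope=True, rework_rate=None)))
    print(pick_tier(TaskSignals(bounded_scope=False, rework_rate=0.05)))
```

Every branch defaults up rather than down, matching the guidance that misrouting up wastes inference while misrouting down wastes review time.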

Why It Works

LLM pricing spans two orders of magnitude across tiers, but capability scales sub-linearly with price, so most queries do not need frontier capability (Tianpan: LLM Routing). For short, scoped coding tasks the capability floor sits well below the frontier: Anthropic claims Haiku 4.5 "delivers similar levels of coding performance to Sonnet 4 but at one-third the cost and more than twice the speed" (Anthropic: Claude Haiku 4.5). FrugalGPT reports the upper bound for the automated cascade form: up to 98% cost reduction while matching GPT-4 quality. The cloud-agent variant is the manual, human-classified instance.

When This Backfires

  • No in-session escalation. A failed cheap-tier PR is caught at human review, after the premium request has billed. Re-dispatching at Sonnet pays both multipliers (~1.2 vs 0.9).
  • Router collapse. arxiv:2602.03478 shows that "routers systematically default to the most capable and most expensive model even when cheaper models already suffice" as cost budgets rise — the human picker reverts to the safe default under shipping pressure.
  • Mid-session swap regressions. GitHub warns "Switching models mid-session has shown increased cost without ample improvements in quality" (GitHub Docs: Auto Model Selection). One VS Code reporter saw "repeated mistakes on things I'd corrected multiple times" from Auto's silent swaps, resolved only by pinning Sonnet (microsoft/vscode#285064).
  • Inherited triggers bypass the picker. Webhook automation and third-party orchestrators fall through to Auto's reliability-only variant, which optimises for pool health, not task fit.
  • Long-context multi-file refactors. The benchmarks behind Anthropic's Sonnet 4 comparison favour short-context tasks; long-context, multi-file work, which is the canonical cloud-agent workload, is where the capability gap widens.

Example

A platform team splits cloud-agent dispatch into two issue classes:

  • Type A (dependency bumps, changelog wording): bounded scope, single-file edits, reviewer-rejection rate under 10% on the last 20 Haiku PRs → pin Claude Haiku 4.5 (0.297 premium requests per session).
  • Type B (cross-service refactors with API contract changes): multi-file edits, rejection rate ~35% on past Haiku PRs → pin Claude Sonnet 4.5 (0.9 per session).

The dispatch rule lives in the team's runbook, not in code — the cloud agent has no programmatic model-pinning API at session start. The runbook tracks the cheap-tier failure rate monthly; when it crosses 25% on a given class, that class moves to the Sonnet default until the rate recovers.
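
The monthly runbook check could be automated along the following lines. This is a sketch, not a description of any GitHub feature: it assumes the team already exports PR outcomes tagged with issue class and model_id from its own tooling, and the record type, function names, and the "claude-haiku-4.5" identifier are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set


@dataclass(frozen=True)
class PullRequestOutcome:
    """One cloud-agent PR as recorded by the team's own telemetry (hypothetical schema)."""
    issue_class: str  # e.g. "A" for dependency bumps, "B" for refactors
    model_id: str     # model the session ran on
    rejected: bool    # True if the reviewer rejected the PR


def cheap_rejection_rates(
    outcomes: Iterable[PullRequestOutcome], cheap_model_ids: Set[str]
) -> Dict[str, float]:
    """Reviewer-rejection rate per issue class, counting cheap-tier sessions only."""
    totals: Dict[str, int] = defaultdict(int)
    rejected: Dict[str, int] = defaultdict(int)
    for outcome in outcomes:
        if outcome.model_id not in cheap_model_ids:
            continue
        totals[outcome.issue_class] += 1
        if outcome.rejected:
            rejected[outcome.issue_class] += 1
    return {cls: rejected[cls] / totals[cls] for cls in totals}


def classes_to_move_up(rates: Dict[str, float], threshold: float = 0.25) -> Set[str]:
    """Issue classes whose cheap-tier rejection rate has crossed the threshold."""
    return {cls for cls, rate in rates.items() if rate > threshold}


if __name__ == "__main__":
    cheap_ids = {"claude-haiku-4.5"}  # hypothetical identifier
    history = (
        [PullRequestOutcome("A", "claude-haiku-4.5", i < 1) for i in range(20)]
        + [PullRequestOutcome("B", "claude-haiku-4.5", i < 7) for i in range(20)]
    )
    rates = cheap_rejection_rates(history, cheap_ids)
    print(rates)
    print(classes_to_move_up(rates))
```

The output only flags classes for the runbook; a human still applies the new default in the model picker at dispatch time.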

Key Takeaways

  • The Copilot cloud agent's cheap tier (Haiku 4.5, GPT-5.4 mini at 0.33x) is operator-dispatched, not classifier-dispatched — the human picker is the routing signal.
  • Four conditions have to hold together: bounded scope, per-tier quality telemetry, a cheap-tier failure rate under ~25% so that rework cost stays bounded, and a picker-exposed entrypoint.
  • No documented in-session escalation; a failed cheap-tier PR re-dispatched at Sonnet costs ~1.2 premium requests vs 0.9 for pinning Sonnet up front.
  • Each steering comment bills at the session's tier — multi-round cheap-tier sessions can exceed a clean single-round frontier session.
  • Track per-model_id PR acceptance, retry, and reviewer-rejection rates — without that telemetry, regressions hide behind the multiplier savings.