Cloud-Agent Tiered Model Routing

Route a cloud-agent session to the cheap tier only when the task is scoped, the team has per-tier quality telemetry, and rework cost is bounded — the multiplier savings vanish the moment a Haiku-tier session is re-dispatched at Sonnet, and the documented router has no escalation path.

Cloud-agent tiered model routing assigns each session-scope task to a capability tier (frontier, standard, or fast/cheap) at dispatch time, before the agent claims the issue. GitHub's Copilot cloud agent exposes this as a per-session model picker, which gained Claude Haiku 4.5 and GPT-5.4 mini at a 0.33x multiplier in the 2026-05-18 changelog (GitHub Changelog 2026-05-18). Billing is session-scoped: a session costs one premium request times the model's multiplier (GitHub Docs: Copilot requests), and @copilot steering comments bill at that same multiplier, so the tier decision is per-task economics, not per-turn.

The cloud-agent variant is narrower than three neighbouring patterns: Utility-Model Split splits models within one user turn, Auto Model Selection is vendor-side per-request brokering, and Cost-Aware Agent Design tiers per task across an in-process harness. Here, the operator picks the tier once, the session runs end-to-end on that tier, and no in-session escalation is documented.

Four Conditions for the Cheap Tier

All four must hold, or the cheap default is a net loss:

Bounded task scope. Cheap-tier sessions fit dependency bumps, changelog wording, small refactors, and single-issue fixes — and exclude security-critical work, architectural decisions, and large migrations (Igor's Lab, 2026-05-19).
Per-tier quality telemetry. Without PR acceptance rate, retry rate, and reviewer-rejection rate broken down by model_id, regressions hide behind the savings — the "silent quality degradation" failure (Tianpan: LLM Routing).
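Bounded rework cost. A cheap session that fails and is re-dispatched at Sonnet costs 0.33 + 1 = 1.33 premium requests vs 1.0 for pinning Sonnet up front. With a cheap-tier failure rate p, cheap-by-default averages 0.33 + p premium requests per task, so on billing alone it only loses to pinning Sonnet above p ≈ 67%. The ~25% ceiling used in this pattern is therefore much stricter than the billing break-even; it only makes sense as a budget for the reviewer time each failed PR burns, which the multiplier does not bill.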
Picker exposed at the entrypoint. Model selection is "only supported when assigning an issue to Copilot on GitHub.com, when mentioning @copilot in a pull request comment on GitHub.com, or when starting a task from the agents tab, agents panel, GitHub Mobile or the Raycast launcher. Where a model picker is not available, Auto will be used automatically" (GitHub Docs: Changing the AI model).
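The rework-cost condition reduces to a little expected-value arithmetic. The sketch below is a minimal model, assuming each failed cheap-tier session is re-dispatched exactly once at Sonnet and that the re-dispatch succeeds; `review_cost`, which prices reviewer time on a failed PR in premium-request equivalents, is a hypothetical knob, not something GitHub bills. On premium requests alone the break-even failure rate is (1 − 0.33) / 1 ≈ 67%; a cutoff as low as ~25% only follows once each failed PR is charged roughly 1.7 premium-request equivalents of review time.

```python
# Sketch: expected premium-request cost of defaulting to the cheap tier.
# Assumptions: a failed cheap session is re-dispatched once at the safe tier
# and that re-dispatch succeeds. `review_cost` is a hypothetical price for
# reviewer time per failed PR, in premium-request equivalents.

CHEAP = 0.33  # pinned Haiku 4.5 / GPT-5.4 mini multiplier
SAFE = 1.0    # pinned Sonnet 4.5 multiplier


def expected_cheap_default_cost(failure_rate: float, review_cost: float = 0.0) -> float:
    """Every task pays the cheap session; failures also pay a safe re-dispatch and review."""
    return CHEAP + failure_rate * (SAFE + review_cost)


def break_even_failure_rate(review_cost: float = 0.0) -> float:
    """Failure rate at which cheap-by-default costs the same as pinning SAFE up front."""
    return (SAFE - CHEAP) / (SAFE + review_cost)


if __name__ == "__main__":
    print(f"{break_even_failure_rate():.2f}")      # prints 0.67 (billing only)
    print(f"{break_even_failure_rate(1.68):.2f}")  # prints 0.25 (review time priced in)
```

With `review_cost=0`, a 25% failure rate gives an expected 0.58 premium requests per task against 1.0 for pinning Sonnet, which suggests the ~25% figure is better read as a review-time budget than as a billing break-even.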

Tiers and Multiplier Math

The cloud agent currently exposes Auto, Sonnet 4.5, Opus 4.7, Haiku 4.5, GPT-5.2-Codex, and GPT-5.4 mini (GitHub Docs: Changing the AI model).

Model	Multiplier (pinned session)	Auto-selected session (10% discount)
Claude Haiku 4.5	0.33	0.297
GPT-5.4 mini	0.33	0.297
Claude Sonnet 4.5 / 4.6	1	0.9
GPT-5.2-Codex / GPT-5.4	1	0.9
Claude Opus 4.7	15	13.5

Source: GitHub Docs: Copilot requests. Each @copilot steering comment also bills at the session's tier, so a Haiku session with five billed rounds (the initial dispatch plus four steering comments: 5 × 0.33 = 1.65) costs more than a clean single-round Sonnet session (1 × 1 = 1.0).
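The steering-comment arithmetic generalises to a simple crossover count. This is a sketch assuming every billed round (the initial dispatch plus each @copilot steering comment) is charged at the session's pinned multiplier from the table above; the helper names are illustrative only and not part of any GitHub API.

```python
# Sketch: premium-request cost of a session whose every billed round
# (initial dispatch + each steering comment) is charged at one multiplier.
HAIKU_45 = 0.33   # pinned Claude Haiku 4.5 multiplier
SONNET_45 = 1.0   # pinned Claude Sonnet 4.5 multiplier


def session_cost(multiplier: float, billed_rounds: int) -> float:
    """Total premium requests for a session with `billed_rounds` billed rounds."""
    return multiplier * billed_rounds


def rounds_to_exceed(cheap: float, baseline: float) -> int:
    """Smallest number of cheap-tier rounds whose total cost is above `baseline`."""
    if cheap <= 0:
        raise ValueError("multiplier must be positive")
    rounds = 1
    while session_cost(cheap, rounds) <= baseline:
        rounds += 1
    return rounds


clean_sonnet = session_cost(SONNET_45, 1)
print(round(session_cost(HAIKU_45, 5), 2))      # prints 1.65
print(rounds_to_exceed(HAIKU_45, clean_sonnet))  # prints 4
```

On these numbers, a Haiku session that needs three or more steering comments after dispatch (four billed rounds) already costs more than a clean Sonnet session.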

Routing Signals Before Dispatch

The cloud agent ships no automatic task-complexity classifier: the task-optimised Auto variant is "generally available in Copilot Chat in VS Code" only (GitHub Docs: Auto Model Selection). For cloud-agent sessions, the operator is the classifier, and two signals carry most of the weight: file count and reviewer history. Single-file edits and dependency bumps map to the cheap tier; multi-file refactors and cross-module renames do not. If the team's last 10 Haiku PRs each landed in one review round, the next one likely will too; if three of them needed rework, raise the tier. When in doubt, default up: misrouting up wastes inference, while misrouting down wastes review time.

```mermaid
graph TD
    I[Issue assigned] --> S{Bounded scope?}
    S -->|No| F[Pin Sonnet or Opus]
    S -->|Yes| T{Quality telemetry?}
    T -->|No| F
    T -->|Yes| R{Rework rate<br>under 25%?}
    R -->|No| F
    R -->|Yes| C[Pick Haiku 4.5<br>or GPT-5.4 mini]
```
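The routing flow can be written down as a small helper, which makes the runbook's thresholds explicit. This is a minimal sketch: the `Task` fields, the one-file scope limit, and the 10-PR window are assumptions drawn from the heuristics above, not a GitHub feature. The cloud agent does not call anything like this; the operator applies the rule by hand in the picker.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Task:
    """Hypothetical pre-dispatch description of an issue."""
    files_touched: int
    cross_module: bool = False
    security_critical: bool = False


def bounded_scope(task: Task, max_files: int = 1) -> bool:
    # Assumed scope rule: small, local, and not security-critical.
    return (task.files_touched <= max_files
            and not task.cross_module
            and not task.security_critical)


def pick_tier(task: Task,
              recent_cheap_rework: Optional[Sequence[bool]],
              window: int = 10,
              max_rework_rate: float = 0.25) -> str:
    """Return "cheap" only when every check passes; otherwise default up."""
    if not bounded_scope(task):
        return "standard-or-frontier"
    if not recent_cheap_rework:  # no per-tier telemetry yet
        return "standard-or-frontier"
    recent = list(recent_cheap_rework)[-window:]  # True = PR needed rework
    rework_rate = sum(recent) / len(recent)
    if rework_rate >= max_rework_rate:
        return "standard-or-frontier"
    return "cheap"


bump = Task(files_touched=1)
print(pick_tier(bump, [False] * 10))              # prints cheap
print(pick_tier(bump, [True] * 3 + [False] * 7))  # prints standard-or-frontier
print(pick_tier(Task(files_touched=6, cross_module=True), [False] * 10))
```

Three reworked PRs in the last ten give a 30% rework rate, which fails the under-25% check and routes the task up, matching the prose rule.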

Why It Works

LLM pricing spans two orders of magnitude across tiers, but capability scales sub-linearly with price, and most queries do not need frontier capability (Tianpan: LLM Routing). For short, scoped coding tasks the capability floor sits well below the frontier: Anthropic claims Haiku 4.5 "delivers similar levels of coding performance to Sonnet 4 but at one-third the cost and more than twice the speed" (Anthropic: Claude Haiku 4.5). For the automated cascade form, FrugalGPT reports up to 98% cost reduction while matching GPT-4 performance on its benchmarks. The cloud-agent variant is a manual, human-classified instance of the same idea, without the cascade's automatic fallback to a stronger model.
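For contrast, the automated cascade form that FrugalGPT evaluates tries the cheapest model first and escalates only when a scorer rejects the answer. The sketch below is a generic illustration of that idea with stand-in functions; `cheap_model`, `standard_model`, the scorers, and the threshold are all hypothetical, and the Copilot cloud agent documents no equivalent automatic fallback.

```python
from typing import Callable, Sequence, Tuple

# (tier name, cost per call, generate function); stand-ins, not a real API
Tier = Tuple[str, float, Callable[[str], str]]


def cascade(query: str,
            tiers: Sequence[Tier],
            scorer: Callable[[str, str], float],
            threshold: float) -> Tuple[str, str, float]:
    """Try tiers cheapest-first; stop at the first answer the scorer accepts.

    Returns (tier used, answer, total cost spent). If no answer passes,
    the last (most capable) tier's answer is returned.
    """
    if not tiers:
        raise ValueError("need at least one tier")
    spent = 0.0
    for name, cost, generate in tiers:
        answer = generate(query)
        spent += cost
        if scorer(query, answer) >= threshold:
            return name, answer, spent
    return name, answer, spent


def cheap_model(query: str) -> str:
    return f"cheap answer to {query}"


def standard_model(query: str) -> str:
    return f"standard answer to {query}"


def accept_all(query: str, answer: str) -> float:
    return 1.0


def accept_standard_only(query: str, answer: str) -> float:
    return 1.0 if answer.startswith("standard") else 0.0


tiers = [("cheap", 0.33, cheap_model), ("standard", 1.0, standard_model)]
print(cascade("bump dep", tiers, accept_all, 0.5))
print(cascade("refactor", tiers, accept_standard_only, 0.5))
```

When the cheap answer is rejected, the cascade spends 0.33 + 1.0, the same total a manually re-dispatched cloud-agent session pays; the difference is that the cascade triggers the fallback itself instead of waiting for human review.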

When This Backfires

No in-session escalation. A failed cheap-tier PR is caught at human review, after the premium request has billed. Re-dispatching at Sonnet pays both multipliers: 0.33 + 1 = 1.33 premium requests vs 1.0 for pinning Sonnet up front, a third more than the safe choice.
Router collapse. arxiv:2602.03478 shows that "routers systematically default to the most capable and most expensive model even when cheaper models already suffice" as cost budgets rise. A human picker is prone to the same drift, reverting to the safe default under shipping pressure.
Mid-session swap regressions. GitHub warns "Switching models mid-session has shown increased cost without ample improvements in quality" (GitHub Docs: Auto Model Selection). One VS Code reporter saw "repeated mistakes on things I'd corrected multiple times" from Auto's silent swaps, resolved only by pinning Sonnet (microsoft/vscode#285064).
Picker-less triggers fall back to Auto. Webhook automation and third-party orchestrators bypass the picker and fall through to Auto's reliability-only variant, which optimises for pool health, not task fit.
Long-context multi-file refactors. The benchmarks behind Anthropic's "similar levels of coding performance to Sonnet 4" claim favour short-context tasks, and long-context multi-file refactors, the canonical cloud-agent workload, are where the capability gap widens.

Example

A platform team splits cloud-agent dispatch into two issue classes:

Type A (dependency bumps, changelog wording): bounded scope, single-file edits, reviewer-rejection rate under 10% on the last 20 Haiku PRs → pin Claude Haiku 4.5 (0.33 premium requests per session).
Type B (cross-service refactors with API contract changes): multi-file edits, rejection rate ~35% on past Haiku PRs → pin Claude Sonnet 4.5 (1 premium request per session).

The dispatch rule lives in the team's runbook, not in code — the cloud agent has no programmatic model-pinning API at session start. The runbook tracks the cheap-tier failure rate monthly; when it crosses 25% on a given class, that class moves to the Sonnet default until the rate recovers.
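The monthly runbook check amounts to grouping PR outcomes by issue class and model_id. The sketch below assumes a hypothetical export of PR outcomes (the `PrOutcome` fields and the model_id strings are invented for illustration) and treats a reviewer rejection or a re-dispatch at a higher tier as a cheap-tier failure.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical model_id strings for the 0.33x tier.
CHEAP_MODEL_IDS = {"claude-haiku-4.5", "gpt-5.4-mini"}


@dataclass
class PrOutcome:
    issue_class: str  # e.g. "A" (dependency bumps) or "B" (cross-service refactors)
    model_id: str
    failed: bool      # rejected by a reviewer or re-dispatched at a higher tier


def cheap_failure_rates(outcomes: Iterable[PrOutcome]) -> Dict[str, float]:
    """Cheap-tier failure rate per issue class; classes with no cheap PRs are omitted."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for o in outcomes:
        if o.model_id in CHEAP_MODEL_IDS:
            totals[o.issue_class] += 1
            failures[o.issue_class] += int(o.failed)
    return {cls: failures[cls] / totals[cls] for cls in totals}


def monthly_defaults(outcomes: Iterable[PrOutcome], threshold: float = 0.25) -> Dict[str, str]:
    """Move a class to the Sonnet default once its cheap-tier failure rate crosses the threshold."""
    return {cls: ("sonnet" if rate > threshold else "cheap")
            for cls, rate in cheap_failure_rates(outcomes).items()}


month = ([PrOutcome("A", "claude-haiku-4.5", i < 1) for i in range(20)]
         + [PrOutcome("B", "claude-haiku-4.5", i < 7) for i in range(20)])
print(monthly_defaults(month))
```

The sample month mirrors the example's two classes: 1 failure in 20 Type A PRs (5%) keeps the cheap default, while 7 in 20 Type B PRs (35%) moves that class to Sonnet.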

Key Takeaways

The Copilot cloud agent's cheap tier (Haiku 4.5, GPT-5.4 mini at 0.33x) is operator-dispatched, not classifier-dispatched — the human picker is the routing signal.
Four conditions have to hold together: bounded scope, per-tier quality telemetry, a cheap-tier failure rate low enough (under ~25%) to keep rework cost bounded, and a picker-exposed entrypoint.
No documented in-session escalation: a failed cheap-tier PR re-dispatched at Sonnet costs 0.33 + 1 = 1.33 premium requests vs 1.0 for pinning Sonnet up front.
Each steering comment bills at the session's tier, so a multi-round cheap-tier session can cost more than a clean single-round Sonnet session.
Track per-model_id PR acceptance, retry, and reviewer-rejection rates — without that telemetry, regressions hide behind the multiplier savings.

Related

Utility-Model Split — splits background harness calls inside one user turn, complementary to session-level tier routing.
Auto Model Selection — vendor-side per-request brokering; the fallback when the picker is not exposed.
Cost-Aware Agent Design — taxonomic framework for per-task tier routing across an entire harness.
Code-Health-Gated LLM Tier Routing — research proposal using code health as the routing signal at task dispatch.
Gateway Model Routing — the discovery layer beneath any tier-routing decision.
GitHub Copilot Cloud Agent — the cloud-agent surface this routing pattern targets.