Decomposing Agent Output Variability by Layer (Sampling vs Orchestration State)
Run-to-run agent variability has at least three distinct layers — separate them before picking a mitigation, because lowering temperature does not fix an orchestration-state cascade.
This pattern pays off only when three conditions hold: the agent runs a real multi-step orchestration loop (not a single model call), you control enough of the stack to act on the attribution, and the mitigation cost is justified by the decision. Single-shot or hosted-API-only workloads should skip to pass@k and pass^k metrics and accept the aggregate spread.
When they hold, decomposition is the framework Hydari & Iqbal (2026) propose for validating non-deterministic coding agents: trace the variability to the layer where the mitigation lives, instead of tuning whichever knob is most visible [Source: Hydari & Iqbal, The Token Not Taken].
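For workloads that fall back to aggregate metrics, the following is a minimal sketch of how pass@k and pass^k can be estimated from `n` recorded runs with `c` passes. It assumes the commonly used combinatorial estimators (sampling `k` runs without replacement from the `n` observed); the source does not say which estimator it has in mind, and the function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k runs passes) from c passes in n runs."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 - P(all k drawn runs are failures); comb() is 0 when k > n - c.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k runs pass) from c passes in n runs (pass^k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # P(all k drawn runs are passes); comb() is 0 when k > c.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    n, c = 10, 7  # illustrative numbers, not measured data
    for k in (1, 3, 5):
        print(k, round(pass_at_k(n, c, k), 3), round(pass_hat_k(n, c, k), 3))
```

The two metrics pull apart as `k` grows: pass@k rewards an agent that succeeds at least once, while pass^k rewards one that succeeds every time, which is the property that matters when a single run has to be trusted.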
The Three Layers
| Layer | What varies | What fixes it |
|---|---|---|
| Intrinsic (token sampling) | Per-step stochastic selection over the next-token distribution at temperature > 0 | Lower temperature, fix the seed (where supported), greedy decode |
| Extrinsic (infrastructure) | Floating-point reduction order across hardware, kernels, and server batch size — drifts even at temperature 0 | Batch-invariant kernels, pinned hardware, fixed-batch inference |
| Orchestration state | Across-step accumulation of tool outputs, errors, and context that conditions every subsequent step | Tighter system prompt, deterministic tools, narrower state surface, retries with reset |
Practitioners reach for the intrinsic layer first because temperature is the most visible knob. The extrinsic layer is invisible from the API surface: Thinking Machines (2025) showed that even at temperature 0 hosted endpoints drift because the forward pass is not batch-invariant — server batch size changes the floating-point reduction order in normalisation, matmul, and attention. The orchestration-state layer compounds: each step's intrinsic noise becomes the next step's deterministic input.
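The reduction-order effect can be seen without any model at all. The sketch below is a plain-Python illustration, not a reproduction of any GPU kernel: it shows that floating-point addition is not associative, and that reducing the same values in different chunk sizes (a rough stand-in for a batch-size-dependent reduction) can change the result. The helper names are hypothetical.

```python
import math


def sequential_sum(values):
    """Plain left-to-right accumulation, written out so no compensated summation is applied."""
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, chunk_size):
    """Reduce fixed-size chunks first, then add the partial sums."""
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


# Same three numbers, different grouping.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False
print(left, right)    # 0.6000000000000001 0.6

# Same ten numbers, different chunking of the reduction.
values = [0.1] * 10
print(chunked_sum(values, 5), chunked_sum(values, 10))
print(math.isclose(chunked_sum(values, 5), chunked_sum(values, 10)))
```

The differences are in the last bits, but a logit that shifts by one bit can flip an argmax between two near-tied tokens, which is enough to start a different trajectory.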
Why It Works
A single trajectory cannot distinguish the layers, because once a step completes, whatever noise it carried is baked into the state that the next step treats as fixed input. The paper's mechanism: "sampling introduces stochasticity at each token; when agents iterate, this compounds across steps because state encodes all prior decisions" [Source: Hydari & Iqbal]. Holding the other layers constant while varying one gives the attribution:
- Isolate intrinsic: fix the prompt, the tool sequence, and the inference setup; vary the seed (or run many times at a fixed temperature > 0). The variance that remains is sampling-driven.
- Isolate extrinsic: fix the prompt, run at temperature 0, and vary the server-side batch size (or compare inference backends). The variance that remains is infrastructure-driven.
- Isolate orchestration state: fix the prompt and use greedy decoding; perturb the tool outputs or context order between runs. The variance that remains is state-driven.
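The three isolation steps above can be sketched as a small harness that varies one field at a time and reports the spread per condition. Everything here is an assumption for illustration: `RunConfig`, `toy_agent`, and the noise magnitudes are hypothetical stand-ins, and a real setup would replace `toy_agent` with an actual agent run that returns a task score.

```python
import random
import statistics
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    seed: int = 0          # sampling seed (intrinsic layer)
    batch_size: int = 1    # stand-in for server batch size (extrinsic layer)
    tool_variant: int = 0  # stand-in for perturbed tool output (orchestration state)


def toy_agent(cfg: RunConfig) -> float:
    """Hypothetical agent run: returns a task score with made-up noise per layer."""
    rng = random.Random(cfg.seed)
    score = 0.7 + rng.uniform(-0.05, 0.05)      # sampling-driven wobble
    score += 0.01 * ((cfg.batch_size % 3) - 1)  # infrastructure-driven wobble
    score -= 0.1 * (cfg.tool_variant % 2)       # state-driven shift
    return score


def spread(run, configs):
    """Population standard deviation of scores across a set of configs."""
    return statistics.pstdev(run(c) for c in configs)


base = RunConfig()
conditions = {
    "control": [base] * 20,  # nothing varies
    "intrinsic": [replace(base, seed=s) for s in range(20)],
    "extrinsic": [replace(base, batch_size=b) for b in range(1, 21)],
    "orchestration": [replace(base, tool_variant=t) for t in range(20)],
}
report = {name: spread(toy_agent, cfgs) for name, cfgs in conditions.items()}

for name, sd in report.items():
    print(f"{name:14s} stdev={sd:.4f}")
```

The control row is the sanity check: if it is not zero, some layer that was supposed to be held fixed is still moving, and the other rows cannot be read as clean attributions.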
An independent agentic-evaluation study found single-run pass@1 standard deviations exceeding 1.5 percentage points even at temperature 0, "because trajectories diverge early, often within the first few percent of tokens, and these small differences cascade into different solution strategies" [Source: arXiv:2602.07150]. That cascade is the orchestration-state layer. The early divergence is what seeds it: at temperature > 0 it can come from the intrinsic layer, while at temperature 0, where sampling is switched off, it has to come from the extrinsic layer.
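A back-of-the-envelope model shows why early divergence dominates long runs. The sketch below assumes, purely for illustration, that each step diverges from a reference trajectory independently with a fixed probability `p_step`; neither the independence assumption nor the numbers come from the cited studies.

```python
def prob_identical(p_step: float, n_steps: int) -> float:
    """P(trajectory still matches the reference after n_steps), assuming
    independent per-step divergence with probability p_step."""
    if not 0.0 <= p_step <= 1.0 or n_steps < 0:
        raise ValueError("need 0 <= p_step <= 1 and n_steps >= 0")
    return (1.0 - p_step) ** n_steps


if __name__ == "__main__":
    for n in (1, 10, 50, 200):
        # With a 1% per-step divergence rate:
        print(n, round(prob_identical(0.01, n), 3))
    # prints 0.99, 0.904, 0.605, 0.134 for n = 1, 10, 50, 200
```

Under this toy model a 1% per-step divergence rate leaves only about a 13% chance that a 200-step run matches its reference, which is why a per-step noise level that looks negligible still produces visibly different end states.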
When This Backfires
The decomposition adds cost — at least one extra controlled run per layer isolated, plus the infrastructure to vary one layer at a time. It does not pay off in several conditions:
- Hosted-API consumers with no infrastructure control: Anthropic, OpenAI, and Google do not expose batch size, stable seeds, or kernel variants to API clients in 2026. Intrinsic and extrinsic variance are operationally indistinguishable from the client side, so the only mitigation is multi-run characterisation regardless of layer.
- Single-step or short-horizon agents: when the agent makes one model call with no tool-loop iteration, the orchestration-state layer is empty, so there is no across-step state for a Markov-chain reliability model to describe. Attribution adds effort without changing what you can do about the variance.
- Latency-critical or single-shot tasks: CI gating, code completions, and one-shot migration scripts cannot pay the multi-run cost. Fall back to aggregate pass@k and pass^k metrics, tighten guardrails so any single run is acceptable, and skip the variance decomposition.
- External non-determinism dominant: when tool outputs are themselves stochastic (web search, third-party APIs, timing-dependent state), the dominant variance is outside all three layers, and the framework misattributes external noise to whichever layer happens to be isolated when the external source is quiet.
- Model-version drift underneath: silent endpoint updates and A/B routing introduce a fourth layer (model identity) the paper does not address. If you suspect drift, compare a frozen baseline against the live endpoint with behavioral testing before attributing variance to any named layer.
The dominant misattribution is treating a batch-invariance drift as a temperature problem and lowering temperature until the variance is too small to notice — leaving the bug in place and burning capability headroom. The second is treating an orchestration-state cascade as flaky-test variance and re-running for a green build, which never converges because the state surface keeps growing.
Key Takeaways
- Run-to-run agent variability is not one signal — it is at least three layers (token sampling, infrastructure, orchestration state) that require different mitigations.
- A single trajectory cannot distinguish the layers; isolate one at a time by holding the others fixed.
- Temperature 0 does not give you reproducibility on hosted endpoints because the forward pass is not batch-invariant — variability at T=0 is an extrinsic signal, not a sampling one.
- Decomposition pays off only with a multi-step orchestration loop, enough stack control to act, and mitigation cost justified by the decision. Otherwise, use aggregate metrics and accept the spread.
- The two costly misattributions are lowering temperature to mask a batch-invariance bug and re-running tests to mask an orchestration-state cascade.