Corpus-Level Trace Diagnostics for LLM Agents

A trace corpus past a few hundred runs needs population-level analysis: a scout-investigator pipeline proposes recurring failure hypotheses, then verifies each against corpus evidence.

Corpus-level trace diagnostics runs a structured multi-agent pipeline over a large set of agent execution traces to surface systematic failure patterns (recurring tool misuse, silent reward hacking, drift after long context) that stay invisible when a human inspects one failing trace at a time. It sits above per-trace error analysis, not in place of it. For the statistical route to the same corpus-level question, which clusters judge-labeled traces instead of testing agent-proposed hypotheses, see Macro Evals for Agentic Systems.

When it applies

Apply only when all three conditions hold:

Corpus size ≥ ~100 comparable traces — below this, a human reading every trace outperforms automation. LangSmith caps a single Insights Agent run at 1,000 traces. [Source: LangChain — Insights Agent and Multi-turn Evals]
Traces are long or multi-turn — value comes from patterns hidden across many tool calls. Response-level error analysis serves sub-1k-token interactions better. [Source: Hamel Husain]
A human validates findings — generated "insights" are LLM outputs and can fabricate plausible-but-wrong patterns. Without sampled expert review, the pipeline entrenches false beliefs. [Source: Manglik et al., Insights Generator (arxiv 2605.21347)]

If any condition fails, stay with manual error analysis on a focused sample.
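The three conditions can be encoded as a pre-flight check. This is a minimal sketch, not part of any cited tool: `CorpusProfile`, its field names, and `unmet_conditions` are hypothetical, and the 1,000-token cutoff is an assumption read off the "sub-1k-token" remark above.

```python
from dataclasses import dataclass

MIN_TRACES = 100           # "~100 comparable traces" from the conditions above
SHORT_TRACE_TOKENS = 1000  # assumed cutoff for "short" single-turn interactions


@dataclass
class CorpusProfile:
    """Hypothetical summary of a trace corpus; field names are illustrative."""
    comparable_traces: int
    median_tokens_per_trace: int
    multi_turn: bool
    reviewer_available: bool


def unmet_conditions(profile: CorpusProfile) -> list[str]:
    """Return the conditions that fail; an empty list means the pipeline applies."""
    unmet = []
    if profile.comparable_traces < MIN_TRACES:
        unmet.append("corpus size")
    if not profile.multi_turn and profile.median_tokens_per_trace < SHORT_TRACE_TOKENS:
        unmet.append("trace length")
    if not profile.reviewer_available:
        unmet.append("human validation")
    return unmet


# A large multi-turn corpus with nobody to review findings still fails the check.
print(unmet_conditions(CorpusProfile(250, 8000, True, False)))  # ['human validation']
```

Returning the list of unmet conditions, instead of a bare boolean, makes it obvious which gap to close before running the pipeline.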

Scout-investigator architecture

graph LR
    C[Trace<br>Corpus] --> S[Scout]
    S -->|proposes<br>hypotheses| I[Investigator]
    I -->|tests against<br>corpus| E[Evidence-Backed<br>Findings]
    E -->|sampled review| H[Human Expert]

Scout surveys traces in a wide, cheap pass and emits candidate failure-mode hypotheses (tool X is consistently called before tool Y in failed runs).
Investigator queries the corpus for supporting and counter-evidence on one hypothesis at a time, then promotes it with linked trace IDs or discards it.
Human expert reviews a sampled subset to filter fabricated patterns before you treat findings as ground truth.

The split echoes the division of labor in Anthropic's multi-agent research system, where a lead agent delegates to specialized subagents: a broad surveyor generates hypotheses, a narrow verifier cuts false positives.
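A minimal sketch of the two roles, assuming traces are already loaded in memory. The `Trace` and `Finding` classes, the function names, and the `max_counter_ratio` threshold are all hypothetical, and the scout here is a plain counting heuristic standing in for the model-driven survey described above.

```python
from collections import Counter
from dataclasses import dataclass, field
from itertools import combinations
from typing import Optional


@dataclass
class Trace:
    trace_id: str
    tools: list[str]  # tool calls in execution order
    failed: bool


@dataclass
class Finding:
    hypothesis: tuple[str, str]                          # (earlier_tool, later_tool)
    supporting: list[str] = field(default_factory=list)  # failed runs with the pattern
    counter: list[str] = field(default_factory=list)     # successful runs with the pattern


def ordered_pairs(tools: list[str]) -> set[tuple[str, str]]:
    """All (a, b) where a is called somewhere before b in one trace."""
    return {(a, b) for a, b in combinations(tools, 2) if a != b}


def scout(traces: list[Trace], max_hypotheses: int = 3) -> list[tuple[str, str]]:
    """Wide, cheap pass: propose the tool orderings most common in failed runs.

    Stand-in heuristic (assumption): it never looks at successful runs,
    so its proposals are deliberately unverified.
    """
    counts = Counter()
    for t in traces:
        if t.failed:
            counts.update(ordered_pairs(t.tools))
    return [pair for pair, _ in counts.most_common(max_hypotheses)]


def investigate(hypothesis: tuple[str, str], traces: list[Trace],
                max_counter_ratio: float = 0.5) -> Optional[Finding]:
    """Narrow pass: gather evidence for one hypothesis, then promote or discard."""
    finding = Finding(hypothesis)
    for t in traces:
        if hypothesis in ordered_pairs(t.tools):
            (finding.supporting if t.failed else finding.counter).append(t.trace_id)
    if not finding.supporting:
        return None  # no corroboration
    if len(finding.counter) > max_counter_ratio * len(finding.supporting):
        return None  # strong counter-examples
    return finding  # promoted, with linked trace IDs


corpus = [
    Trace("t1", ["edit", "read"], failed=True),
    Trace("t2", ["edit", "read"], failed=True),
    Trace("t3", ["read", "edit"], failed=False),
    Trace("t4", ["read", "edit"], failed=False),
    Trace("t5", ["edit", "read"], failed=True),
]
for hypothesis in scout(corpus, max_hypotheses=1):
    print(hypothesis, investigate(hypothesis, corpus))
```

The promoted `Finding` keeps its trace IDs so the human reviewer can open the cited runs directly.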

Why it works

Agent failures are governed by a few recurring causal patterns (tool misuse, context drift, reward hacking, missing capability), but each surfaces with high per-trace variance. Aggregating across many traces averages out task-specific noise so the signal becomes detectable.
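One way to see why aggregation helps is the standard error of an observed failure rate, which shrinks with the square root of the number of traces. The sketch below assumes each trace is an independent pass/fail observation, which real corpora only approximate; the function name and the 0.3 failure rate are illustrative.

```python
import math


def failure_rate_standard_error(p: float, n: int) -> float:
    """Standard error of a failure rate p estimated from n independent traces."""
    return math.sqrt(p * (1 - p) / n)


# Same underlying failure rate, increasingly large corpora.
for n in (30, 100, 1000):
    print(n, round(failure_rate_standard_error(0.3, n), 3))
# 30 0.084
# 100 0.046
# 1000 0.014
```

Under these assumptions, a pattern that shifts the failure rate by a few points is lost in the noise at 30 traces and becomes measurable at 1,000.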

The scout-investigator split keeps this safe at scale: scout pattern recognition is cheap but error-prone; investigator verification is expensive but precise. Composed, they mirror clinical diagnostic reasoning: a broad differential, then targeted confirmation. Reported payoff: a 30.4 percentage point scaffold improvement over baseline across HLE, SWE-Bench Pro, TerminalBench, and FeatureBench. [Source: Manglik et al., Insights Generator (arxiv 2605.21347)]

Where it sits among diagnostic layers

Layer	Granularity	Use when
Response-level error analysis	Single output	Building the first eval suite [Hamel Husain]
Per-trajectory decomposition	Single trace, stage-split	Localizing the failing stage [TRAJEVAL]
Cross-trace clustering	Production corpus	Sizing error modes by frequency [LangChain]
Corpus-level scout-investigator	Production corpus	Evidence-backed natural-language findings [Manglik et al.]

The layers compose; they do not replace each other.

When this backfires

Small or fast-iterating corpora: under ~100 traces, scout-investigator overhead exceeds the benefit. Hamel Husain's approach of manually reading 30 traces remains the right tool. [Source: Hamel Husain]
Highly heterogeneous workloads — clustering depends on comparable traces. A corpus mixing code review, RAG QA, and data exploration produces weak clusters and generic insights.
No human review — generated insights can fabricate plausible-but-wrong patterns. Without sampled expert review, the pipeline entrenches false beliefs.
Privacy-sensitive traces — production traces containing PII or proprietary code expand the exposure surface when sent through a hosted pipeline.
Cost and latency — a LangSmith Insights Agent run takes up to 15 minutes for insights and up to 30 minutes for the full report. [Source: LangChain]

Workflow

Decide the corpus. Filter to a comparable trace set (same agent version, task family, time window).
Define the diagnostic question. "Why do tool-use tasks fail at steps 6-15?" beats "Why does the agent fail?" because failures cluster in mid-trajectory steps where early missteps cascade. [Source: Where LLM Agents Fail (arxiv 2509.25370)]
Run the scout pass. A cheap model emits a capped list of candidate hypotheses (for example, "tool X precedes tool Y in failed runs"); long lists dilute investigator effort.
Run the investigator pass. For each hypothesis, fetch supporting and counter-evidence traces. Discard any hypothesis with strong counter-examples or no corroboration.
Sample-review findings. A domain expert reads 10-20% of cited traces per finding before any insight becomes ground truth. This step is not optional.
Convert findings into eval cases. Each confirmed pattern becomes a regression eval — see incident-to-eval synthesis.
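The corpus filter and the sample-review step can be sketched as two small helpers. Everything here is an assumption made for illustration: the `TraceMeta` fields, the function names, the half-open time window, and the default 15% fraction (chosen from the middle of the 10-20% range above).

```python
import math
import random
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TraceMeta:
    """Hypothetical trace metadata; field names are illustrative."""
    trace_id: str
    agent_version: str
    task_family: str
    started_at: datetime


def comparable_subset(traces, agent_version, task_family, start, end):
    """Keep traces sharing agent version, task family, and [start, end) window."""
    return [
        t for t in traces
        if t.agent_version == agent_version
        and t.task_family == task_family
        and start <= t.started_at < end
    ]


def review_sample(cited_trace_ids, fraction=0.15, seed=0):
    """Pick a reproducible sample of one finding's cited traces for expert review."""
    if not cited_trace_ids:
        return []
    # Always hand the reviewer at least one trace, even for small findings.
    k = max(1, math.ceil(fraction * len(cited_trace_ids)))
    return sorted(random.Random(seed).sample(list(cited_trace_ids), k))


traces = [
    TraceMeta("a", "v2", "tool-use", datetime(2025, 1, 5)),
    TraceMeta("b", "v1", "tool-use", datetime(2025, 1, 6)),
    TraceMeta("c", "v2", "rag-qa", datetime(2025, 1, 7)),
]
kept = comparable_subset(traces, "v2", "tool-use",
                         datetime(2025, 1, 1), datetime(2025, 2, 1))
print([t.trace_id for t in kept])  # ['a']
```

Seeding the sampler keeps the reviewed subset stable across reruns, so a reviewer's verdict can be tied to the same trace IDs later.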

Key Takeaways

Corpus-level trace diagnostics is the layer above per-trace error analysis — apply only when corpus size, trace length, and human review capacity justify the overhead.
The scout-investigator split is what makes automated pattern discovery safe: broad cheap proposal followed by targeted expensive verification.
The reported expert gain (30.4pp scaffold improvement) is real but conditional on human review of generated findings — without it, the pipeline manufactures false patterns.
It composes with, and does not replace, manual error analysis on small samples and per-trajectory decomposition on single failures.

Using the Agent to Analyze Its Own Evaluation Transcripts — agent-as-analyst on a smaller transcript batch, focused on tool design changes
Trajectory Decomposition: Diagnose Where Coding Agents Fail — per-trace stage decomposition; the layer below corpus-level diagnostics
Incident-to-Eval Synthesis — convert each confirmed failure pattern into a regression eval case
LLM Agent Bug Fix Taxonomy — empirical failure-mode taxonomy from 930 real agent bugs
Learned Prefix Monitors for Agent Traces — online failure detection at the trace level; complements offline corpus-level analysis
Macro Evals for Agentic Systems — the statistical route to the same population-level question: clustering judge-labeled traces (UMAP/HDBSCAN at ~1,000+ traces) where this page runs hypothesis-driven agents from ~100 traces up