Demand-Driven Repository Auditing

Trace specific data flows across function boundaries on demand instead of analyzing entire codebases. An Initiator-Explorer-Validator architecture finds real bugs at repository scale, at an average cost of $2.54 and 0.44 hours per project.

The whole-codebase ingestion problem

Feeding an entire repository into an LLM context window does not scale. A 250K-line C project exceeds any current context limit. Whole-codebase approaches also produce noisy results, because the model has no directed question to answer.

Demand-driven analysis inverts this: start from a suspicious pattern (a potentially null pointer, an allocation without a matching free), then trace only the call chains that matter. The agent reads functions one at a time, following data flow across boundaries, and stops when the flow is resolved or a bug is confirmed.

RepoAudit (code; ICML 2025 poster) demonstrates this on C/C++ memory safety bugs across 15 projects averaging 251K LoC. It found 40 true bugs at 78.43% precision, for an average of $2.54 and 0.44 hours per project.

Architecture: Initiator-Explorer-Validator

Three components divide the work so each LLM call has a focused, bounded task:

graph TD
    A[Initiator] -->|suspect sites| B[Explorer]
    B -->|trace across functions| B
    B -->|candidate bug report| C[Validator]
    C -->|confirmed| D[True Bug]
    C -->|rejected| E[False Positive]
    B -.->|cache hit| F[Memory Cache]
    F -.->|cached result| B

Initiator

Pattern-matches source code (via tree-sitter or AST queries) to find suspect sites — locations where a bug could exist. Each suspect site captures file path, line number, tracked variable, and bug category. This is a syntactic filter, not semantic analysis — fast and deterministic.
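A suspect-site record and a syntactic filter could be sketched roughly as follows. This is an illustration under stated assumptions, not RepoAudit's implementation: a regular expression stands in for the tree-sitter or AST query, and the names `SuspectSite`, `find_suspect_sites`, `ALLOC_ASSIGN`, and the category string are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SuspectSite:
    """One location where a bug could exist (field names are illustrative)."""
    file_path: str
    line: int
    variable: str
    category: str


# Assumed pattern: `name = malloc(...)` (optionally with a cast), whose result
# may be NULL. A real initiator would use a tree-sitter or AST query instead.
ALLOC_ASSIGN = re.compile(
    r"(\w+)\s*=\s*(?:\([^)]*\)\s*)?(?:malloc|calloc|realloc)\s*\("
)


def find_suspect_sites(file_path: str, source: str) -> list[SuspectSite]:
    """Purely syntactic scan: no semantic analysis, same output on every run."""
    sites = []
    for lineno, text in enumerate(source.splitlines(), start=1):
        match = ALLOC_ASSIGN.search(text)
        if match:
            sites.append(
                SuspectSite(file_path, lineno, match.group(1), "null-pointer-dereference")
            )
    return sites


C_SOURCE = """\
char *make_buf(size_t n) {
    char *buf = malloc(n);
    buf[0] = 0;
    return buf;
}
"""

sites = find_suspect_sites("buf.c", C_SOURCE)
```

Here `sites` holds a single record for `buf` on line 2 of `buf.c`. The filter only nominates the location. Deciding whether `buf` is actually dereferenced while NULL is left to the later stages.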

The initiator also abstracts each function before analysis: the LLM strips irrelevant statements and keeps only those that affect the tracked variable. This improved true positive detection by 47.5% in ablation studies.

Explorer

Takes a suspect site and traces the relevant data flow across function boundaries. At each call site, the explorer:

Reads the callee function
Asks the LLM: "Does this function affect the tracked variable's state?"
If yes, continues tracing into that function
If the flow resolves (variable is checked/freed/initialized), stops — no bug

The explorer follows demand-driven traversal: it only reads functions that appear on the data-flow path, not the entire call graph.
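The traversal described above can be sketched as a depth-bounded worklist. This is a simplified illustration, not the paper's algorithm: a hand-written `FunctionInfo` table stands in for the LLM's answers, a single path is assumed, and every name here (`FunctionInfo`, `explore`, `MAX_DEPTH`, the example functions) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionInfo:
    """Stand-in for what an LLM would report after reading one function."""
    affects_tracked: bool   # does the function affect the tracked variable?
    resolves: bool = False  # is the variable checked/freed/initialized here?
    callees: list[str] = field(default_factory=list)  # calls receiving the variable


MAX_DEPTH = 4  # assumed bound on the length of the call chain


def explore(program, entry, max_depth=MAX_DEPTH):
    """Return (verdict, functions_read) for one suspect site located in `entry`."""
    read = []
    stack = [(entry, 1)]
    while stack:
        name, depth = stack.pop()
        info = program[name]      # "read" the function
        read.append(name)
        if not info.affects_tracked:
            continue              # not on the data-flow path: do not descend
        if info.resolves:
            return "no bug", read  # flow resolved, stop tracing
        if depth < max_depth:
            stack.extend((callee, depth + 1) for callee in reversed(info.callees))
    return "candidate bug", read


program = {
    "parse": FunctionInfo(True, callees=["fill", "audit"]),
    "fill": FunctionInfo(True, callees=["write_header"]),
    "audit": FunctionInfo(False),
    "write_header": FunctionInfo(True),
    "unrelated": FunctionInfo(True, resolves=True),  # never called with the variable
}

verdict, read = explore(program, "parse")
```

In this run `unrelated` is never read, because no function on the path passes the tracked variable to it. The flow stays unresolved, so the verdict is a candidate report for the validator rather than a confirmed bug.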

Validator

Receives a candidate bug report and independently verifies it. The validator re-examines the full path the explorer traced, checking for:

Path feasibility (can the conditions actually co-occur?)
Aliasing (does another variable reference the same memory?)
Error handling (is the null case caught by a different mechanism?)

Removing the validator increased false positives by 245.5% in ablation — mechanical re-verification of LLM-generated claims is not optional.
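As a toy illustration of the path-feasibility check only, the sketch below rejects a candidate whose traced path requires the same variable to be both NULL and non-NULL. It is a deliberately narrow stand-in: aliasing and error handling are not modeled, and `Candidate`, `path_is_feasible`, and `validate` are hypothetical names, not part of RepoAudit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate report with the branch conditions along its path."""
    site: str
    path_conditions: tuple[tuple[str, bool], ...]  # (variable, is_null) per branch


def path_is_feasible(candidate: Candidate) -> bool:
    """A path is infeasible if it needs one variable to be NULL and non-NULL."""
    seen: dict[str, bool] = {}
    for variable, is_null in candidate.path_conditions:
        # setdefault returns the earlier assumption if one exists.
        if seen.setdefault(variable, is_null) != is_null:
            return False
    return True


def validate(candidates):
    """Split candidates into (confirmed, rejected) by re-checking each path."""
    confirmed = [c for c in candidates if path_is_feasible(c)]
    rejected = [c for c in candidates if not path_is_feasible(c)]
    return confirmed, rejected


real = Candidate("buf.c:3", (("buf", True),))
# The explorer's path took the `p != NULL` branch, then assumed `p == NULL`.
spurious = Candidate("io.c:40", (("p", False), ("p", True)))

confirmed, rejected = validate([real, spurious])
```

The check is independent of how the candidate was produced, which is the point of the stage: it re-examines the recorded path instead of trusting the explorer's conclusion.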

Cache per-function results

Without a cache, the agent re-analyzes the same function repeatedly when multiple suspect sites share functions in their call chains. RepoAudit's memory system caches results as M(function, variable@statement): a specific variable at a specific point in a specific function. This reduced LLM calls by a factor of 3 to 30, depending on the project, and is the main reason repository-scale analysis stays affordable.

Cache key design matters. Function-level granularity alone is too coarse, because the same function may behave differently for different tracked variables. Statement granularity within a function-variable pair is the right level.
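A minimal sketch of such a cache, keyed on (function, variable, statement), might look like this. The class `AnalysisCache`, the `fake_llm` stub, and the example function and variable names are assumptions made for illustration. The stub stands in for an actual LLM call.

```python
from typing import Callable

CacheKey = tuple[str, str, int]  # (function, variable, statement line)


class AnalysisCache:
    """Memoizes per-function analysis results and counts the uncached calls."""

    def __init__(self, analyze: Callable[[str, str, int], str]):
        self._analyze = analyze
        self._results: dict[CacheKey, str] = {}
        self.llm_calls = 0

    def query(self, function: str, variable: str, statement: int) -> str:
        key = (function, variable, statement)
        if key not in self._results:
            self.llm_calls += 1  # only cache misses reach the model
            self._results[key] = self._analyze(function, variable, statement)
        return self._results[key]


def fake_llm(function: str, variable: str, statement: int) -> str:
    """Stub standing in for an LLM analysis of one function."""
    return f"{variable}@{statement} in {function}: may be NULL"


cache = AnalysisCache(fake_llm)

# Three suspect sites whose call chains all pass through `read_block`.
for _ in range(3):
    cache.query("read_block", "buf", 12)

# Same function, different tracked variable: the `buf` result must not be reused.
cache.query("read_block", "len", 12)
```

Four queries cost two model calls here. A key of `function` alone would have returned the `buf` result for `len`, which is the coarseness problem described above.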

Where LLMs add value over traditional tools

Rule-based static analysis tools (Meta Infer, Amazon CodeGuru) struggle with pointer aliasing and path feasibility, the same inter-procedural hard cases Infer's authors flag as scaling challenges. On the RepoAudit benchmark, Infer found 7 true bugs (2 false positives) across 8 projects, CodeGuru found 0 true bugs (18 false positives), and RepoAudit found 40 true bugs (11 false positives) across 15 projects.
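The precision figures implied by these counts can be reproduced with a few lines of arithmetic, where precision is true bugs divided by all reports. The helper name `precision` and the `results` table are illustrative. The counts are the ones quoted above.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported bugs that are real; 0.0 when nothing was reported."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0


# (true bugs, false positives) as quoted for each tool.
results = {"RepoAudit": (40, 11), "Infer": (7, 2), "CodeGuru": (0, 18)}

table = {
    tool: round(100 * precision(tp, fp), 2)
    for tool, (tp, fp) in results.items()
}
# table == {"RepoAudit": 78.43, "Infer": 77.78, "CodeGuru": 0.0}
```

The 40 / (40 + 11) result matches the 78.43% precision reported for RepoAudit. Infer's precision is similar, but on 8 projects rather than 15 and with far fewer bugs found, so the counts matter more than the ratios when comparing the tools.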

The LLM advantage concentrates in alias analysis (do two pointers reference the same memory?), path feasibility (can these conditions co-occur?), and cross-function reasoning (how does a callee affect the caller's invariants?) — exactly where rule-based tools produce the most false positives.

Limitations

Call chain depth is bounded (RepoAudit uses 4 functions) — deeper inter-procedural bugs are missed
Requires language-specific pattern matchers (tree-sitter grammars) for each bug type — not zero-shot
Demonstrated only on C/C++ memory safety bugs — applicability to other bug classes (logic errors, race conditions) and other languages is not established by the paper
Dynamically typed languages (Python, JavaScript) make static data-flow tracing harder; the demand-driven approach has not been evaluated outside statically typed C/C++

Key Takeaways

Trace specific data flows on-demand — the agent reads only functions on the path
Split into detect (Initiator), trace (Explorer), verify (Validator) — each LLM call has a focused task
Removing the validator increased false positives by 245.5% — re-verification is essential
Abstract functions before analysis — 47.5% improvement in true positive detection
Cache at (function, variable, statement) granularity for affordable repo-scale analysis
LLMs outperform traditional tools on alias analysis and path feasibility

Deterministic Guardrails Around Probabilistic Agents — the validator component applies this pattern to static analysis
Incremental Verification: Check at Each Step, Not at the End — the Explorer checks at each function boundary rather than deferring verification to the end of the trace
Coverage-Guided Agents for Fuzz Harness Generation — another agent-driven code analysis technique
Layered Accuracy Defense — the Initiator-Explorer-Validator split applies layered verification where each stage catches errors the previous stage is not designed to catch
Five-Pass Blunder Hunt — repeated review passes finding progressively deeper issues