Demand-Driven Repository Auditing

Trace specific data flows across function boundaries on demand instead of analyzing entire codebases. An Initiator-Explorer-Validator architecture finds real bugs at repository scale, at an average cost of $2.54 and 0.44 hours per project.

The whole-codebase ingestion problem

Feeding an entire repository into an LLM context window does not scale. A 250K-line C project is far larger than most models' context windows. Whole-codebase approaches also produce noisy results, because the model has no directed question to answer.

Demand-driven analysis inverts this: start from a suspicious pattern (a potentially null pointer, an allocation without a matching free), then trace only the call chains that matter. The agent reads functions one at a time, following data flow across boundaries, and stops when the flow is resolved or a bug is confirmed.

RepoAudit (code; ICML 2025 poster) demonstrates this on C/C++ memory safety bugs across 15 projects averaging 251K LoC, finding 40 true bugs at 78.43% precision, at an average cost of $2.54 and 0.44 hours per project.

Architecture: Initiator-Explorer-Validator

Three components divide the work so each LLM call has a focused, bounded task:

graph TD
    A[Initiator] -->|suspect sites| B[Explorer]
    B -->|trace across functions| B
    B -->|candidate bug report| C[Validator]
    C -->|confirmed| D[True Bug]
    C -->|rejected| E[False Positive]
    B -.->|cache hit| F[Memory Cache]
    F -.->|cached result| B

Initiator

Pattern-matches source code (via tree-sitter or AST queries) to find suspect sites — locations where a bug could exist. Each suspect site captures file path, line number, tracked variable, and bug category. This is a syntactic filter, not semantic analysis — fast and deterministic.
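
A minimal sketch of what a suspect-site record and a syntactic scan might look like. It uses regular expressions rather than tree-sitter to stay self-contained, and the names `SuspectSite`, `PATTERNS`, and `find_suspect_sites`, along with the two bug category labels, are assumptions made for illustration and are not taken from RepoAudit.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SuspectSite:
    file_path: str
    line: int
    variable: str
    bug_category: str


# Hypothetical patterns: a pointer assigned NULL, and a pointer assigned
# the result of malloc/calloc (optionally through a cast).
PATTERNS = [
    ("null_pointer_dereference", re.compile(r"\b(\w+)\s*=\s*NULL\s*;")),
    ("memory_leak",
     re.compile(r"\b(\w+)\s*=\s*(?:\([^()]*\)\s*)?(?:malloc|calloc)\s*\(")),
]


def find_suspect_sites(file_path: str, source: str) -> List[SuspectSite]:
    """Scan C source line by line and record every pattern match."""
    sites = []
    for lineno, text in enumerate(source.splitlines(), start=1):
        for category, pattern in PATTERNS:
            match = pattern.search(text)
            if match:
                sites.append(
                    SuspectSite(file_path, lineno, match.group(1), category)
                )
    return sites


example = "char *buf = malloc(16);\nint *p = NULL;\nuse(buf);\n"
sites = find_suspect_sites("a.c", example)
```

The scan is purely syntactic: it records where a bug could start and makes no claim about whether one exists. That question is left to the later stages.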

The initiator also abstracts each function before analysis: the LLM strips irrelevant statements and keeps only those that affect the tracked variable. This improved true positive detection by 47.5% in ablation studies.
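
To make the idea of abstraction concrete, here is a crude syntactic stand-in: it keeps only the lines of a function body that mention the tracked variable, plus lone braces. This is not the LLM-based abstraction described above, only an approximation of its effect, and `abstract_function` is a hypothetical name.

```python
import re


def abstract_function(body: str, variable: str) -> str:
    """Keep lines that mention `variable` as a whole word, plus lone braces."""
    word = re.compile(rf"\b{re.escape(variable)}\b")
    kept = []
    for line in body.splitlines():
        if line.strip() in ("{", "}") or word.search(line):
            kept.append(line)
    return "\n".join(kept)


body = "{\n  int n = 0;\n  char *p = get();\n  log(n);\n  use(p);\n}"
abstracted = abstract_function(body, "p")
```

A filter like this drops enclosing conditions that do not name the variable, even though they can decide whether a path is feasible. That gap is one reason a semantic, model-driven abstraction is the more plausible choice in practice.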

Explorer

Takes a suspect site and traces the relevant data flow across function boundaries. At each call site, the explorer:

  1. Reads the callee function
  2. Asks the LLM: "Does this function affect the tracked variable's state?"
  3. If yes, continues tracing into that function
  4. If the flow resolves (variable is checked/freed/initialized), stops — no bug

The explorer follows demand-driven traversal: it only reads functions that appear on the data-flow path, not the entire call graph.
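
The steps above can be sketched as a small traversal loop. This is a simplified illustration under stated assumptions: the LLM is replaced by a callable `ask` that returns one of three hypothetical verdict strings, the call graph is a plain dictionary, and `Function`, `explore`, and `max_functions` are invented names. The default bound of 4 traced functions mirrors the limit mentioned under Limitations.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Function:
    name: str
    body: str
    callees: List[str] = field(default_factory=list)


# Stand-in for the LLM query: returns "affects", "resolved", or "irrelevant".
Oracle = Callable[[Function, str], str]


def explore(start: str, variable: str, functions: Dict[str, Function],
            ask: Oracle, max_functions: int = 4) -> Optional[List[str]]:
    """Return the traced functions if the flow stays unresolved, else None."""
    traced = [start]
    visited = {start}
    frontier = [start]
    while frontier:
        current = functions[frontier.pop()]
        for name in current.callees:
            callee = functions.get(name)
            if callee is None or name in visited:
                continue  # no source available, or already examined
            visited.add(name)
            verdict = ask(callee, variable)  # read the callee, ask the oracle
            if verdict == "resolved":
                return None  # variable checked/freed/initialized: no bug
            if verdict == "affects" and len(traced) < max_functions:
                traced.append(name)
                frontier.append(name)  # keep tracing into this callee
    return traced


functions = {
    "main": Function("main", "p = parse(); log_msg(p);", ["parse", "log_msg"]),
    "parse": Function("parse", "return check(buf);", ["check"]),
    "log_msg": Function("log_msg", "format_all(msg);", ["format_all"]),
    "check": Function("check", "if (!buf) return NULL;", []),
    "format_all": Function("format_all", "/* large, unrelated */", []),
}
verdicts = {"parse": "affects", "check": "resolved"}
read: List[str] = []


def stub_oracle(fn: Function, variable: str) -> str:
    read.append(fn.name)  # record which functions were actually read
    return verdicts.get(fn.name, "irrelevant")


result = explore("main", "p", functions, stub_oracle)
```

In this run `format_all` is never read, because `log_msg` was judged irrelevant to the tracked variable. That is the demand-driven property: the cost follows the data-flow path, not the size of the call graph.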

Validator

Receives a candidate bug report and independently verifies it. The validator re-examines the full path the explorer traced, checking for:

  • Path feasibility (can the conditions actually co-occur?)
  • Aliasing (does another variable reference the same memory?)
  • Error handling (is the null case caught by a different mechanism?)

Removing the validator increased false positives by 245.5% in ablation — mechanical re-verification of LLM-generated claims is not optional.
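
As an illustration of the first check only, path feasibility can be approximated mechanically by looking for contradictory branch outcomes along the traced path. The sketch below compares branch predicates by their text, which is a strong simplifying assumption, and `path_is_feasible` is a hypothetical helper, not RepoAudit's validator. Aliasing and error handling need semantic reasoning that this sketch does not attempt.

```python
from typing import Dict, Iterable, Tuple

# (branch predicate as written in the source, outcome the path requires)
Condition = Tuple[str, bool]


def path_is_feasible(conditions: Iterable[Condition]) -> bool:
    """Reject a path that needs the same predicate to be both true and false."""
    required: Dict[str, bool] = {}
    for predicate, outcome in conditions:
        # setdefault stores the first required outcome and returns it later.
        if required.setdefault(predicate, outcome) != outcome:
            return False
    return True


feasible = path_is_feasible([("p == NULL", True), ("len > 0", True)])
infeasible = path_is_feasible([("p == NULL", True), ("p == NULL", False)])
```

A candidate report whose path fails this check can be rejected without another model call, on the assumption that the predicate's operands do not change between the two branches.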

Cache per-function results

Without a cache, the agent re-analyzes the same function repeatedly when multiple suspect sites share functions in their call chains. RepoAudit's memory system caches results as M(function, variable@statement), that is, the result for a specific variable at a specific point in a specific function. This cut LLM calls by a factor of 3 to 30 depending on the project. It is the main reason repository-scale analysis stays affordable.

Cache key design matters. Function-level granularity alone is too coarse, because the same function may behave differently for different tracked variables. Statement granularity within a function-variable pair is the right level.
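
A minimal sketch of a cache keyed at that granularity, assuming the analysis itself is an expensive callable (here a stub standing in for an LLM call). `FlowCache`, `lookup`, and `llm_calls` are illustrative names, and identifying a statement by its line number is an assumption.

```python
from typing import Callable, Dict, Tuple

# (function name, tracked variable, statement line number)
Key = Tuple[str, str, int]


class FlowCache:
    """Memoize analysis results per (function, variable, statement)."""

    def __init__(self, analyze: Callable[[str, str, int], str]) -> None:
        self._analyze = analyze
        self._results: Dict[Key, str] = {}
        self.llm_calls = 0  # counts cache misses, i.e. real analyses

    def lookup(self, function: str, variable: str, statement: int) -> str:
        key = (function, variable, statement)
        if key not in self._results:
            self.llm_calls += 1
            self._results[key] = self._analyze(function, variable, statement)
        return self._results[key]


cache = FlowCache(lambda f, v, s: f"{f}:{v}@{s}")
cache.lookup("parse", "buf", 12)  # miss: analyzed
cache.lookup("parse", "buf", 12)  # hit: same function, variable, statement
cache.lookup("parse", "len", 12)  # miss: same function, different variable
```

The third lookup shows why a function-only key would be too coarse: it would return the result computed for `buf` when asked about `len`.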

Where LLMs add value over traditional tools

Rule-based static analysis tools (Meta Infer, Amazon CodeGuru) struggle with pointer aliasing and path feasibility, the same inter-procedural hard cases Infer's authors flag as scaling challenges. On the RepoAudit benchmark, Infer found 7 true bugs with 2 false positives (FP) across 8 projects; CodeGuru found 0 true bugs with 18 FP; RepoAudit found 40 true bugs with 11 FP across 15 projects.
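
The precision figures follow directly from these counts, since precision is true positives divided by all reported bugs. The short calculation below uses only the numbers quoted above; the helper name `precision` is illustrative.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported bugs that are real; 0.0 if nothing was reported."""
    reported = true_positives + false_positives
    return true_positives / reported if reported else 0.0


# (true bugs, false positives) as quoted in the text
results = {"RepoAudit": (40, 11), "Infer": (7, 2), "CodeGuru": (0, 18)}
summary = {tool: f"{precision(tp, fp):.2%}" for tool, (tp, fp) in results.items()}
print(summary)
# {'RepoAudit': '78.43%', 'Infer': '77.78%', 'CodeGuru': '0.00%'}
```

Infer's precision is close to RepoAudit's, but its counts cover 8 projects rather than 15, so the two rows are not directly comparable, and the larger gap is in the number of true bugs found.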

The LLM advantage concentrates in alias analysis (do two pointers reference the same memory?), path feasibility (can these conditions co-occur?), and cross-function reasoning (how does a callee affect the caller's invariants?) — exactly where rule-based tools produce the most false positives.

Limitations

  • Call chain depth is bounded (RepoAudit uses 4 functions) — deeper inter-procedural bugs are missed
  • Requires language-specific pattern matchers (tree-sitter grammars) for each bug type — not zero-shot
  • Demonstrated only on C/C++ memory safety bugs — applicability to other bug classes (logic errors, race conditions) and other languages is not established by the paper
  • Dynamically typed languages (Python, JavaScript) make static data-flow tracing harder; the demand-driven approach has not been evaluated outside statically typed C/C++

Key Takeaways

  • Trace specific data flows on demand, so the agent reads only functions on the path
  • Split into detect (Initiator), trace (Explorer), verify (Validator) — each LLM call has a focused task
  • Removing the validator increased false positives by 245.5% — re-verification is essential
  • Abstract functions before analysis — 47.5% improvement in true positive detection
  • Cache at (function, variable, statement) granularity for affordable repo-scale analysis
  • LLMs outperform traditional tools on alias analysis and path feasibility