Predicting Reviewable Code: Pre-Flagging Functions Reviewers Will Delete

AI-generated PRs contain functions that reviewers routinely delete; predictive models can flag the likely deletions before reviewers spend time examining them.

The review burden shift

Agentic coding tools shift work from writing to reviewing. When an agent generates a PR, reviewers must examine code they will ultimately delete — dead code, over-engineered helpers, spec-mismatched implementations. arXiv:2602.17091 shows AI-generated PRs contain a notable portion of functions deleted during review, with deletion reasons producing distinct structural characteristics predictable at AUC 87.1%. Reviewers are spending time on code a pre-filter could have flagged first.

Deletion reason categories (author-derived taxonomy)

arXiv:2602.17091 identifies structural features that distinguish deleted from surviving functions — method name length, lines of code, Halstead volume, and call count — but does not name deletion-reason categories. The taxonomy below is author-derived, organizing those structural signals into three practitioner-facing buckets to make the predictors actionable. Treat the category names as framing, not findings.

Dead code: functions generated but never called from the PR's entry points. This maps to the paper's call-count signal — functions with fewer inbound references (arXiv:2602.17091).

Over-engineering: functions that introduce abstraction the spec did not require — utility helpers, base classes, factory patterns for single-instantiation objects. This maps to the paper's three strongest predictors (longer method names, higher line counts, greater Halstead volume) (arXiv:2602.17091), which together signal more generated code than the task required.

Spec mismatch: functions that implement different behavior than the spec required — wrong signature, wrong return type, wrong preconditions. The paper does not identify this directly. We include it because type-contract divergence is a separate failure mode that structural metrics alone will not catch.

Each bucket calls for a different remediation signal sent back to the agent.
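The spec-mismatch bucket is the one that structural metrics do not cover. One possible check compares each generated function's signature against a declared contract. The sketch below is a minimal illustration and not a method from arXiv:2602.17091: the `EXPECTED` mapping, the function names in it, and `find_signature_mismatches` are all hypothetical, and `ast.unparse` requires Python 3.9 or later.

```python
import ast

# Hypothetical spec contract: function name -> (parameter names, return annotation).
EXPECTED = {
    "load_user": (["user_id"], "User"),
    "save_user": (["user"], "None"),
}

def find_signature_mismatches(source: str, expected: dict) -> list[str]:
    """Report functions whose parameters or return annotation differ from the contract."""
    problems = []
    found = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        if node.name not in expected:
            continue  # not covered by the contract; left to the other detectors
        found.add(node.name)
        want_params, want_return = expected[node.name]
        got_params = [a.arg for a in node.args.posonlyargs + node.args.args]
        got_return = ast.unparse(node.returns) if node.returns else None
        if got_params != want_params:
            problems.append(f"{node.name}: parameters {got_params}, expected {want_params}")
        if got_return != want_return:
            problems.append(f"{node.name}: returns {got_return}, expected {want_return}")
    # Functions the contract requires but the module never defines.
    for name in sorted(set(expected) - found):
        problems.append(f"{name}: required by the spec but not defined")
    return problems

# Illustrative generated module: load_user takes an extra parameter.
generated = '''
def load_user(user_id, cache) -> User:
    ...

def save_user(user) -> None:
    ...
'''

for problem in find_signature_mismatches(generated, EXPECTED):
    print(problem)
```

A check like this only covers the parts of the contract that are visible in the signature; wrong preconditions or wrong behavior behind a correct signature would still need tests or review.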

Why it works

Structural metrics expose scope overreach before a reviewer reads a single line. arXiv:2602.17091 found the strongest predictors of deletion are method name length (word count), total lines of code, and Halstead volume — all proxies for "more was generated than the task required." A function with a long descriptive name and high Halstead volume encodes more conceptual surface area than a focused one. That excess surface area is what reviewers remove. The model reaches AUC 87.1% using only these static, syntax-level features — it needs no semantic understanding of the spec to flag probable deletions.
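The three predictors named above can be approximated with the standard library. The sketch below is an assumption-laden illustration, not the paper's feature extraction: `name_word_count`, `halstead_volume`, and `structural_features` are hypothetical helpers, and the Halstead volume V = N · log2(n) is estimated from the token stream (keywords and punctuation counted as operators; names, numbers, and strings as operands), so the values are not comparable to those in arXiv:2602.17091.

```python
import ast
import io
import keyword
import math
import re
import tokenize

def name_word_count(name: str) -> int:
    """Count words in a snake_case or camelCase identifier."""
    words = []
    for part in name.split("_"):
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return len(words)

def halstead_volume(source: str) -> float:
    """Approximate Halstead volume V = N * log2(n) from the token stream."""
    operators, operands = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        is_keyword = tok.type == tokenize.NAME and keyword.iskeyword(tok.string)
        if tok.type == tokenize.OP or is_keyword:
            operators.append(tok.string)
        elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.STRING):
            operands.append(tok.string)
    total = len(operators) + len(operands)                # N: program length
    distinct = len(set(operators)) + len(set(operands))   # n: vocabulary size
    return total * math.log2(distinct) if distinct else 0.0

def structural_features(source: str) -> dict[str, dict[str, float]]:
    """Compute name word count, lines of code, and approximate Halstead volume per function."""
    features = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            segment = ast.get_source_segment(source, node) or ""
            features[node.name] = {
                "name_words": name_word_count(node.name),
                "loc": node.end_lineno - node.lineno + 1,
                "halstead_volume": halstead_volume(segment),
            }
    return features

example = '''
def build_cache_key(user_id, region):
    return region + ":" + str(user_id)
'''
print(structural_features(example))
```

These raw features would still need a trained model or locally chosen cut-offs before they say anything about deletion probability.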

Applying predictive pre-flagging

Before routing a generated PR to human review, run structural analysis to identify high-deletion-probability functions:

graph TD
    A[Agent generates PR] --> B[Call graph analysis]
    B --> C[Dead code detector]
    A --> D[Spec coverage check]
    D --> E[Spec mismatch detector]
    A --> F[Complexity vs spec scope]
    F --> G[Over-engineering detector]
    C --> H[Pre-flag report]
    E --> H
    G --> H
    H --> I{Flags above threshold?}
    I -->|Yes| J[Return report to agent]
    I -->|No| K[Human review]

The pre-flag report tells the reviewer where to focus. It can also return flagged functions to the agent for regeneration before a human spends time on them.
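The aggregation and routing step at the end of the diagram can be sketched as a small data structure. Everything here is hypothetical: the `Flag` and `PreFlagReport` classes, the `route` function, the 0-to-1 score scale, and the 0.5 default threshold are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Flag:
    function: str
    category: str   # "dead_code", "over_engineering", or "spec_mismatch"
    score: float    # detector confidence in [0, 1]; the scale is an assumption

@dataclass
class PreFlagReport:
    flags: list[Flag] = field(default_factory=list)

    def above(self, threshold: float) -> list[Flag]:
        """Flags whose score meets or exceeds the threshold."""
        return [f for f in self.flags if f.score >= threshold]

def route(report: PreFlagReport, threshold: float = 0.5) -> str:
    """Mirror the decision node: flagged PRs go back to the agent, the rest to human review."""
    return "return_to_agent" if report.above(threshold) else "human_review"

# Illustrative report combining output from two detectors.
report = PreFlagReport([
    Flag("build_cache_key", "dead_code", 0.9),
    Flag("UserFactory", "over_engineering", 0.4),
])
print(route(report))
```

Keeping the category on each flag lets the same report serve both uses described above: a focus list for the reviewer and a per-bucket remediation message for the agent.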

Implications for agent scope instructions

The research outcome is a direct input to agent prompting. Configure your agent's scope instructions to target each deletion category:

Emit only called code: require that generated functions are reachable from specified entry points
Match spec scope: instruct the agent not to abstract beyond what the current task requires
Declare external dependencies explicitly: flag functions that depend on context outside the PR rather than letting the agent silently generate them

A smaller set of generated functions that survives review is worth more than a larger set with a higher deletion rate.
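One way to keep these instructions consistent across tasks is to render them from task data rather than rewriting them by hand. The helper below is a hypothetical sketch: `build_scope_instructions` and the prompt wording are illustrative assumptions, not a tested prompt.

```python
def build_scope_instructions(entry_points: list[str]) -> str:
    """Render the three scope constraints as a prompt fragment; wording is illustrative only."""
    lines = [
        "Scope constraints for this task:",
        f"- Every function you add must be reachable from: {', '.join(entry_points)}.",
        "- Do not add helpers, base classes, or factories the task does not require.",
        "- List any function that depends on code outside this PR instead of generating it silently.",
    ]
    return "\n".join(lines)

print(build_scope_instructions(["handler"]))
```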

Example

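This script demonstrates dead code detection with a simplified, name-based check: it flags functions defined in a generated module that are never called anywhere in that module, apart from a fixed set of entry-point names. It approximates call-graph reachability rather than tracing paths from the PR's entry point, and it targets the most mechanically detectable deletion category.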

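import ast
import sys
from pathlib import Path

# Entry point functions (e.g. main, handler) are excluded from the dead-code check.
ENTRY_POINTS = {"main", "handler", "lambda_handler"}

def get_defined_functions(tree: ast.AST) -> set[str]:
    """Names of all sync and async function definitions, including methods."""
    return {node.name for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))}

def get_called_functions(tree: ast.AST) -> set[str]:
    """Names called directly (foo()) or through an attribute (obj.foo())."""
    called = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                called.add(node.func.id)
            elif isinstance(node.func, ast.Attribute):
                called.add(node.func.attr)
    return called

def flag_dead_code(filepath: str) -> list[str]:
    # Parse once and reuse the tree for both passes.
    tree = ast.parse(Path(filepath).read_text(encoding="utf-8"))
    dead = get_defined_functions(tree) - get_called_functions(tree) - ENTRY_POINTS
    # Dunder methods such as __init__ are invoked implicitly, so skip them.
    return sorted(name for name in dead
                  if not (name.startswith("__") and name.endswith("__")))

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python flag_dead_code.py <module.py>")
    dead = flag_dead_code(sys.argv[1])
    if dead:
        print("Pre-flag: likely dead code (never called within module):")
        for fn in dead:
            print(f"  - {fn}")
        sys.exit(1)
    print("No dead code detected.")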

Running this against a generated module before routing to review:

python flag_dead_code.py generated_module.py
# Pre-flag: likely dead code (never called within module):
#   - _legacy_format
#   - build_cache_key

These two functions are candidates for deletion. Returning this report to the agent rather than to a human reviewer can resolve dead generated code before a human spends a review cycle on it.
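The script above only asks whether a name is called anywhere in the module, so a dead function that is called by another dead function escapes the check. A closer approximation of reachability builds a per-function call map and walks it from the entry points. The sketch below is a single-file, name-based illustration with hypothetical helper names (`build_call_map`, `unreachable_functions`); it ignores dynamic dispatch, functions passed as values, and calls made at module level.

```python
import ast
from collections import deque

def build_call_map(source: str) -> dict[str, set[str]]:
    """Map each function name to the names it calls (direct or attribute calls)."""
    call_map = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            calls = set()
            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    if isinstance(child.func, ast.Name):
                        calls.add(child.func.id)
                    elif isinstance(child.func, ast.Attribute):
                        calls.add(child.func.attr)
            call_map[node.name] = calls
    return call_map

def unreachable_functions(source: str, entry_points: set[str]) -> list[str]:
    """Return functions not reachable from any entry point, using breadth-first search."""
    call_map = build_call_map(source)
    seen = set(entry_points) & set(call_map)
    queue = deque(seen)
    while queue:
        for callee in call_map[queue.popleft()]:
            if callee in call_map and callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return sorted(set(call_map) - seen)

# Illustrative module: build_cache_key calls _legacy_format, but nothing calls build_cache_key.
sample = '''
def main():
    return render(load())

def load():
    return [1, 2, 3]

def render(items):
    return ", ".join(map(str, items))

def build_cache_key(items):
    return _legacy_format(items)

def _legacy_format(items):
    return repr(items)
'''

print(unreachable_functions(sample, {"main"}))
# ['_legacy_format', 'build_cache_key']
```

In this sample the name-based check would flag only `build_cache_key`, because `_legacy_format` is called somewhere in the module; the reachability walk flags both.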

When this backfires

Pre-flagging adds value when the cost of reviewer time exceeds the cost of running structural analysis, but several conditions undermine that trade-off:

Infrastructure and setup functions: functions not yet called within the PR — setup hooks, migration helpers, exported API surface — will appear as dead code to a call-graph analyzer. Treat entry-point configuration as a first-class parameter, not an afterthought.
Cross-file call graphs are expensive: dead code detection that only inspects the generated module (as in the flag_dead_code example above) misses legitimate calls from existing files. Building a full project call graph adds pipeline latency and may require language-specific tooling.
Single-study generalization risk: the AUC 87.1% result comes from one codebase and one AI model. Feature importance will differ across languages, project types, and model generations, so validate false-positive rates locally before routing flags back to the agent.
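False negatives pass bad code unexamined: an AUC of 87.1% measures how well the model ranks deleted functions above surviving ones; it is not an accuracy figure, so it does not imply a fixed miss rate. The share of deletable functions left unflagged depends on the flagging threshold. Reviewers who lean on the report may skip unflagged code too quickly, raising the cost of each missed deletion.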
False positives block valid abstractions: a utility called only once looks like over-engineering by metrics but may be essential for testability or extension. Flags routed back to the agent can regenerate away intentional design decisions — the inverse risk to the abstraction bloat the pattern targets.
Feedback loop without calibration: returning flags for regeneration without calibrating "spec scope" can cause under-generation in later tasks. A regeneration limit and human fallback prevent loops.
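Local validation of the kind recommended above can start with a small replay over past PRs: score each function, record whether reviewers actually deleted it, and compute error rates at candidate thresholds. The sketch below is a minimal illustration; `error_rates` is a hypothetical helper and the scores and labels are invented for the example, not data from the paper.

```python
def error_rates(scores: list[float], deleted: list[bool], threshold: float) -> dict[str, float]:
    """False-positive and false-negative rates when flagging every score >= threshold."""
    fp = sum(s >= threshold and not d for s, d in zip(scores, deleted))
    fn = sum(s < threshold and d for s, d in zip(scores, deleted))
    negatives = sum(not d for d in deleted)   # functions that survived review
    positives = sum(deleted)                  # functions reviewers deleted
    return {
        "false_positive_rate": fp / negatives if negatives else 0.0,
        "false_negative_rate": fn / positives if positives else 0.0,
    }

# Illustrative data only: predicted deletion scores and the review outcome per function.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
deleted = [True, True, False, True, False, False]

for threshold in (0.3, 0.5, 0.7):
    print(threshold, error_rates(scores, deleted, threshold))
```

Comparing thresholds this way makes the trade-off explicit: a lower threshold misses fewer deletable functions but sends more valid abstractions back to the agent.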

Key Takeaways

AI-generated PRs shift the bottleneck from writing to reviewing; predictive pre-filtering reduces that shift's cost
The paper shows deletion likelihood is statistically predictable from structural features (method name length, LOC, Halstead volume, call count); the dead-code, over-engineering, and spec-mismatch grouping is author-derived framing, not a paper result
Agent scope instructions should target the root causes: require reachability, prohibit over-abstraction, match spec scope
Pre-flag reports returned to the agent before human review cut total review cost

Agent-Assisted Code Review
Agentic Code Review Architecture
Diff-Based Review Over Output Review
Signal Over Volume in AI Review
Tiered Code Review: AI-First with Human Escalation
Risk-Based Task Sizing for Agent Verification Depth
Abstraction Bloat — the training incentive that produces over-engineered code and drives the over-engineering deletion category
Agent PR Volume vs. Value