
Law of Triviality in AI PRs

Reviewers bikeshed small changes and rubber-stamp large ones. AI agents produce large diffs by default, so the code that needs the most scrutiny gets the least.

The pattern

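Parkinson's Law of Triviality (1957) observes that groups spend attention in inverse proportion to complexity: the bicycle shed gets a long debate because everyone understands it, while the reactor is waved through. In code review, the small diff is the bicycle shed and the large diff is the reactor.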

Agents routinely produce PRs past the threshold where review stays effective. Hand-written tweaks attract debate while AI diffs pass unexamined. This differs from PR Scope Creep: the cause is reviewer psychology, not scope.

Defect detection collapses with size

The SmartBear/Cisco study (2,500 reviews) puts optimal review at 200-400 LOC in 60-90 minutes; effectiveness drops past 400. Propel quantifies the drop:

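| PR Size (lines) | Defect Detection Rate | Review Time | Comments per PR |
|-----------------|-----------------------|-------------|-----------------|
| 1-200           | 87%                   | ~45 min     | 3.2             |
| 201-300         | ~70%                  | ~60 min     | ~4.1            |
| 301-600         | 65%                   | ~2 hr       | 2.4             |
| 1,000+          | 28%                   | ~4.2 hr     | 1.8             |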

Four hours on 1,000 LOC yields fewer comments than 45 minutes on 200. Fatigue causes disengagement, not depth.
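
To make that comparison concrete, the first and last rows of the table can be normalized to comments per reviewer-hour. This is only an arithmetic sketch: the figures are copied from the table, and the variable names are illustrative.

```python
# Figures copied from the first and last rows of the table above:
# (review time in hours, average comments per PR).
rows = {
    "1-200": (0.75, 3.2),   # ~45 min
    "1,000+": (4.2, 1.8),   # ~4.2 hr
}

# Comments produced per hour of reviewer attention, rounded to 2 places.
comments_per_hour = {
    size: round(comments / hours, 2)
    for size, (hours, comments) in rows.items()
}

for size, rate in comments_per_hour.items():
    print(f"{size:>7} lines: {rate} comments per reviewer-hour")
```

On these numbers the small PR draws roughly ten times as many comments per hour of attention (about 4.27 versus 0.43), which is the disengagement the table points to.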

AI makes it worse

CodeRabbit finds AI PRs contain 1.7x more issues than human code: 3x more readability issues and 75% more logic defects. Three mechanisms compound the problem:

  • Template blindness: AI output follows familiar patterns, so reviewers skim and bugs hide in boilerplate. (AsyncSquad Labs)
  • AI brain fry: sustained AI oversight produces mental fog and higher error rates. (HBR / Help Net Security)
  • Nyquist under-sampling: code production tripled while review sampling stayed flat, so defects alias as passing. (Bryan Finster)

```mermaid
graph LR
    A[Agent generates<br/>large diff] --> B[Reviewer overwhelmed]
    B --> C[Rubber-stamp approval]
    C --> D[Defects ship]
    D --> E[Trust in review<br/>erodes]
    E --> F[Even less scrutiny<br/>on next PR]
    F --> B
```

Mitigation stack

1. Constrain batch size

Target 200-400 LOC per PR, staying under the 400-line mark past which review effectiveness drops. Split agent work into atomic commits and enforce size gates in CI.
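
One way to plan the split before any PR is opened is to pack changed files into batches under a line budget, grouped by area so each batch stays cohesive. The sketch below is a minimal illustration, not a prescribed tool: `plan_batches`, the file paths, and the line counts are all hypothetical, and grouping by top-level directory is an assumption that will not suit every repository.

```python
from collections import defaultdict


def plan_batches(changes: dict[str, int], budget: int) -> list[list[str]]:
    """Group changed files into PR-sized batches.

    `changes` maps file path -> changed line count. Files are grouped by
    top-level directory, then packed greedily until adding the next file
    would exceed `budget`. A single file larger than the budget still
    gets a batch of its own.
    """
    by_area: dict[str, list[str]] = defaultdict(list)
    for path in sorted(changes):
        by_area[path.split("/", 1)[0]].append(path)

    batches: list[list[str]] = []
    for paths in by_area.values():
        current: list[str] = []
        used = 0
        for path in paths:
            if current and used + changes[path] > budget:
                batches.append(current)
                current, used = [], 0
            current.append(path)
            used += changes[path]
        if current:
            batches.append(current)
    return batches


# Hypothetical diff summary for an agent change spanning three areas.
changes = {
    "auth/login.py": 120,
    "auth/tokens.py": 60,
    "billing/invoice.py": 150,
    "billing/tax.py": 70,
    "models/schema.py": 160,
}

for batch in plan_batches(changes, budget=300):
    print(sum(changes[p] for p in batch), batch)
```

With this input the plan comes out as three batches of 180, 220, and 160 changed lines, one per area. The budget is passed explicitly so the same function can back whatever limit the team settles on.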

2. Tiered review

Use tiered code review:

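| Tier | Reviewer                      | Scope                                            |
|------|-------------------------------|--------------------------------------------------|
| 1    | Automated (lint, SAST, tests) | Syntax, style, known vulnerability patterns      |
| 2    | AI-augmented review           | Flag risk hotspots, check for common AI mistakes |
| 3    | Human expert                  | Architecture, business logic, domain context     |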

See Agentic Code Review Architecture.
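
The tiers can be read as a pipeline in which the machine tiers run first and hand the human a short brief instead of a raw diff. The sketch below assumes that reading; `PullRequest`, its fields, and the two check functions are hypothetical stand-ins for real lint/test results and a real AI reviewer.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    tests_pass: bool               # outcome of the automated tier
    touches_sensitive_paths: bool  # hypothetical flag, e.g. auth or billing code


def tier1_automated(pr: PullRequest) -> list[str]:
    """Tier 1: blocking checks (lint, SAST, tests), reduced here to one flag."""
    return [] if pr.tests_pass else ["tests failing"]


def tier2_ai_review(pr: PullRequest) -> list[str]:
    """Tier 2: stand-in for an AI reviewer that flags risk hotspots."""
    return ["risk hotspot: sensitive path"] if pr.touches_sensitive_paths else []


def triage(pr: PullRequest) -> dict:
    """Run tiers 1 and 2, then build the brief for the tier 3 human reviewer."""
    blocking = tier1_automated(pr)
    if blocking:
        # Do not spend human attention on a PR that fails basic checks.
        return {"status": "blocked", "findings": blocking}
    return {"status": "ready for human review", "focus": tier2_ai_review(pr)}


print(triage(PullRequest(tests_pass=False, touches_sensitive_paths=True)))
print(triage(PullRequest(tests_pass=True, touches_sensitive_paths=True)))
```

The design choice worth noting is that only tier 1 blocks. Tier 2 findings narrow the human's focus rather than replace the human's judgment.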

3. Semantic diffing

Review behavior changes, not raw lines. AST diffs and API-contract analysis surface what moved.
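
As a rough illustration of an AST diff, the sketch below uses Python's standard `ast` module to compare two versions of a file by function, ignoring comments and formatting. The helper names and sample sources are hypothetical, and keying by bare function name is a simplification: same-named methods in different classes would collide.

```python
import ast


def function_fingerprints(source: str) -> dict[str, str]:
    """Map each function name to a dump of its AST.

    ast.dump omits line and column numbers by default, and comments never
    reach the AST, so pure reformatting leaves the fingerprint unchanged.
    """
    tree = ast.parse(source)
    return {
        node.name: ast.dump(node)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def semantic_diff(old: str, new: str) -> dict[str, list[str]]:
    """Report functions added, removed, or structurally changed."""
    before, after = function_fingerprints(old), function_fingerprints(new)
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }


OLD = '''
def total(items):
    return sum(i.price for i in items)

def tax(amount):
    return amount * 0.2
'''

NEW = '''
def total(items):
    # Reformatted only; behavior is the same.
    return sum(
        i.price for i in items
    )

def tax(amount):
    return amount * 0.25

def refund(amount):
    return -amount
'''

result = semantic_diff(OLD, NEW)
print(result)
```

Here the reformatted `total` drops out of the report entirely, leaving the reviewer with the one new function and the one changed constant that actually alter behavior.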

4. BDD-first specification

Define expected behavior before the agent codes. Review then becomes validation against pre-agreed criteria. See Spec-Driven Development.
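
A lightweight way to picture this is a list of agreed scenarios that any implementation can be checked against. The sketch below is an assumption-laden illustration and not a BDD framework: `Scenario`, `failed_scenarios`, both implementations, and the 20% tax rate are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scenario:
    given: str        # human-readable precondition
    subtotal: float   # input handed to the implementation
    expected: float   # outcome agreed before any code is written


# Written and agreed before the agent starts. The 20% rate is a made-up value.
SPEC = [
    Scenario("an empty cart", 0.0, 0.0),
    Scenario("a 100.00 subtotal taxed at 20%", 100.0, 120.0),
    Scenario("a 19.99 subtotal taxed at 20%", 19.99, 23.99),
]


def failed_scenarios(impl: Callable[[float], float], spec: list[Scenario]) -> list[str]:
    """Return the descriptions of scenarios the implementation does not meet."""
    return [s.given for s in spec if round(impl(s.subtotal), 2) != s.expected]


def agent_invoice_total(subtotal: float) -> float:
    """Stand-in for agent output that meets the spec."""
    return subtotal * 1.2


def buggy_invoice_total(subtotal: float) -> float:
    """Stand-in for agent output with a logic error: flat 0.20 instead of 20%."""
    return subtotal + 0.2


print(failed_scenarios(agent_invoice_total, SPEC))
print(failed_scenarios(buggy_invoice_total, SPEC))
```

The reviewer's question shifts from "is this diff correct?" to "do the agreed scenarios pass, and are any scenarios missing?", which is a far smaller surface to inspect.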

When this backfires

Size limits fail in three cases: genuinely atomic changes (cross-cutting refactors, schema migrations) that cannot be split safely; monorepos where the cost of coordinating many PRs exceeds the review benefit; and LOC gates that force superficial splits, producing many small PRs that are collectively incoherent.

Example

An agent completes a feature sprint and opens a single 1,400-LOC PR touching auth, billing, and the data model. The reviewer spends 3 hours skimming and approves with two style comments. A logic error in the billing calculation ships.

The same work split into three PRs (auth at 180 LOC, billing at 220 LOC, data model at 160 LOC) would put each review in or near the 1-200 line bracket, where the table above reports about 3.2 comments per PR and an 87% defect detection rate. The billing bug would much more likely have been caught.

CI enforcement keeps scope in check:

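```yaml
# .github/workflows/pr-size.yml
# Needs a prior actions/checkout step with fetch-depth: 0 so origin/main exists.
- name: Check PR size
  run: |
    # Sum added lines across files; binary files report "-" and count as 0.
    LINES=$(git diff --numstat origin/main...HEAD | awk '{n += $1} END {print n + 0}')
    if [ "$LINES" -gt 400 ]; then
      echo "PR adds $LINES lines (limit 400). Split into smaller atomic PRs."
      exit 1
    fi
```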