Human-Equivalent Hours for Autonomous Coding Agent Productivity

Estimate the human engineering hours an autonomous agent's output would have taken — credible only on PR-gated sessions with a paired downstream signal.

When the metric is credible

Human-equivalent hours estimates counterfactual human time, not measured output. It is credible under specific conditions and misleading outside them. Apply it only when:

  • Sessions terminate in a merged PR or pass an independent quality classifier. Cognition's calibration counts a session that opens PRs only if at least one of those PRs merges; non-PR sessions go through a classifier that drops 1–20% as unproductive (Cognition, 2026-06-04).
  • The aggregate covers enough sessions to escape the noise floor. Cognition's held-out r_log = 0.74 places ~50% of estimates within a factor of 2; below ~20 sessions, month-over-month deltas sit inside that band (Cognition, 2026-06-04).
  • A second, observed signal trends alongside it: PR review time, defect rate, or merge-to-revert ratio. Without one, the metric is unanchored (see "When this backfires" below).

Outside these conditions, prefer cost per merged PR alongside review-time-to-merge — both are observed, not estimated.
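
The three conditions above can be read as a gate applied before any hours figure is reported. The following is a minimal sketch under stated assumptions: the Session fields, the function names, and the hard threshold of 20 sessions (taken from the "below ~20 sessions" remark) are hypothetical illustrations, not Cognition's implementation.

```python
from dataclasses import dataclass

@dataclass
class Session:
    # Hypothetical record shape, for illustration only.
    has_pr: bool                         # did the session open any PR?
    pr_merged: bool = False              # did at least one of its PRs merge?
    classifier_productive: bool = False  # verdict of an independent quality classifier

MIN_SESSIONS = 20  # assumption: the "~20 sessions" noise-floor remark used as a hard cutoff

def gated_sessions(sessions):
    """Keep PR sessions with a merged PR and non-PR sessions the classifier accepts."""
    return [
        s for s in sessions
        if (s.pr_merged if s.has_pr else s.classifier_productive)
    ]

def metric_is_credible(sessions, has_downstream_signal, min_sessions=MIN_SESSIONS):
    """True only when enough gated sessions exist and an observed signal is paired."""
    return len(gated_sessions(sessions)) >= min_sessions and has_downstream_signal
```

When the gate returns False, the fallback described above applies: report cost per merged PR and review-time-to-merge instead.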

The definition

Cognition asks "how long would a human engineer have taken to produce the same output?" Hours already denominate salaries and contractor rates, so the result is directly comparable to existing finance and headcount instruments (Cognition, 2026-06-04).

The estimator rests on four design principles:

  1. Reason about the human's path, not the agent's — discount retries, setup, and non-core artifacts a human would not produce.
  2. Credit only unspecified work — measure the agent's contribution against the user's initial problem statement, not the full diff.
  3. Account for codebase familiarity — infer a human's exploration time in an unfamiliar codebase.
  4. Assume relevant expertise — the reference engineer already has the skills; do not credit skill-substitution.

The uncalibrated model is corrected via log-space linear regression:

h = 2.28 × m^0.923

where m is the uncalibrated estimate and h the corrected human-hours figure (a simplified 2.08× constant performs comparably). Calibration: 258 sessions, 126 users; held-out r_log = 0.74 on 233 sessions; F(1,231) = 279.9, p < 10⁻⁵ (Cognition, 2026-06-04).
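
As a sketch of how the correction could be applied in code: the coefficients come from the formula above, while the function names and the input validation are assumptions made for illustration. In log space the same fit reads log h = log 2.28 + 0.923 × log m, which is why it is described as a linear regression.

```python
import math

def calibrate(m_hours: float) -> float:
    """Power-law correction h = 2.28 * m^0.923 (coefficients from the source)."""
    if m_hours <= 0:
        raise ValueError("uncalibrated estimate must be positive")
    return 2.28 * m_hours ** 0.923

def calibrate_simple(m_hours: float) -> float:
    """Simplified variant: a constant 2.08x multiplier."""
    if m_hours <= 0:
        raise ValueError("uncalibrated estimate must be positive")
    return 2.08 * m_hours

def calibrate_via_logs(m_hours: float) -> float:
    """Same correction written as a straight line in log space."""
    return math.exp(math.log(2.28) + 0.923 * math.log(m_hours))

for m in (0.5, 2.0, 8.0):  # hypothetical uncalibrated estimates, in hours
    print(m, round(calibrate(m), 2), round(calibrate_simple(m), 2))
```

Because the exponent is below 1, the power-law form scales long sessions up slightly less than short ones; the constant multiplier ignores that curvature.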

Why code volume is not the metric

Regressing lines changed against human-time estimates produces R²_log = 0.27: code volume explains roughly a quarter of the variance in the human-time estimates (Cognition, 2026-06-04). That is the empirical case against task-completion-rate and PR-count metrics: they correlate weakly with the value the team pays for. When the bottleneck migrates from writing code to reviewing it, generation becomes the cheap part, and generation is exactly what those metrics count.
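
A minimal sketch of how an R²_log figure of this kind could be computed from paired per-session values, assuming an ordinary least-squares fit of log hours on log lines changed. The function name is hypothetical and the sample data is synthetic, invented purely to exercise the function; it is not Cognition's data.

```python
import math

def r_squared_log(xs, ys):
    """R^2 of a least-squares fit of log(y) on log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    # With a single predictor, R^2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)

# Synthetic values for illustration only.
lines_changed = [40, 120, 300, 900, 2500]
human_hours = [3.0, 1.5, 6.0, 2.0, 8.0]
print(round(r_squared_log(lines_changed, human_hours), 2))
```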

Why it works

Engineering value is already denominated in human time — salaries, contractor rates, and estimates all use hours. Converting agent output back into hours makes ROI directly comparable to the instruments finance and headcount planning already run (Cognition, 2026-06-04). The mechanism is denominator alignment, not ground-truth measurement: it speaks the language of the decisions it informs (renew the seat, raise the cap, hire instead).

The denominator is urgent now. Agentic workloads carry 58.9% of token volume on Vercel's AI Gateway, up from 31.6% six months earlier — tool-using requests are ~2.6× more token-heavy than the rest (Vercel AI Gateway production index, 2026-05-12). Uber capped employees at $1,500/month per agentic coding tool after burning the annual AI budget in four months (TechCrunch, 2026-06-02). Token spend has a denominator; agent output, until now, did not.

When this backfires

The metric estimates counterfactual human time. Every failure mode below traces back to that one property.

  • High-context maintenance on familiar codebases. A randomized controlled trial of experienced open-source developers measured a 19% slowdown with AI tools while developers still reported a 20% speedup — a 39-point perception gap (METR, 2025-07-10). Cognition's model is calibrated against user reports and its corrected estimates still sit 1.4× below those reports (Cognition, 2026-06-04) — consistent with self-report inflation, not independent of it. Pair with an observed downstream signal; the productivity-experience paradox is the warning that perception and reality diverge here.
  • Downstream cost can absorb the gain. AI-assisted teams complete 21% more tasks and merge 98% more PRs while PR review time rises 91% — the bottleneck migrates (Osmani, 2025). An hours-saved figure that ignores review time spent is half a ledger.
  • Task selection bias inflates apparent value. Agents get the tasks they are best at; the reference human is then estimated for tasks pre-selected to favor the agent. Compare baselines on stratified task mixes, not aggregate counts.
  • Small-team noise floor. At r_log = 0.74, ~50% of per-session estimates fall within a factor of 2. A 10-session month sits inside that band; reading a 30% month-over-month change as signal is reading noise.
  • Greenfield work has no stable reference. "How long would a human have taken?" assumes a stable counterfactual. For novel problems with no comparable human baseline, the denominator is fabricated and the hours figure is no more grounded than an opinion.
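
The small-team noise floor above can be made concrete with a back-of-envelope calculation. This sketch leans on strong assumptions that are not in the source: per-session log-errors are treated as independent and normally distributed, and the aggregate is treated as a geometric mean rather than a sum of hours. Under those assumptions, "50% within a factor of 2" fixes the per-session spread, and averaging n sessions shrinks it by the square root of n.

```python
import math
from statistics import NormalDist

# Source claim: ~50% of per-session estimates fall within a factor of 2.
# ASSUMPTION: normal, independent log-errors, so the median absolute
# log-error equals ln(2) and sits at the 75th percentile of the normal.
Z_75 = NormalDist().inv_cdf(0.75)
SIGMA_LOG = math.log(2) / Z_75  # implied per-session log-error standard deviation

def typical_error_factor(n_sessions: int) -> float:
    """One-sigma multiplicative error of a geometric-mean estimate over n sessions."""
    return math.exp(SIGMA_LOG / math.sqrt(n_sessions))

def mom_delta_factor(n_sessions: int) -> float:
    """One-sigma multiplicative noise on the ratio of two months of n sessions each."""
    return math.exp(SIGMA_LOG * math.sqrt(2.0 / n_sessions))

for n in (10, 20, 100):
    print(n, round(typical_error_factor(n), 2), round(mom_delta_factor(n), 2))
```

Under these assumptions a 10-session month carries a one-sigma error factor of roughly 1.4, so a 30% month-over-month change is not distinguishable from noise.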

Example

A platform team runs Devin and Claude Code across two months, defending (or cancelling) a $1,500/seat agentic-coding budget.

Before — counting completions:

Month 1: 47 PRs merged, 12,400 lines changed, $9,200 spend
Month 2: 51 PRs merged, 11,800 lines changed, $11,400 spend

PR count rises, lines changed fall, and spend rises faster than either; the conversation stalls on whether 47 PRs are "worth" $9,200.

After — denominating in human-equivalent hours, with downstream signals:

Month 1: 47 PRs merged → 184 estimated human-hours
         PR review time: 38h spent; defect-rate flat
         Implied rate: $9,200 / 184h = $50/h
Month 2: 51 PRs merged → 201 estimated human-hours
         PR review time: 61h spent; defect-rate flat
         Implied rate: $11,400 / 201h = $57/h

The estimate is calibrated on PR-merged sessions only (Cognition's gate). The implied $/h is now directly comparable to the team's loaded hourly rate. Review time is the observed signal that anchors the estimate: it rose from 38h to 61h (about 61%) while estimated human-hours rose from 184 to 201 (about 9%). When review time rises faster than agent hours, part of the gain is being absorbed downstream, and the report must say so.
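
The arithmetic in this example can be reproduced with a short script. The figures are the ones given above; the dictionary layout and function names are hypothetical, chosen only for illustration.

```python
# Figures from the worked example above.
months = {
    "Month 1": {"spend": 9200, "est_hours": 184, "review_hours": 38},
    "Month 2": {"spend": 11400, "est_hours": 201, "review_hours": 61},
}

def implied_rate(month):
    """Dollars of agent spend per estimated human-equivalent hour."""
    return month["spend"] / month["est_hours"]

def growth(before, after):
    """Fractional month-over-month change."""
    return (after - before) / before

m1, m2 = months["Month 1"], months["Month 2"]
hours_growth = growth(m1["est_hours"], m2["est_hours"])
review_growth = growth(m1["review_hours"], m2["review_hours"])

for name, month in months.items():
    print(f"{name}: ${implied_rate(month):.0f}/h")
print(f"estimated hours: {hours_growth:+.1%}, review time: {review_growth:+.1%}")

# Flag months where the observed downstream cost outpaces the estimated gain.
if review_growth > hours_growth:
    print("review time is growing faster than estimated hours")
```

Comparing the two growth rates side by side is one simple way to keep the observed signal attached to the estimate rather than reporting the hours figure alone.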

Key Takeaways

  • The metric: how long would a human engineer have taken to produce the same output — chosen because engineering value is already denominated in hours (Cognition, 2026-06-04).
  • Cognition's calibration: h = 2.28 × m^0.923 (or a simplified 2.08× constant) fit on 258 sessions across 126 users; r_log = 0.74 on held-out data; ~50% within a factor of 2.
  • Code volume is empirically rejected as the metric — lines-changed regression returns R²_log = 0.27 against human estimates.
  • Anti-gaming: include sessions only if a PR merges; for non-PR sessions, run an unproductive-session classifier (drops 1–20%).
  • The signal inherits self-report inflation. Pair with an observed downstream signal (PR review time, defect rate) — METR's RCT measured a 19% slowdown while developers perceived a 20% speedup (METR, 2025-07-10).
  • Apply only on PR-gated, multi-month aggregates with a paired downstream signal; small samples sit inside the noise floor, and greenfield work has no stable reference.