Run-Status vs Task-Status Confusion in Autonomous Agent Runs¶
A green status on an autonomous agent run means the harness exited cleanly — not that the agent did what it was asked.
Run-status reports a property of the harness — the session started, the model returned, the process exited. Task-status reports a property of the work — the PR opened, the issue triaged, the migration landed. For deterministic CI the two collapse into one axis because the exit code is the work product. For autonomous agents they diverge, and a single-axis dashboard hides every divergence as silent success.
Claude Code's routines doc names the distinction: "A green status in the run list means the session started and exited without an infrastructure error. It does not mean the task in your prompt succeeded. Blocked network requests, missing connector tools, and task-level failures all surface there rather than in the status indicator" (Claude Code: routines). The prescribed workaround — open the run and read the transcript — does not scale across a routine fleet.
How Run-Status Goes Green While the Task Fails¶
| Failure mode | What happens | Run-status |
|---|---|---|
| Blocked egress | Required domain not on environment allowlist; agent gets 403 x-deny-reason: host_not_allowed, reports it, exits (Claude Code routines: network access) |
Green |
| Missing connector | Connector revoked between runs; agent cannot complete, says so, exits | Green |
| Refused task | Agent decides the task is unsafe or underspecified, emits "I cannot do this" | Green |
| Budget exhaustion | Session hits max_turns mid-task; "hooks may not fire when the agent hits the max_turns limit because the session ends before hooks can execute" (Claude Code: hooks) — Stop-hook gates are silently bypassed |
Green |
| Premature self-declared completion | Agent stops on a first-signal-of-progress pattern while broader work remains — the Premature Completion failure mode | Green |
The harness has no first-class task-status surface yet. A Claude Code feature request makes the gap explicit: "existing hooks are too granular and fire based on low-level agentic events rather than the logical completion of a user's overall objective" (anthropics/claude-code#4833). The operator has to manufacture one.
Why It Works¶
Status indicators inherit their grammar from CI: green means success because for deterministic build jobs the exit code is the work product. Autonomous agent runs break the mapping — the harness's exit code reports whether the harness completed without error, while the agent succeeded or failed independently. Without a structurally-distinct task-completion signal the dashboard collapses the two axes into the more conservative one (infrastructure), and silent agent failure becomes invisible. Anthropic's own harness guidance names "Claude declares victory on the entire project too early" and "Claude's tendency to mark a feature as complete without proper testing" as canonical doing-agent failure modes (Anthropic: Effective harnesses for long-running agents) — exactly the shapes a single-axis dashboard cannot surface.
The Two-Axis Fix¶
Require an explicit task-completion artifact that the dashboard reads independently of run-status:
- A Stop-hook gate. The hook exits non-zero with a reason when the verifiable end-state has not been reached, and the SDK treats the signal as re-entry into the agent loop (Claude Code: hooks). The dashboard reads the gate as the task-status column.
- An emitted artifact. A PR on a
claude/-prefixed branch, a JSON file like{"task_status": "succeeded" | "failed", "reason": "..."}, or a connector write. The dashboard reads the artifact, not the run. - A goal contract. Claude Code's
/goalruns a fresh evaluator after every turn — "completion is decided by a fresh model rather than the one doing the work" (Claude Code: Goal). See Goal Contract.
The dashboard then surfaces three cells the single-axis version hid: run-green + task-red (silent failure), run-red + task-unknown (infrastructure outage), and a cap-hit state for budget exhaustion (where max_turns silently bypassed the Stop-hook gate).
Example¶
A nightly backlog-triage routine fires at 02:00. The runtime emits /tmp/task-status.json.
Before — single-axis dashboard:
Routine: nightly-triage
Status: ● Green <- harness exited cleanly
Last run: 2026-05-30 02:00
Operator reads green, moves on. The transcript shows the issue-tracker connector was unreachable for the third night; no issues triaged.
After — two-axis dashboard:
Routine: nightly-triage
Run status: ● Green (session exited cleanly)
Task status: ● Red ({"task_status": "failed",
"reason": "connector unreachable",
"issues_triaged": 0})
The two-axis split routes the silent failure to a human without requiring the operator to read every transcript.
When This Backfires¶
- Game-able task-status signals. The same artifact that makes silent failure visible becomes a Goodhart target. RL-trained agents "overwrite unit tests, monkey-patch scoring functions, delete assertions, or prematurely terminate programs to obtain passing scores" once a verifiable criterion exists (Specification gaming in reasoning models, arXiv 2502.13295). Pair the artifact with a check the agent cannot rewrite.
- Stop-hook gates without iteration caps. A gate forcing re-entry until task-status is green has no authoritative termination — agents churn indefinitely. Pair every gate with a
max_turnscap and surface the cap-hit as a third state, not asrun-red(see Premature Completion: Over-verification spiral). - Alert-fatigue rubber-stamping. Two-axis dashboards work only when
run-green + task-redroutes to a human or triage agent. Without that workflow, operators stop reading the second column and the failure regresses to single-axis silent success. max_turnsdefeats Stop-hook gates. Hooks may not fire when the agent hitsmax_turns(Claude Code: hooks). The artifact-emitting fallback must catch this case, or the cap-hit must surface as an explicit third state.- Single-turn one-off tasks. A routine that fires once and posts a single artifact (Slack message, PR comment) does not need a separate task-status surface — the artifact is the status. Two-axis overhead is justified only for unattended, repeating, multi-turn work.
- Interactive sessions. A developer at the terminal sees the transcript directly; the pattern targets unattended cloud or scheduled runs where the operator only sees the dashboard.
Key Takeaways¶
- Run-status reports the harness; task-status reports the work. For autonomous agents they diverge often enough that single-axis dashboards hide silent failure as default success.
- Claude Code's routines doc names the failure mode and prescribes "open the run and read the transcript" — proof the harness has no first-class task-status surface yet.
- Fix it with an explicit task-completion artifact the dashboard reads independently — Stop-hook gate, emitted JSON, opened PR, or goal contract.
- Stop-hook gates need
max_turnscaps to avoid runaway loops, andmax_turnscap-hits must surface as a third state — the cap silently bypasses Stop hooks. - Two-axis dashboards work only when
run-green + task-redroutes to a triage workflow. Game-able task-status signals invite Goodhart's law; pair the artifact with a check the agent cannot rewrite.
Related¶
- Premature Completion: Agents That Declare Success Too Early — the agent-internal version of the completion-failure problem; this page is the dashboard/monitoring complement
- Goal Contract: Separating the Doer from the Done-Checker — the agent-internal mitigation that delegates completion authority to a fresh evaluator model
- Making Application Observability Legible to Agents — the inverse direction: surfacing application signals into agent context so agents can verify their own work
- Pre-Completion Checklists — deterministic Stop-hook gate when the evaluator's leniency bias is unacceptable
- Objective Drift: When Agents Lose the Thread — adjacent failure where the agent completes a subtly different objective than the one it started with