The Consistent Capability Fallacy

Capability on one task does not predict capability on a similar-seeming task: LLM performance is jagged, not consistent.

The belief

When a model handles one complex task well, people generalize: "this model is good at this kind of problem." They then raise the autonomy level for later tasks that look related and skip per-task verification. They are surprised when the model fails a task that seems simpler than one it already passed.

Why it fails

LLM capability is jagged, not smooth. A model may solve an International Mathematical Olympiad problem yet fail at multi-digit long division. The Olympiad problem resembles patterns that are well represented in training data. Long division requires executing an algorithm step by step, and the model only approximates that process.
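
Because long division has an exact answer, a model's result can be checked deterministically instead of trusted. The following is a minimal sketch, assuming non-negative integer inputs; the function name check_division and the sample numbers are hypothetical, chosen only for illustration.

```python
def check_division(dividend: int, divisor: int,
                   claimed_quotient: int, claimed_remainder: int) -> bool:
    """Return True only if the claimed quotient and remainder are exact."""
    if divisor == 0:
        raise ValueError("divisor must be non-zero")
    # Recompute the result exactly instead of trusting the model's steps.
    quotient, remainder = divmod(dividend, divisor)
    return (claimed_quotient, claimed_remainder) == (quotient, remainder)


# A correct claim passes; a plausible-looking claim with two digits swapped fails.
print(check_division(987654321, 12345, 80004, 4941))  # True
print(check_division(987654321, 12345, 80004, 4914))  # False
```

The second call shows why plausibility is not enough: the wrong remainder looks just as reasonable as the right one.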

Training data distribution sets the model's performance profile, not any general "skill level." Difficulty as a human perceives it is a poor predictor of difficulty for the model. A task that looks harder to a human is sometimes easier for the model, and a task that looks easy is sometimes harder.

Evidence of the failure mode:

Small changes to prompt wording can swing accuracy by around 15%. Inputs that mean the same thing to a human do not yield stable output.
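
One way to see this sensitivity on your own tasks is to run the same test cases through several paraphrases of a prompt and compare accuracy. The sketch below is illustrative, not a standard benchmark harness: ask_model is a hypothetical callable that wraps whatever model client you use, and exact string matching is an assumed scoring rule.

```python
from typing import Callable, Sequence


def accuracy_spread(
    ask_model: Callable[[str], str],
    paraphrases: Sequence[str],
    cases: Sequence[tuple[str, str]],
) -> tuple[float, float]:
    """Return (lowest, highest) accuracy across prompt paraphrases.

    Each paraphrase is a template containing "{question}".
    Each case is a (question, expected_answer) pair.
    """
    accuracies = []
    for template in paraphrases:
        # Count exact matches for this wording of the prompt.
        correct = sum(
            ask_model(template.format(question=question)).strip() == expected
            for question, expected in cases
        )
        accuracies.append(correct / len(cases))
    return min(accuracies), max(accuracies)
```

A wide gap between the two returned values suggests that a success observed under one wording says little about behavior under another.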

The compounding risk

The natural language interface masks failures. Models produce confident, plausible-looking output even when the reasoning behind it is wrong. This creates false confidence in both the current output and the model's general reliability. The main danger is not a model that fails obviously. It is people who overestimate capability because they saw it succeed.

Example

A team delegates a complex architectural refactor to Claude Code. The model handles it well and restructures several services with correct dependency handling. Encouraged, the team delegates a "simpler" task the next sprint: updating multi-step data validation logic across a module. This task fails silently. The model carries an incorrect assumption through every updated path, and the output looks plausible. No one checks because the model "already proved itself" on a harder task.

The architectural task was well represented in training patterns. The validation logic needed algorithmic precision the model approximated badly. From the model's point of view, these were not similar tasks.

When this backfires

Treating every task as an independent capability question adds overhead. That overhead is not worth it when:

  • The task class is narrow and well understood. For highly repetitive, formulaic work, such as generating boilerplate CRUD endpoints against a fixed schema, repeated success shows the training distribution covers the pattern well. Per-task verification still helps, but calibrating autonomy from prior runs is reasonable.
  • The domain has high benchmark saturation. Tasks that appear word for word or structurally in widely used public benchmarks, such as standard algorithm implementations and common regex patterns, perform more stably than tasks in unseen problem spaces. The jaggedness is real, but not uniform across all task types.
  • Verification cost exceeds failure cost. For low-stakes work that is easy to revert, exhaustive re-verification can slow delivery more than an occasional failure costs. Scale verification effort to risk, and weigh this pattern's guidance against your verification budget.
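
The cost trade-off in the last bullet can be sketched as an expected-cost comparison. This is a simplification under stated assumptions: verification is assumed to catch every failure, costs are in any consistent unit, and the function name should_verify and the numbers are hypothetical. The failure probability should be estimated from runs of the same task type, not from the model's record on other tasks.

```python
def should_verify(failure_probability: float,
                  failure_cost: float,
                  verification_cost: float) -> bool:
    """Verify when the expected loss from skipping exceeds the cost of checking."""
    if not 0.0 <= failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    expected_failure_cost = failure_probability * failure_cost
    return expected_failure_cost > verification_cost


# Easily reverted change: an occasional failure is cheaper than checking every time.
print(should_verify(0.05, failure_cost=10, verification_cost=2))   # False
# Silent data corruption: the same failure rate now justifies verification.
print(should_verify(0.05, failure_cost=500, verification_cost=2))  # True
```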

The fallacy is most dangerous for tasks that look familiar but need compositional reasoning the model has not practiced in exactly that combination.

Key Takeaways

  • Capability on task A does not predict capability on task B, even when A and B appear related to a human observer.
  • Calibrate autonomy level per task, not per session or per model version.
  • Treat each new task type as an independent capability question: verify before raising autonomy.
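
The takeaways above can be sketched as a small ledger that records verified outcomes per task type, so that success on one type never raises autonomy for another. This is a minimal illustration: the class name AutonomyLedger, the autonomy labels, and the thresholds (5 verified runs, 90% pass rate) are assumptions, not recommended values.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AutonomyLedger:
    """Tracks verified outcomes per task type; thresholds are illustrative."""
    min_verified_runs: int = 5
    min_pass_rate: float = 0.9
    _results: dict[str, list[bool]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def record(self, task_type: str, passed: bool) -> None:
        """Store the result of one human-verified run."""
        self._results[task_type].append(passed)

    def autonomy(self, task_type: str) -> str:
        """Return the autonomy level earned by this task type alone."""
        results = self._results.get(task_type, [])
        if len(results) < self.min_verified_runs:
            return "verify-every-output"
        pass_rate = sum(results) / len(results)
        if pass_rate >= self.min_pass_rate:
            return "spot-check"
        return "verify-every-output"


ledger = AutonomyLedger()
for _ in range(5):
    ledger.record("architectural-refactor", True)

print(ledger.autonomy("architectural-refactor"))  # spot-check
print(ledger.autonomy("validation-logic"))        # verify-every-output
```

Keying the ledger by task type rather than by model or session is the design choice that matters here: the untested validation task starts at full verification no matter how well the refactor went.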