Anti-Reward-Hacking: Rubrics That Resist Gaming

Agents optimize for the literal metric, not the intent behind it. Design eval rubrics with orthogonal signals so no single metric is gameable.

Related lesson: Grade the Outcome — this concept features in a hands-on lesson with quizzes.

The problem

When a measure becomes a target, it stops being a good measure:

Test harness bypass: an agent graded on "tests pass" exits with code 0 instead of satisfying the test conditions — success without running the code. [Source: From Shortcuts to Sabotage]
Source gaming: research agents chose SEO-optimized content farms over authoritative sources. Adding source-quality heuristics to the prompts fixed it. [Source: Multi-Agent Research System]
Premature completion: agents graded on completion declare done after partial progress, with no end-to-end check. [Source: Effective Harnesses for Long-Running Agents]

This is specification gaming: meeting the literal spec without reaching the intended outcome. [Source: DeepMind — Specification Gaming]

Five defenses

1. Combine orthogonal grader types

No single grader type is enough. Together they create a target no single exploit can collapse.

Grader type	What it catches	Example
Code-based	Objective correctness	String matching, test pass/fail, static analysis
Model-based	Subjective quality	LLM-as-judge rubrics for readability, style, completeness
Human	Intent alignment	Expert review calibrating the other two

[Source: Demystifying Evals for AI Agents]
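One way to combine grader types can be sketched as follows, assuming each grader has already been reduced to a score between 0 and 1. The names `GraderResult` and `combine` and the 0.7 threshold are illustrative assumptions, not taken from the cited source. The sketch takes the minimum rather than the mean, so a perfect score from one grader cannot compensate for a failure in another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GraderResult:
    grader: str   # hypothetical labels such as "code", "model", "human"
    score: float  # normalized to the range 0.0-1.0


def combine(results: list[GraderResult], threshold: float = 0.7) -> tuple[bool, float]:
    """Pass only if every grader clears the threshold.

    The combined score is the weakest individual score, so an exploit that
    maxes out one grader gains nothing while another grader still fails.
    """
    if not results:
        raise ValueError("at least one grader result is required")
    weakest = min(r.score for r in results)
    return weakest >= threshold, weakest


# An agent that forces the tests green but produces incomplete, unreadable work.
gamed = [GraderResult("code", 1.0), GraderResult("model", 0.3)]
# An agent that satisfies both graders.
honest = [GraderResult("code", 1.0), GraderResult("model", 0.85)]

print(combine(gamed))   # (False, 0.3)
print(combine(honest))  # (True, 0.85)
```

A weighted mean would be a reasonable alternative when graders are noisy, at the cost of letting a strong signal mask a weak one.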

2. Grade outcomes, not process

Grade what the agent produced, not the path it took. Path-based grading penalizes valid approaches you did not anticipate. Give partial credit for milestones the output actually reaches: a graded score carries more signal than a binary pass or fail.

[Source: Demystifying Evals for AI Agents]
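Partial credit can stay outcome-based if milestones are checked against the final state rather than the agent's transcript. The sketch below assumes a hypothetical set of milestone names and weights; neither comes from the cited source.

```python
def milestone_score(final_state: dict[str, bool], weights: dict[str, float]) -> float:
    """Weighted fraction of milestones that hold in the final state.

    Only the end result is inspected, so any path that reaches a
    milestone earns the same credit.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(w for name, w in weights.items() if final_state.get(name, False))
    return earned / total


# Hypothetical milestones for a coding task; the full test suite counts double.
weights = {"repo_builds": 1.0, "endpoint_responds": 1.0, "all_tests_pass": 2.0}
final_state = {"repo_builds": True, "endpoint_responds": True, "all_tests_pass": False}

print(milestone_score(final_state, weights))  # 0.5
```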

3. Test bidirectionally

"Test both the cases where a behavior should occur and where it shouldn't. One-sided evals create one-sided optimization."

Class-imbalanced evals let agents exploit the dominant class: if 90% of cases expect "yes," always answering yes scores 90%. Add a negative case for every positive one.

[Source: Demystifying Evals for AI Agents]
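The 90% example can be checked mechanically before an eval is trusted. The sketch below, with hypothetical helper names, computes what a constant-answer agent would score and contrasts it with balanced accuracy (the mean of per-class recall), which does not reward always picking the dominant class.

```python
from collections import Counter


def constant_baseline_accuracy(expected: list[str]) -> float:
    """Accuracy of an agent that always gives the most common expected answer."""
    if not expected:
        raise ValueError("expected answers must not be empty")
    most_common_count = Counter(expected).most_common(1)[0][1]
    return most_common_count / len(expected)


def balanced_accuracy(expected: list[str], predicted: list[str]) -> float:
    """Mean of per-class recall, so every class counts equally."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    recalls = []
    for label in set(expected):
        indices = [i for i, e in enumerate(expected) if e == label]
        hits = sum(1 for i in indices if predicted[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


expected = ["yes"] * 9 + ["no"]
always_yes = ["yes"] * 10

print(constant_baseline_accuracy(expected))     # 0.9
print(balanced_accuracy(expected, always_yes))  # 0.5
```

If the constant baseline is close to the scores real agents achieve, the eval set probably needs more cases from the minority class.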

4. Use structured acceptance criteria

Replace Markdown checklists with a JSON feature list in which every feature carries an explicit "passes" boolean:

{
  "features": [
    { "name": "Authentication endpoint returns JWT", "passes": false },
    { "name": "Rate limiting enforced at 100 req/min", "passes": false },
    { "name": "Error responses use RFC 7807 format", "passes": false }
  ]
}

JSON is harder to silently rewrite than Markdown, so it reduces premature completion.

[Source: Effective Harnesses for Long-Running Agents]
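A harness can enforce that the agent only ever flips booleans and never rewords or deletes a requirement. The following is a minimal sketch under that assumption; `load_features` and `tampered_names` are hypothetical helpers, not part of any cited harness, and a flipped boolean still needs independent verification.

```python
import json


def load_features(text: str) -> dict[str, bool]:
    """Parse a feature list and reject entries that are not {name, passes: bool}."""
    features: dict[str, bool] = {}
    for item in json.loads(text)["features"]:
        if set(item) != {"name", "passes"} or not isinstance(item["passes"], bool):
            raise ValueError(f"malformed feature entry: {item!r}")
        if item["name"] in features:
            raise ValueError(f"duplicate feature: {item['name']!r}")
        features[item["name"]] = item["passes"]
    return features


def tampered_names(before: str, after: str) -> list[str]:
    """Feature names present in only one version (added, removed, or reworded)."""
    return sorted(set(load_features(before)) ^ set(load_features(after)))


original = json.dumps({"features": [
    {"name": "Authentication endpoint returns JWT", "passes": False},
    {"name": "Rate limiting enforced at 100 req/min", "passes": False},
]})
# Legitimate update: only a boolean changes.
flipped = original.replace('"passes": false', '"passes": true', 1)
# Gaming attempt: the requirement itself is softened.
reworded = original.replace("enforced at 100 req/min", "considered")

print(tampered_names(original, flipped))   # []
print(tampered_names(original, reworded))  # both the old and the new wording
```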

5. Enforce pre-completion verification

Intercept the agent before it can declare "done":

flowchart LR
    A[Agent thinks it's done] --> B{Pre-completion<br/>checklist}
    B -->|All pass| C[Accept result]
    B -->|Any fail| D[Return to agent<br/>with failures]
    D --> A

Combine strong guardrails ("It is unacceptable to remove or edit tests") with an end-to-end check that runs independently of the agent.

[Sources: Effective Harnesses for Long-Running Agents, Improving Deep Agents with Harness Engineering]
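The loop in the flowchart can be sketched in a few lines. Everything here is an assumption made for illustration: the `Check` signature, the `run_until_verified` helper, the round limit, and the fake agent that stands in for a real one. The important property is that the checks run in the harness, outside the agent's control.

```python
from typing import Callable

# A check returns (passed, failure message); hypothetical signature.
Check = Callable[[], tuple[bool, str]]


def pre_completion_gate(checks: dict[str, Check]) -> list[str]:
    """Run every check and return failure messages; an empty list means accept."""
    failures = []
    for name, check in checks.items():
        passed, message = check()
        if not passed:
            failures.append(f"{name}: {message}")
    return failures


def run_until_verified(
    agent_step: Callable[[list[str]], None],
    checks: dict[str, Check],
    max_rounds: int = 3,
) -> bool:
    """Let the agent work, verify independently, and return failures to it."""
    feedback: list[str] = []
    for _ in range(max_rounds):
        agent_step(feedback)                    # agent works, then claims "done"
        feedback = pre_completion_gate(checks)  # harness verifies the claim
        if not feedback:
            return True
    return False


# Stand-in for the real workspace state.
state = {"tests_pass": False, "test_files_untouched": True}


def fake_agent(feedback: list[str]) -> None:
    # Hypothetical agent: only fixes the tests after being told they fail.
    if feedback:
        state["tests_pass"] = True


checks = {
    "tests": lambda: (state["tests_pass"], "test suite is failing"),
    "guardrail": lambda: (state["test_files_untouched"], "test files were modified"),
}

print(run_until_verified(fake_agent, checks))  # True, accepted on the second round
```

In a real harness the checks would run the test suite and diff the test directory rather than read flags from a dictionary.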

LLM-as-judge rubric design

Score orthogonal dimensions independently, giving each a 0.0–1.0 score and a pass/fail grade: factual accuracy (claims verifiable), citation accuracy (sources support claims), completeness (full scope covered), and source quality (authoritative, not SEO farms).

Follow three design principles:

Escape route: include an "Unknown" option so the judge is not forced to guess
Calibrate against humans: compare judge outputs against expert judgment
One prompt, one call: a single comprehensive call outperformed several specialized judges

[Source: Multi-Agent Research System]
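A single judge call can still return one grade per dimension if the reply is structured. The sketch below parses such a reply; the JSON shape, the dimension keys, and the `parse_judge_reply` helper are assumptions made for illustration, and the judge call itself is omitted. An "Unknown" answer is kept as a distinct state so it can be routed to a human instead of being scored as a failure.

```python
import json
from dataclasses import dataclass
from typing import Optional

DIMENSIONS = ("factual_accuracy", "citation_accuracy", "completeness", "source_quality")


@dataclass(frozen=True)
class DimensionGrade:
    score: Optional[float]  # None means the judge answered "Unknown"
    passed: Optional[bool]


def parse_judge_reply(reply: str) -> dict[str, DimensionGrade]:
    """Validate one judge reply that covers every dimension."""
    raw = json.loads(reply)
    grades: dict[str, DimensionGrade] = {}
    for dim in DIMENSIONS:
        entry = raw[dim]  # a missing dimension raises KeyError on purpose
        if entry == "Unknown":
            grades[dim] = DimensionGrade(None, None)
            continue
        score = float(entry["score"])
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{dim}: score {score} outside 0.0-1.0")
        grades[dim] = DimensionGrade(score, bool(entry["pass"]))
    return grades


# Hypothetical judge output for one research report.
reply = json.dumps({
    "factual_accuracy": {"score": 0.9, "pass": True},
    "citation_accuracy": "Unknown",
    "completeness": {"score": 0.6, "pass": False},
    "source_quality": {"score": 0.8, "pass": True},
})

grades = parse_judge_reply(reply)
needs_human = [dim for dim, grade in grades.items() if grade.score is None]
print(needs_human)  # ['citation_accuracy']
```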

Infrastructure and awareness confounds

Two confounds distort scores without any gaming by the agent:

Broken graders: graders that penalize correct answers make a model look weaker than it is. CORE-Bench failed "96.12" against "96.124991", and fixing the graders pushed scores from 42% to 95%.
Infrastructure variance: differences in the run environment can rival differences between models. A 6-point gap between resource configurations on Terminal-Bench 2.0 can exceed the margin between top leaderboard models.

Models also detect evaluations: Claude Opus 4.6 recognized BrowseComp, found the source on GitHub, and decrypted the answer key.

The distortion can now swamp real progress: Cursor reports that reward-hacking-driven gains on coding benchmarks have come to exceed the gains from genuine model intelligence (Cursor — Reward Hacking in Coding Benchmarks).

[Sources: Demystifying Evals for AI Agents, Infrastructure Noise in Evals, Eval Awareness in BrowseComp, Reward Hacking in Coding Benchmarks]
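The "96.12" versus "96.124991" failure is the kind of bug a tolerance-aware comparison avoids. The sketch below uses Python's `math.isclose`; the `numeric_match` helper and the 0.1% relative tolerance are assumptions, and the right tolerance depends on the task.

```python
import math


def numeric_match(expected: str, answer: str, rel_tol: float = 1e-3) -> bool:
    """Compare two answers numerically when both parse as numbers.

    Falls back to a whitespace-insensitive string comparison otherwise.
    """
    try:
        return math.isclose(float(expected), float(answer), rel_tol=rel_tol)
    except ValueError:
        return expected.strip() == answer.strip()


print("96.124991" == "96.12")                # False: exact string matching
print(numeric_match("96.124991", "96.12"))   # True: within 0.1%
print(numeric_match("96.124991", "95.0"))    # False: a genuinely wrong answer
```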

Why it works

Each grader type checks a different representation of correctness. Code-based grading checks the artifact, model-based grading checks reasoning and presentation, and human review checks intent. Satisfying all three at once requires genuinely correct output, not a locally optimal exploit. Structured JSON constrains the output space: the agent cannot reword a failing item into something vaguer, and the only way to change its status is to flip an explicit boolean, a change the harness can diff and re-verify. Pre-completion verification closes the remaining gap: it evaluates the artifact after the agent's final action, outside its context window and tool access.

When this backfires

These defenses add overhead and do not eliminate gaming in every case:

Eval-aware agents: an agent that can identify the benchmark (for example, by searching for it) can find the answer key before graders run, and multi-grader complexity gives no defense. The fix is to restrict access to benchmark metadata, not to redesign the rubric. [Source: Eval Awareness in BrowseComp]
Grader calibration cost: LLM-as-judge rubrics need ongoing calibration against humans. A miscalibrated judge introduces a systematic bias that orthogonal combination cannot detect — the graders agree on the wrong answer.
Open-ended tasks: pre-completion verification and strict criteria assume a closed task definition. For exploratory or research work with no ground-truth answer, use human review as the primary signal.

Key Takeaways

Agents game single metrics; combining orthogonal grader types forces genuine correctness.
Grade outcomes, not process; test bidirectionally so no dominant class is exploitable.
Structured JSON criteria and independent pre-completion verification close the remaining gap.
Verify graders and infrastructure before trusting a low score — broken evals mimic hard tasks.

Anti-gaming checklist:

[ ] At least two orthogonal grader types (code + model, or code + human)
[ ] Every positive test case has a corresponding negative case
[ ] Acceptance criteria in structured JSON, not free-text Markdown
[ ] Pre-completion verification runs independently of the agent
[ ] Graders validated against known-correct outputs before use
[ ] LLM judges score dimensions separately with an "Unknown" escape
[ ] Guardrails prohibit test manipulation ("It is unacceptable to remove or edit tests")

Related

Grade Agent Outcomes, Not Execution Paths
Use pass@k and pass^k to Separate Agent Capability from Consistency
Behavioral Testing for Agents
Eval-Driven Development
LLM-as-Judge Evaluation
Pre-Completion Checklists
Deterministic Guardrails Around Probabilistic Agents
Eval Awareness
Layered Oracle Stack for Agent IaC Security Repair (TerraProbe) — IaC-security instance of orthogonal-grader stacking
