Verification

How to measure agent output quality, design evaluation suites, and use evals to drive development.

Measuring Quality

Behavioral Testing

Regression Testing

Eval-Driven Development

Review Techniques

Rubric Design

Guardrails

  • Deterministic Guardrails Around Probabilistic Agents — Wrap agent output in hard, deterministic checks — linting, schema validation, CI gates — that enforce correctness regardless of what the agent produces
  • Staged Evidence Gates for Agentic Program Repair — Order cheap evidence gates ahead of expensive ones in agentic repair loops — retrieval-grounded context, compile gate, target-test gate, then full regression — to filter invalid candidates before paying full-suite cost
  • Execution Budgeting in Agentic Program Repair — Cap test executions in generate-run-revise loops on frontier commercial agents: prohibiting execution drops resolve rate by about 1.25 percentage points on SWE-bench Verified. The guidance applies only under specific preconditions on model strength, codebase familiarity, and execution cost
  • Dependency Gap Validation for AI-Generated Code — AI coding agents declare a fraction of the dependencies their code actually needs at runtime — validate in clean environments before trusting the manifest
  • Phantom Symbol Detection for LLM API Migration — Verify symbols in LLM-generated migration code against a documentation-derived knowledge base — a deterministic check that catches fabricated imports, constructors, and methods that probabilistic judges miss
  • Generative Provenance Records for Tool-Using Agents — Emit a structured record (tool turn, evidence span, relation) alongside each output sentence so a mechanical verifier can check claim-level grounding before the answer leaves the loop
  • Defense-in-Depth Against Coding Agent Fabrication (Honesty Harness) — Four uncorrelated layers — instruction-level honesty rules, verify-before-write, real-time hooks that feed output back, and an external-tool fact-checker subagent — that reduce fabrication survival without claiming elimination
  • Layered Oracle Stack for Agent IaC Security Repair (TerraProbe) — Stack scanner-pass, full-scanner, validate, plan, and plan-diff oracles so LLM-generated infrastructure-as-code security fixes have to clear behavioral checks: first-pass agent repairs cleared the targeted Checkov finding 83.3 percent of the time, but 71.4 percent of plan-compared repairs were deceptive fixes
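
The cheap-before-expensive ordering described in the staged evidence gates entry can be sketched roughly as follows. This is a minimal illustration, not the method from the linked page: the `Gate` type, the `run_gates` helper, the relative cost numbers, and the toy `add` function under repair are all hypothetical, and the retrieval-grounded context step is left out. The point is only that a candidate failing the compile gate never pays for the test gates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Gate:
    name: str
    cost: float                    # hypothetical relative cost units
    check: Callable[[str], bool]   # returns True when the candidate passes


def compiles(src: str) -> bool:
    """Compile gate: reject candidates that are not valid Python syntax."""
    try:
        compile(src, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def _load_add(src: str):
    # Assumption: candidates define a function named `add`.
    # A real harness would run untrusted code in a sandbox, not via exec().
    namespace: dict = {}
    exec(src, namespace)
    return namespace["add"]


def target_test(src: str) -> bool:
    """Target-test gate: only the test that reproduces the reported bug."""
    try:
        return _load_add(src)(2, 3) == 5
    except Exception:
        return False


def full_regression(src: str) -> bool:
    """Stand-in for the full suite; in practice this is the expensive step."""
    try:
        add = _load_add(src)
        return all(add(a, b) == a + b for a, b in [(0, 0), (-1, 1), (2, 3)])
    except Exception:
        return False


GATES = [
    Gate("full-regression", 100.0, full_regression),
    Gate("compile", 1.0, compiles),
    Gate("target-test", 10.0, target_test),
]


def run_gates(candidate: str, gates: List[Gate]) -> Tuple[bool, Optional[str], float]:
    """Run gates cheapest first and stop at the first failure.

    Returns (passed, name of the failing gate or None, total cost spent).
    """
    spent = 0.0
    for gate in sorted(gates, key=lambda g: g.cost):
        spent += gate.cost
        if not gate.check(candidate):
            return False, gate.name, spent
    return True, None, spent


broken = "def add(a, b) return a + b"          # syntax error
wrong = "def add(a, b):\n    return a - b\n"   # compiles, fails the target test
good = "def add(a, b):\n    return a + b\n"

for label, candidate in [("broken", broken), ("wrong", wrong), ("good", good)]:
    print(label, run_gates(candidate, GATES))
```

With these made-up costs, the syntactically broken candidate is rejected after spending 1 unit and the wrong candidate after 11, while only the candidate that survives both cheap gates pays the full 111.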

Tooling

Feedback