Skip to content

Edit Format Selection: Diff vs. Search-Replace vs. Full Rewrite

Edit format is how an LLM expresses code changes — full file, search-replace, or structure-aware diff — and the choice swings accuracy and cost 2-3x.

Why format choice matters

Switching a strict unified diff to a search-replace block raised GPT-4's score on Aider's editing benchmark from 26% to 59%, with no model or prompt change (Aider, "Unified diffs make GPT-4 Turbo 3X less lazy"). The "To Diff or Not to Diff?" paper explains why: "fragile offsets and fragmented hunks make generation highly unnatural for LLMs" (Cao et al., 2026, arxiv:2604.27296). The Diff-XYZ benchmark finds search-replace beats unified diffs for larger models, while smaller open models gain almost nothing from any format change (Glukhov et al., 2025, arxiv:2510.12487). The effect is real, but not universal.

The format spectrum

Format Anchor Best for
Full rewrite Whole file Short files; small edits where anchor overhead exceeds the file
Search-replace block Exact old_str / new_str Frontier-model API users; mixed-language repos
Structure-aware diff (BlockDiff / FuncDiff) AST block (control structure or function) Long files; fine-tuned models; AST-supported languages
Line-numbered unified diff @@ -old,+new @@ headers Patch tools and humans, not LLMs

Full rewrite

The model emits the entire updated file. It always applies, but cost scales linearly with file length. On function-level benchmarks, full rewrite is the cheapest option for ~50–60% of samples — short bodies make anchor overhead larger than the new content (arxiv:2604.27296 §5).

Search-replace block

The model emits exact old_str and new_str; the harness substitutes one for the other. Aider and Anthropic's str_replace_based_edit_tool both use this shape. Anthropic requires the match to be unique — multiple matches return an error and force the model to expand context (Claude text editor docs). Aider treats each unified-diff hunk as a search-replace operation, ignoring line-number headers (Aider unified-diffs).

Structure-aware diff

BlockDiff and FuncDiff use tree-sitter to align hunks to AST nodes — control structures for BlockDiff, functions and classes for FuncDiff. Anchors expand outward until contextually unique. On long-code edits (>300 tokens), AdaEdit cuts latency and cost by over 30% versus full rewrite while matching its accuracy (arxiv:2604.27296 §5).

Line-numbered unified diff

LLMs drop blank lines, forget the leading +, and miscount offsets in @@ headers. Aider's documentation captures the practitioner consensus: "GPT is terrible at working with source code line numbers" (Aider unified-diffs).

The mechanism: distribution alignment

LLMs are trained on coherent code spans, not patch fragments. A line-numbered diff forces a non-local positional commitment (the @@ header) while emitting code; any drift produces an unappliable patch. Aligning the diff unit to a syntactic block moves generation back inside the training distribution — the model emits a complete unit as it would during code completion (arxiv:2604.27296 §3).

Search-replace captures most of this gain without AST awareness: it swaps positional commitments for content anchors, and any unique span of code is an in-distribution target.

Adaptive selection (AdaEdit)

For each source–target pair, AdaEdit picks whichever is shorter — diff or full rewrite — as the training label. The fine-tuned model learns to choose the cheaper format per sample, hitting >90% selection accuracy (>95% within a 20% token-deviation tolerance) (arxiv:2604.27296 §4).

Without fine-tuning, a harness can approximate this rule: full rewrite below ~300 tokens of file content, search-replace above.

Selection heuristic

graph TD
    A[Edit request] --> B{File length}
    B -->|< 300 tokens| C[Full rewrite]
    B -->|>= 300 tokens| D{Fine-tuning available?}
    D -->|Yes| E[Structure-aware diff + AdaEdit]
    D -->|No| F{Match anchor unique?}
    F -->|Yes| G[Search-replace block]
    F -->|No| C

For frontier API models, the decision reduces to full rewrite versus search-replace. Structure-aware diffs pay off when you control fine-tuning data and the language has reliable AST tooling.

When this backfires

  • Small or open-weights models: Diff-XYZ shows smaller models gain little from format engineering — they fail at all formats roughly equally (arxiv:2510.12487).
  • Short edits to short files: anchor plus surrounding context exceeds the file, so full rewrite is strictly cheaper.
  • Languages without solid AST tooling: BlockDiff and FuncDiff rely on tree-sitter grammars, and templated or DSL-mixed source loses the structural guarantee.
  • Non-unique repeated code: search-replace fails when the anchor matches multiple sites; Anthropic's tool errors and forces a retry with expanded context (Claude text editor docs).
  • No fine-tuning access: AdaEdit is a training-time strategy, so API users can apply the formats but cannot replicate adaptive selection without harness-side heuristics.

Example

A 1,200-token Python file needs a five-line change inside one function. Three formats give three cost profiles:

  • Full rewrite: regenerate the entire file, ~1,200 output tokens. It always applies and scales linearly with file size.
  • Search-replace block: unique surrounding context as old_str, modified function as new_str, ~150–250 output tokens. It applies if the anchor is unique; otherwise the harness errors and the model retries with more context.
  • FuncDiff: modified function with an AST-derived anchor, ~120–200 output tokens. It requires tree-sitter at apply time and a model trained on the format to perform reliably (arxiv:2604.27296 §3).

Search-replace is the practical choice for a frontier API model on this file. Below ~300 tokens of file content, full rewrite would have been cheaper.

Key Takeaways

  • Edit format is a real lever on accuracy and cost — measured in 30%+ token reductions and 2–3× accuracy gains in published benchmarks.
  • The mechanism is distribution alignment: replace positional anchors (line numbers) with content anchors (unique strings or AST blocks) so the model emits coherent code instead of fragmented patches.
  • Format engineering helps frontier models more than smaller ones; verify the lever exists for your model class before investing.
  • For API consumers, the practical decision is full rewrite for short files, search-replace above ~300 tokens; structure-aware diffs add value mainly when you control fine-tuning.
Feedback