Edit Format Selection: Diff vs. Search-Replace vs. Full Rewrite

Edit format is how an LLM expresses code changes (full file, search-replace, or structure-aware diff), and in published benchmarks the choice shifts accuracy and cost by a factor of 2–3.

Why format choice matters

Changing only the edit format moved GPT-4 Turbo's score on Aider's laziness benchmark from 20% to 61%: the baseline used Aider's search-replace blocks, and the improved run used a unified diff with the line numbers stripped, which Aider applies hunk by hunk as a search-and-replace (Aider, "Unified diffs make GPT-4 Turbo 3X less lazy"). The "To Diff or Not to Diff?" paper explains why line-numbered diffs in particular are hard: "fragile offsets and fragmented hunks make generation highly unnatural for LLMs" (Cao et al., 2026, arxiv:2604.27296). The Diff-XYZ benchmark finds search-replace beats unified diffs for larger models, while smaller open models gain almost nothing from any format change (Glukhov et al., 2025, arxiv:2510.12487). The effect is real, but neither its size nor its direction is universal.

The format spectrum

| Format | Anchor | Best for |
| --- | --- | --- |
| Full rewrite | Whole file | Short files; small edits where anchor plus context would exceed the file itself |
| Search-replace block | Exact `old_str` / `new_str` | Frontier-model API users; mixed-language repos |
| Structure-aware diff (BlockDiff / FuncDiff) | AST block (control structure or function) | Long files; fine-tuned models; AST-supported languages |
| Line-numbered unified diff | `@@ -start,count +start,count @@` headers | Patch tools and humans, not LLMs |

Full rewrite

The model emits the entire updated file. It always applies, but cost scales linearly with file length. On function-level benchmarks, full rewrite is the cheapest option for ~50–60% of samples — short bodies make anchor overhead larger than the new content (arxiv:2604.27296 §5).

Search-replace block

The model emits an exact `old_str` and `new_str`; the harness substitutes one for the other. Aider and Anthropic's `str_replace_based_edit_tool` both use this shape. Anthropic requires the match to be unique: multiple matches return an error and force the model to expand context (Claude text editor docs). Aider treats each unified-diff hunk as a search-replace operation, ignoring line-number headers (Aider unified-diffs).
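A minimal sketch of the harness side of this format, assuming exact string matching and a uniqueness requirement like the one described above. The names `apply_search_replace` and `EditError` are hypothetical and are not taken from Aider or Anthropic's tooling.

```python
class EditError(ValueError):
    """Raised when a search-replace edit cannot be applied unambiguously."""


def apply_search_replace(source: str, old_str: str, new_str: str) -> str:
    """Replace the single exact occurrence of old_str in source with new_str."""
    if not old_str:
        raise EditError("old_str must not be empty")
    # str.count counts non-overlapping occurrences, which is enough for this sketch.
    matches = source.count(old_str)
    if matches == 0:
        raise EditError("old_str not found; the anchor must match the file exactly")
    if matches > 1:
        raise EditError(f"old_str matches {matches} sites; add surrounding context")
    return source.replace(old_str, new_str, 1)


source = "def area(r):\n    return 3.14 * r * r\n"
patched = apply_search_replace(source, "3.14 * r * r", "math.pi * r ** 2")
print(patched)
```

Raising on zero or multiple matches, rather than guessing, gives the model a concrete error to react to on the next turn, typically by resending the edit with a longer anchor.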

Structure-aware diff

BlockDiff and FuncDiff use tree-sitter to align hunks to AST nodes: control structures for BlockDiff, functions and classes for FuncDiff. Anchors expand outward until contextually unique. On long-code edits (>300 tokens), AdaEdit, the adaptive scheme that picks between a diff and a full rewrite per sample, cuts latency and cost by over 30% versus full rewrite while matching its accuracy (arxiv:2604.27296 §5).
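To illustrate what aligning an edit to an AST node means, here is a small sketch that finds the innermost function or class enclosing a changed line. It uses Python's standard `ast` module as a stand-in for tree-sitter, so it only handles Python source, and it is not the BlockDiff or FuncDiff algorithm from the paper. The helper name `enclosing_definition` is hypothetical.

```python
import ast
from typing import Optional, Tuple

DEFINITION_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def enclosing_definition(source: str, line: int) -> Optional[Tuple[str, int, int]]:
    """Return (name, first_line, last_line) of the tightest def/class containing `line` (1-indexed)."""
    best = None
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, DEFINITION_NODES) and node.lineno <= line <= node.end_lineno:
            span = node.end_lineno - node.lineno
            # Keep the node with the smallest line span, i.e. the innermost one.
            if best is None or span < best[2] - best[1]:
                best = (node.name, node.lineno, node.end_lineno)
    return best


source = (
    "class Circle:\n"
    "    def __init__(self, r):\n"
    "        self.r = r\n"
    "\n"
    "    def area(self):\n"
    "        return 3.14 * self.r * self.r\n"
)
print(enclosing_definition(source, 6))  # the changed line sits inside Circle.area
```

A function-level diff would then emit the whole `area` method as its unit, instead of a hunk cut at arbitrary line boundaries.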

Line-numbered unified diff

LLMs drop blank lines, forget the leading `+`, and miscount offsets in `@@` headers. Aider's documentation captures the practitioner consensus: "GPT is terrible at working with source code line numbers" (Aider unified-diffs).
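The fragility is easy to reproduce. The sketch below generates a unified diff with Python's standard `difflib`, then checks whether a hunk body still agrees with the line counts promised in its `@@` header. The checker `hunk_counts_match` is a hypothetical helper written for this illustration, not part of any patch tool.

```python
import difflib
import re
from typing import List

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def hunk_counts_match(header: str, body: List[str]) -> bool:
    """Check that the hunk body has as many old/new lines as the header declares."""
    m = HUNK_HEADER.match(header)
    if m is None:
        return False
    old_count = int(m.group(2) or 1)  # an omitted count means 1
    new_count = int(m.group(4) or 1)
    old_seen = sum(1 for line in body if line[:1] in (" ", "-"))
    new_seen = sum(1 for line in body if line[:1] in (" ", "+"))
    return (old_seen, new_seen) == (old_count, new_count)


old = ["def area(r):\n", "    return 3.14 * r * r\n"]
new = ["import math\n", "\n", "def area(r):\n", "    return math.pi * r ** 2\n"]

diff = list(difflib.unified_diff(old, new, fromfile="a/geom.py", tofile="b/geom.py"))
header = next(line for line in diff if line.startswith("@@"))
body = diff[diff.index(header) + 1:]

# Simulate a model that silently drops the added blank line.
lazy_body = [line for line in body if line != "+\n"]

print(header.strip())
print(hunk_counts_match(header, body), hunk_counts_match(header, lazy_body))
```

One dropped blank line is enough to make the body disagree with the header, which is the kind of mismatch a strict patch tool rejects.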

The mechanism: distribution alignment

LLMs are trained on coherent code spans, not patch fragments. A line-numbered diff forces a non-local positional commitment (the @@ header) while emitting code; any drift produces an unappliable patch. Aligning the diff unit to a syntactic block moves generation back inside the training distribution — the model emits a complete unit as it would during code completion (arxiv:2604.27296 §3).

Search-replace captures most of this gain without AST awareness: it swaps positional commitments for content anchors, and any unique span of code is an in-distribution target.

Adaptive selection (AdaEdit)

For each source–target pair, AdaEdit picks whichever is shorter — diff or full rewrite — as the training label. The fine-tuned model learns to choose the cheaper format per sample, hitting >90% selection accuracy (>95% within a 20% token-deviation tolerance) (arxiv:2604.27296 §4).

Without fine-tuning, a harness can approximate this rule: full rewrite below ~300 tokens of file content, search-replace above.
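The labelling rule can be sketched as a comparison of two candidate outputs. This is an illustration only: `rough_token_count` is a crude whitespace-based proxy (a real pipeline would use the model's tokenizer), and breaking ties in favour of full rewrite is an assumption, not something the paper states.

```python
from typing import Callable, Tuple


def rough_token_count(text: str) -> int:
    # Crude proxy: whitespace-separated chunks, NOT a real tokenizer.
    return len(text.split())


def choose_label(
    full_rewrite: str,
    diff: str,
    count: Callable[[str], int] = rough_token_count,
) -> Tuple[str, str]:
    """Pick the shorter of the two candidate outputs as the training label."""
    if count(diff) < count(full_rewrite):
        return "diff", diff
    # Assumption: on a tie, prefer full rewrite because it always applies.
    return "full_rewrite", full_rewrite


short_file = "def f(x):\n    return x + 1\n"
short_diff = "old: return x\nnew: return x + 1"
long_file = "\n".join(f"line_{i} = {i}" for i in range(200))

print(choose_label(short_file, short_diff)[0])
print(choose_label(long_file, short_diff)[0])
```

For the short function the anchor overhead makes the diff longer than the file, so the full rewrite becomes the label; for the 200-line file the diff wins.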

Selection heuristic

```mermaid
graph TD
    A[Edit request] --> B{File length}
    B -->|< 300 tokens| C[Full rewrite]
    B -->|>= 300 tokens| D{Fine-tuning available?}
    D -->|Yes| E[Structure-aware diff + AdaEdit]
    D -->|No| F{Match anchor unique?}
    F -->|Yes| G[Search-replace block]
    F -->|No| C
```

For frontier API models, the decision reduces to full rewrite versus search-replace. Structure-aware diffs pay off when you control fine-tuning data and the language has reliable AST tooling.
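The decision tree translates directly into a routing function a harness could call before prompting. This is a sketch under the document's own assumptions: the 300-token threshold is the rough figure quoted above, and the function and return-value names are hypothetical.

```python
def select_edit_format(
    file_tokens: int,
    fine_tuned: bool,
    anchor_unique: bool,
    threshold: int = 300,
) -> str:
    """Route an edit request to a format, following the decision tree above."""
    if file_tokens < threshold:
        return "full_rewrite"
    if fine_tuned:
        return "structure_aware_diff"
    # Without fine-tuning, fall back to full rewrite when no unique anchor exists.
    return "search_replace" if anchor_unique else "full_rewrite"


print(select_edit_format(1200, fine_tuned=False, anchor_unique=True))
```

In practice `anchor_unique` is only known after a candidate anchor has been checked against the file, so a harness may need to evaluate it lazily or retry with more context first.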

When this backfires

- Small or open-weights models: Diff-XYZ shows smaller models gain little from format engineering; they fail at all formats roughly equally (arxiv:2510.12487).
- Short edits to short files: anchor plus surrounding context exceeds the file, so full rewrite is strictly cheaper.
- Languages without solid AST tooling: BlockDiff and FuncDiff rely on tree-sitter grammars, and templated or DSL-mixed source loses the structural guarantee.
- Non-unique repeated code: search-replace fails when the anchor matches multiple sites; Anthropic's tool errors and forces a retry with expanded context (Claude text editor docs).
- No fine-tuning access: AdaEdit is a training-time strategy, so API users can apply the formats but cannot replicate adaptive selection without harness-side heuristics.

Example

A 1,200-token Python file needs a five-line change inside one function. Three formats give three cost profiles:

- Full rewrite: regenerate the entire file, ~1,200 output tokens. It always applies and scales linearly with file size.
- Search-replace block: unique surrounding context as `old_str`, modified function as `new_str`, ~150–250 output tokens. It applies if the anchor is unique; otherwise the harness errors and the model retries with more context.
- FuncDiff: modified function with an AST-derived anchor, ~120–200 output tokens. It requires tree-sitter at apply time and a model trained on the format to perform reliably (arxiv:2604.27296 §3).

Search-replace is the practical choice for a frontier API model on this file. Below ~300 tokens of file content, full rewrite would have been cheaper.
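The relative savings implied by these estimates can be worked out with a few lines of arithmetic. The token figures are the rough ranges from the example above, not measurements, and the helper name `output_saving` is hypothetical.

```python
def output_saving(full_tokens: int, edit_tokens: int) -> float:
    """Fraction of output tokens saved relative to regenerating the whole file."""
    return 1 - edit_tokens / full_tokens


FILE_TOKENS = 1200  # full-rewrite output size from the example
estimates = {"search-replace": (150, 250), "FuncDiff": (120, 200)}

for name, (low, high) in estimates.items():
    best = output_saving(FILE_TOKENS, low)
    worst = output_saving(FILE_TOKENS, high)
    print(f"{name}: {worst:.1%} to {best:.1%} fewer output tokens than full rewrite")
```

Under these estimates search-replace saves roughly 79–88% of output tokens and FuncDiff roughly 83–90%. The figures count a single successful attempt only; each retry caused by a non-unique anchor eats into the saving.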

Key Takeaways

- Edit format is a real lever on accuracy and cost: published benchmarks report cost reductions of over 30% and accuracy gains of 2–3×.
- The mechanism is distribution alignment: replace positional anchors (line numbers) with content anchors (unique strings or AST blocks) so the model emits coherent code instead of fragmented patches.
- Format engineering helps frontier models more than smaller ones; verify the lever exists for your model class before investing.
- For API consumers, the practical decision is full rewrite for short files, search-replace above ~300 tokens; structure-aware diffs add value mainly when you control fine-tuning.