Attention Sinks: Why First Tokens Always Win¶

Transformer models disproportionately attend to initial tokens regardless of their semantic content — position determines attention weight, not importance.

Related lesson: Lost in the Middle — this concept features in a hands-on lesson with quizzes.

Also known as

Lost in the Middle, Critical Instruction Repetition, Attention Bias and Instruction Placement

What attention sinks are¶

Attention in autoregressive transformer models is structurally biased toward early tokens. The first tokens act as attention sinks. They absorb a large share of attention from every later token, whatever their meaning at the current generation step. Xiao et al. (2023) confirmed this: keeping just the KV cache of the early tokens largely recovers the performance of full-window attention (StreamingLLM).

This is not a quirk to fix. It is a structural property of how causal attention masking works. Every token the model generates is shaped more by early tokens than by equivalent tokens placed later in the context.

A more precise account narrows the mechanism. Gu et al. (2024) found that the sink concentrates on the first token rather than spreading smoothly across an early-position band. It is a learned behavior that emerges during pre-training under softmax normalization. Replace softmax with sigmoid attention and the sink does not appear in models up to 1B parameters, so it is not strictly inherent to causal masking (When Attention Sink Emerges in Language Models). The practical takeaway holds: the strongest-attention position is the very start of the prompt. But treat "earlier is stronger" as a first-token-anchored, softmax-driven effect, not a uniform positional gradient.

Practical implications¶

The role definition placed first shapes behavior most. Whatever role, persona, or constraint appears at the very start of a system prompt gets stronger attention than the same constraint placed later. An instruction like "you are a security reviewer who never produces code without first identifying potential vulnerabilities" carries more weight at position 1 than at position 500.

Boilerplate wastes the highest-attention positions. A system prompt that opens with:

# System Prompt v2.3 — Agent: Code Reviewer
# Created: 2025-01-01
# Last updated: 2025-03-08

You are an AI assistant designed to help developers...

has spent the strongest-attention positions on metadata and generic preamble. The actual rules, constraints, and role definition follow in weaker-attention territory.

What you ask first, the agent recalls best. In a long conversation, restate a critical constraint at the point where you need it rather than relying on an early-session statement. This uses the recency effect at the other end of the U-shaped attention curve.

Applying the pattern¶

Start instruction files with the constraint or role that must be most reliably followed:

Never output code that modifies authentication or session state without
first identifying the downstream security impact.

You are reviewing a pull request...

Not:

You are an AI code reviewer assistant. Your goal is to provide
helpful, accurate code review feedback. When reviewing code, consider...
[10 lines later]
Never output code that modifies authentication or session state...

The rule comes first; the context follows. The agent's strongest recall is on the rule.

Relationship to the U-shaped curve¶

Attention sinks explain the strong-start portion of the U-shaped attention curve. Recency effects in autoregressive generation explain the strong-end portion. Liu et al. (2023) showed that performance is highest when relevant information sits at the beginning or end of the context window, and drops sharply in the middle (Lost in the Middle). Together:

First tokens: attention sink bias (high recall)
Middle tokens: weakest attention (low recall)
Last tokens: recency effect (high recall)

Content that must be reliably followed belongs at either end. Content the agent refers to passively can occupy the middle.

Key Takeaways¶

Initial tokens receive disproportionate attention — open instruction files with your most critical constraint, not context-setting prose.
Boilerplate at the top of a system prompt wastes the highest-attention positions on low-value content.
The role and constraints placed first shape agent behaviour most strongly across the session.
Attention sinks and recency effects are the two mechanisms behind the U-shaped attention distribution.

When this backfires¶

Context compression discards early tokens. Techniques that compress or truncate context, including some KV-cache eviction strategies, may discard early tokens and neutralize the primacy advantage. Placing critical constraints first is only reliable when the full prompt prefix is kept (Context Compression Strategies).
Fine-tuned models with instruction-following training. RLHF and instruction-tuning can shift how models weigh positional bias against semantic relevance. A model fine-tuned to follow instructions placed anywhere in the prompt may not show the same sink strength as a base model.
RAG pipelines with late-injected context. In retrieval-augmented workflows, retrieved chunks are usually injected mid-prompt. If the critical constraint is buried in a static preamble before a lot of retrieved content, the semantic relevance of that material may outweigh its positional advantage.
Very short prompts. Attention sink effects are strongest in long sequences. In short prompts (under a few hundred tokens), positional placement has less observable effect on model behavior.