Attention Mechanism
The transformer operation that lets each token focus on the other tokens most relevant to it.
Self-attention is the core operation of the transformer. For each token, the model computes a weighted average over all tokens in the sequence (including the token itself). The weights are not fixed parameters: they are computed on the fly from learned query and key projections, so each token can pull in whatever else in the input is most useful to it.
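A minimal NumPy sketch of a single attention head, assuming the standard scaled dot-product formulation (function and variable names here are illustrative, not from any particular library):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence x of shape (n, d_model).

    w_q, w_k, w_v are the learned projection matrices; the attention
    weights themselves are recomputed for every input from query-key
    similarity, they are not stored parameters.
    """
    q = x @ w_q                                 # queries, (n, d_k)
    k = x @ w_k                                 # keys,    (n, d_k)
    v = x @ w_v                                 # values,  (n, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (n, n) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v        # each output row is a weighted average of values

# Toy usage: 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = [rng.normal(size=(8, 8)) for _ in range(3)]
out = self_attention(x, w_q, w_k, w_v)          # shape (4, 8)
```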
Multi-head attention runs many such operations in parallel, each with its own learned projections, then combines the results. This lets a layer attend to multiple kinds of relationships at once: syntactic, semantic, and positional.
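Building on the sketch above, multi-head attention reduces to running several heads and concatenating their outputs. This is a simplified illustration: real implementations use smaller per-head dimensions (typically d_model / num_heads) and apply a final learned output projection, omitted here:

```python
def multi_head_attention(x, heads):
    """heads: a list of (w_q, w_k, w_v) triples, one per head.

    Each head attends with its own projections; outputs are concatenated
    along the feature axis.  (An output projection mixing the heads back
    to d_model, standard in practice, is left out for brevity.)
    """
    return np.concatenate(
        [self_attention(x, wq, wk, wv) for wq, wk, wv in heads],
        axis=-1,
    )

# Toy usage: two heads over the same 4-token sequence.
heads = [tuple(rng.normal(size=(8, 8)) for _ in range(3)) for _ in range(2)]
out = multi_head_attention(x, heads)            # shape (4, 16)
```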
Attention is also the bottleneck that makes long contexts expensive: naive attention is O(n²) in sequence length. Techniques like FlashAttention, sliding-window attention, and KV-cache reuse have made today's massive context windows practical.
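To make the cost argument concrete, here is a hypothetical sketch of the mask behind sliding-window attention: each token attends only to a fixed-width window of predecessors, so work scales with n × window rather than n². (Production kernels skip the masked blocks entirely rather than materializing an n × n mask as done here.)

```python
def sliding_window_mask(n, window):
    """Boolean (n, n) mask: token i may attend to tokens j satisfying
    i - window < j <= i (a causal window).  Masked positions would be
    set to -inf in the score matrix before the softmax.
    """
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(5, 2).astype(int))    # band of width 2 below the diagonal
```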