Transformer

Definition

A sequence model built entirely from self-attention and feed-forward blocks, with no recurrence; processes all tokens in parallel and connects any two positions in O(1) operations.

Intuition

RNNs process tokens sequentially — information from early tokens must “pass through” all subsequent hidden states, degrading over long distances; self-attention connects every token to every other token directly, making long-range dependencies equally accessible; multi-head attention learns multiple attention patterns simultaneously (syntactic, semantic, positional).

Formal Description

Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q ∈ R^{n×d_k}, K ∈ R^{n×d_k}, V ∈ R^{n×d_v}; scaling by √d_k prevents softmax saturation for large d_k.
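A minimal NumPy sketch of the formula above; the shapes and random inputs are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (n, d_v) weighted values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                                    # toy sizes, chosen arbitrarily
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Without the √d_k divisor, dot products grow with d_k and push the softmax into near-one-hot regions with vanishing gradients.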

Self-attention: queries, keys, and values are all linear projections of the same input sequence X; Q = XW^Q, K = XW^K, V = XW^V; captures dependencies within the sequence.

Multi-head attention: h parallel attention heads with separate projection matrices W_i^Q, W_i^K, W_i^V; MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V); different heads can specialize in different relation types.
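A sketch of the split-attend-concat pattern, assuming d_model is divisible by h and using the common implementation trick of one big projection per Q/K/V that is reshaped into heads:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Project, split into h heads, attend per head, concat, project with W_O."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                  # (n, d_model) each

    def split(M):                                        # (n, d_model) -> (h, n, d_head)
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax(scores) @ Vh                           # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1..head_h)
    return concat @ W_O

rng = np.random.default_rng(1)
n, d_model, h = 5, 16, 4                                 # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (5, 16)
```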

Transformer block: LayerNorm → Multi-Head Self-Attention → residual add → LayerNorm → Feed-Forward (two linear layers + ReLU) → residual add (the pre-LN ordering; the original paper applied LayerNorm after each residual add); encoder: stack of such blocks; decoder: same but adds cross-attention over the encoder output.
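The block wiring can be sketched as below (pre-LN ordering, single-head attention for brevity; all weight shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, W_Q, W_K, W_V, W1, W2):
    # pre-LN ordering: LayerNorm -> sublayer -> residual add
    X = X + self_attention(layer_norm(X), W_Q, W_K, W_V)
    ffn = np.maximum(0, layer_norm(X) @ W1) @ W2          # two linear layers + ReLU
    return X + ffn

rng = np.random.default_rng(2)
n, d, d_ff = 6, 8, 32                                    # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d)) * 0.1
out = transformer_block(X, W_Q, W_K, W_V, W1, W2)
print(out.shape)  # (6, 8) -- residual adds keep the shape fixed, so blocks stack
```

The residual connections keep input and output shapes identical, which is what lets an encoder stack many such blocks.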

Positional encoding: since self-attention is permutation-equivariant, position information must be injected; sinusoidal: PE(pos, 2i) = sin(pos / 10000^{2i/d_model}), PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model}); or learned positional embeddings (BERT/GPT style).
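The sinusoidal formulas can be computed in one shot for all positions (d_model assumed even here):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); odd dimensions use cos."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims
    pe[:, 1::2] = np.cos(angles)                 # odd dims
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: sin(0)=0, cos(0)=1 alternating -> [0. 1. 0. 1.]
```

The resulting matrix is simply added to the token embeddings before the first block.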

Complexity: O(n²·d) per layer for self-attention (vs. O(n·d²) for an RNN layer); for n < d, self-attention is cheaper per layer than recurrence.
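A quick arithmetic check of the crossover (constants ignored; the sizes are arbitrary examples):

```python
# per-layer cost, ignoring constant factors, for sequence length n and width d
def attn_cost(n, d):
    return n * n * d      # self-attention: O(n^2 * d)

def rnn_cost(n, d):
    return n * d * d      # recurrence: O(n * d^2)

n, d = 512, 1024
print(attn_cost(n, d) < rnn_cost(n, d))  # True: n < d, so attention is cheaper
```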

Applications

All modern NLP (BERT, GPT, T5); vision (ViT, DINO); multimodal models; audio processing; protein structure (AlphaFold2).

Trade-offs

O(n²) attention scaling limits context length (addressed by sparse attention, linear attention, state-space models); no inductive bias for locality (CNNs have this for free); requires large datasets to train from scratch; fine-tuning pretrained Transformers is the standard approach.