Transformer

Definition

A sequence model built entirely from self-attention and feed-forward blocks, with no recurrence; processes all tokens in parallel and connects any two positions in O(1) operations.

Intuition

RNNs process tokens sequentially — information from early tokens must “pass through” all subsequent hidden states, degrading over long distances; self-attention connects every token to every other token directly, making long-range dependencies equally accessible; multi-head attention learns multiple attention patterns simultaneously (syntactic, semantic, positional).

Formal Description

Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V, where Q ∈ R^{n×d_k}, K ∈ R^{n×d_k}, V ∈ R^{n×d_v}; scaling by √d_k prevents softmax saturation for large d_k.
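A minimal NumPy sketch of the formula above; the shapes and random inputs are illustrative only:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # (n, d_v) weighted values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8                                    # toy sizes, chosen arbitrarily
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Without the √d_k divisor, dot products grow with d_k and push the softmax into near-one-hot regions with vanishing gradients.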

Self-attention: queries, keys, and values are all linear projections of the same input sequence X; Q = XW^Q, K = XW^K, V = XW^V; captures dependencies within the sequence.

Multi-head attention: h parallel attention heads with separate projection matrices W_i^Q, W_i^K, W_i^V; MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V); different heads can specialize in different relation types.
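A sketch of the split-attend-concat pattern, assuming d_model is divisible by h and using the common implementation trick of one big projection per Q/K/V that is reshaped into heads:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Project, split into h heads, attend per head, concat, project with W_O."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V                  # (n, d_model) each

    def split(M):                                        # (n, d_model) -> (h, n, d_head)
        return M.reshape(n, h, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax(scores) @ Vh                           # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # Concat(head_1..head_h)
    return concat @ W_O

rng = np.random.default_rng(1)
n, d_model, h = 5, 16, 4                                 # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(out.shape)  # (5, 16)
```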

Transformer block: LayerNorm → Multi-Head Self-Attention → residual add → LayerNorm → Feed-Forward (two linear layers + ReLU) → residual add (the pre-LN ordering; the original paper applied LayerNorm after each residual add); encoder: stack of such blocks; decoder: same but adds cross-attention over the encoder output.
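The block wiring can be sketched as below (pre-LN ordering, single-head attention for brevity; all weight shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def self_attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def transformer_block(X, W_Q, W_K, W_V, W1, W2):
    # pre-LN ordering: LayerNorm -> sublayer -> residual add
    X = X + self_attention(layer_norm(X), W_Q, W_K, W_V)
    ffn = np.maximum(0, layer_norm(X) @ W1) @ W2          # two linear layers + ReLU
    return X + ffn

rng = np.random.default_rng(2)
n, d, d_ff = 6, 8, 32                                    # toy sizes, chosen arbitrarily
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d)) * 0.1
out = transformer_block(X, W_Q, W_K, W_V, W1, W2)
print(out.shape)  # (6, 8) -- residual adds keep the shape fixed, so blocks stack
```

The residual connections keep input and output shapes identical, which is what lets an encoder stack many such blocks.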

Positional encoding: since self-attention is permutation-equivariant, position information must be injected; sinusoidal: PE(pos, 2i) = sin(pos / 10000^{2i/d_model}), PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model}); or learned positional embeddings (BERT/GPT style).
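The sinusoidal formulas can be computed in one shot for all positions (d_model assumed even here):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); odd dimensions use cos."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims
    pe[:, 1::2] = np.cos(angles)                 # odd dims
    return pe

pe = sinusoidal_pe(50, 16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # position 0: sin(0)=0, cos(0)=1 alternating -> [0. 1. 0. 1.]
```

The resulting matrix is simply added to the token embeddings before the first block.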

Complexity: O(n²·d) per layer for self-attention (vs. O(n·d²) for an RNN layer); for n < d, self-attention is cheaper per layer than recurrence.
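A quick arithmetic check of the crossover (constants ignored; the sizes are arbitrary examples):

```python
# per-layer cost, ignoring constant factors, for sequence length n and width d
def attn_cost(n, d):
    return n * n * d      # self-attention: O(n^2 * d)

def rnn_cost(n, d):
    return n * d * d      # recurrence: O(n * d^2)

n, d = 512, 1024
print(attn_cost(n, d) < rnn_cost(n, d))  # True: n < d, so attention is cheaper
```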

Applications

All modern NLP (BERT, GPT, T5); vision (ViT, DINO); multimodal models; audio processing; protein structure (AlphaFold2).

Trade-offs

O(n²) attention scaling limits context length (addressed by sparse attention, linear attention, state-space models); no inductive bias for locality (CNNs have this for free); requires large datasets to train from scratch; fine-tuning pretrained Transformers is the standard approach.