Attention Mechanism

Definition

A mechanism that computes a context vector as a learned, input-dependent weighted combination of encoder states, allowing the decoder to “attend” to different parts of the input at each output step.

Intuition

The fixed-size context vector in vanilla seq2seq is an information bottleneck for long sequences; attention lets the decoder look back at all encoder states and focus on the relevant parts at each output step — analogous to how a human translator refers back to the source sentence while translating.

Formal Description

At decoder step $t$, compute an alignment score $e_{t,i} = \mathrm{score}(s_{t-1}, h_i)$ between the previous decoder state $s_{t-1}$ and each encoder state $h_i$. Common score functions: dot product $s_{t-1}^{\top} h_i$; additive/Bahdanau $v_a^{\top} \tanh(W_a s_{t-1} + U_a h_i)$; multiplicative/Luong $s_{t-1}^{\top} W_a h_i$.
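
The three score functions can be sketched in a few lines of NumPy. The dimensions, weight names ($W$, $U$, $v$), and random initialization below are illustrative assumptions, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                      # hidden size (assumed)
s = rng.standard_normal(d)      # previous decoder state s_{t-1}
h = rng.standard_normal((6, d)) # encoder states h_1..h_6

# Dot product: e_i = s^T h_i
e_dot = h @ s

# Multiplicative (Luong): e_i = s^T W h_i
W = rng.standard_normal((d, d))
e_mul = h @ W.T @ s

# Additive (Bahdanau): e_i = v^T tanh(W s + U h_i)
W_s = rng.standard_normal((d, d))
U_h = rng.standard_normal((d, d))
v = rng.standard_normal(d)
e_add = np.tanh(s @ W_s.T + h @ U_h.T) @ v
```

Each variant produces one scalar score per encoder position; they differ in how many learned parameters they use and in whether the decoder and encoder states must share a dimensionality.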

Attention weights (softmax): $\alpha_{t,i} = \exp(e_{t,i}) / \sum_j \exp(e_{t,j})$, so $\sum_i \alpha_{t,i} = 1$.

Context vector: $c_t = \sum_i \alpha_{t,i} h_i$; fed to the decoder along with the previous output $y_{t-1}$.

Complexity: $O(T_{\mathrm{src}} \cdot T_{\mathrm{tgt}})$ alignment computations per sequence pair.
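
The quadratic cost is visible in the score matrix: computing all scores at once produces one entry per (decoder step, encoder position) pair. Shapes below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T_src, T_tgt, d = 7, 5, 4
H = rng.standard_normal((T_src, d))  # encoder states
S = rng.standard_normal((T_tgt, d))  # decoder states

E = S @ H.T  # one dot-product score per (t, i) pair: T_tgt * T_src = 35 scores
```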

Relation to self-attention: in encoder–decoder (cross-)attention the query comes from the decoder and the keys/values from the encoder; in self-attention all three come from the same sequence (see transformer).
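
The distinction is only about where the query, key, and value inputs come from; the computation is the same. A sketch with a shared `attend` helper (learned projections omitted for brevity; the $\sqrt{d}$ scaling is a transformer convention, assumed here):

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Scaled dot-product attention over rows of Q against rows of K/V."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
enc = rng.standard_normal((6, 8))  # encoder states
dec = rng.standard_normal((3, 8))  # decoder states

cross = attend(dec, enc, enc)   # queries from decoder, keys/values from encoder
self_out = attend(enc, enc, enc)  # all three from the same sequence
```

Cross-attention returns one context vector per decoder step; self-attention returns one per position of the (single) input sequence.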

Applications

Machine translation (Bahdanau et al., 2015), image captioning (attend to spatial CNN features), speech recognition; the attention concept generalizes to Transformers (self-attention).

Trade-offs

The $O(T_{\mathrm{src}} \cdot T_{\mathrm{tgt}})$ alignment computation can be expensive; attention is not a silver bullet for very long sequences; the attention weights provide human-interpretable alignments but can be noisy.