Attention Mechanism

Definition

A mechanism that computes a context vector as a learned, input-dependent weighted combination of encoder states, allowing the decoder to “attend” to different parts of the input at each output step.

Intuition

The fixed-size context vector in vanilla seq2seq is an information bottleneck for long sequences; attention lets the decoder look back at all encoder states and focus on the relevant parts at each output step — analogous to how a human translator refers back to the source sentence while translating.

Formal Description

At decoder step $t$, compute an alignment score $e_{t,i} = \mathrm{score}(s_{t-1}, h_i)$ between the previous decoder state $s_{t-1}$ and each encoder state $h_i$. Common score functions: dot product $s_{t-1}^{\top} h_i$; additive/Bahdanau $v_a^{\top} \tanh(W_a s_{t-1} + U_a h_i)$; multiplicative/Luong $s_{t-1}^{\top} W_a h_i$.
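
The three score functions can be sketched in a few lines of NumPy. The dimensions, weight names ($W$, $U$, $v$), and random initialization below are illustrative assumptions, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                      # hidden size (assumed)
s = rng.standard_normal(d)      # previous decoder state s_{t-1}
h = rng.standard_normal((6, d)) # encoder states h_1..h_6

# Dot product: e_i = s^T h_i
e_dot = h @ s

# Multiplicative (Luong): e_i = s^T W h_i
W = rng.standard_normal((d, d))
e_mul = h @ W.T @ s

# Additive (Bahdanau): e_i = v^T tanh(W s + U h_i)
W_s = rng.standard_normal((d, d))
U_h = rng.standard_normal((d, d))
v = rng.standard_normal(d)
e_add = np.tanh(s @ W_s.T + h @ U_h.T) @ v
```

Each variant produces one scalar score per encoder position; they differ in how many learned parameters they use and in whether the decoder and encoder states must share a dimensionality.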

Attention weights (softmax): $\alpha_{t,i} = \exp(e_{t,i}) / \sum_j \exp(e_{t,j})$, so $\sum_i \alpha_{t,i} = 1$.

Context vector: $c_t = \sum_i \alpha_{t,i} h_i$; fed to the decoder along with the previous output $y_{t-1}$.

Complexity: $O(T_{\mathrm{src}} \cdot T_{\mathrm{tgt}})$ alignment computations per sequence pair.
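
The quadratic cost is visible in the score matrix: computing all scores at once produces one entry per (decoder step, encoder position) pair. Shapes below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T_src, T_tgt, d = 7, 5, 4
H = rng.standard_normal((T_src, d))  # encoder states
S = rng.standard_normal((T_tgt, d))  # decoder states

E = S @ H.T  # one dot-product score per (t, i) pair: T_tgt * T_src = 35 scores
```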

Relation to self-attention: in encoder–decoder (cross-)attention the query comes from the decoder and the keys/values from the encoder; in self-attention all three come from the same sequence (see transformer).
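
The distinction is only about where the query, key, and value inputs come from; the computation is the same. A sketch with a shared `attend` helper (learned projections omitted for brevity; the $\sqrt{d}$ scaling is a transformer convention, assumed here):

```python
import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Scaled dot-product attention over rows of Q against rows of K/V."""
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
enc = rng.standard_normal((6, 8))  # encoder states
dec = rng.standard_normal((3, 8))  # decoder states

cross = attend(dec, enc, enc)   # queries from decoder, keys/values from encoder
self_out = attend(enc, enc, enc)  # all three from the same sequence
```

Cross-attention returns one context vector per decoder step; self-attention returns one per position of the (single) input sequence.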

Applications

Machine translation (Bahdanau et al., 2015), image captioning (attend to spatial CNN features), speech recognition; the attention concept generalizes to Transformers (self-attention).

Trade-offs

The $O(T_{\mathrm{src}} \cdot T_{\mathrm{tgt}})$ alignment computation can be expensive; attention is not a silver bullet for very long sequences; the attention weights provide human-interpretable alignments but can be noisy.