Attention Mechanism
Definition
A mechanism that computes a context vector as a learned, input-dependent weighted combination of encoder states, allowing the decoder to “attend” to different parts of the input at each output step.
Intuition
The fixed-size context vector in vanilla seq2seq is an information bottleneck for long sequences; attention lets the decoder look back at all encoder states and focus on the relevant parts at each output step — analogous to how a human translator refers back to the source sentence while translating.
Formal Description
At decoder step $t$, compute an alignment score $e_{t,i} = \mathrm{score}(s_{t-1}, h_i)$ with each encoder state $h_i$; common score functions: dot product $e_{t,i} = s_{t-1}^\top h_i$, additive/Bahdanau $e_{t,i} = v_a^\top \tanh(W_a s_{t-1} + U_a h_i)$, multiplicative $e_{t,i} = s_{t-1}^\top W_a h_i$.
Attention weights (softmax): $\alpha_{t,i} = \exp(e_{t,i}) / \sum_j \exp(e_{t,j})$; $\sum_i \alpha_{t,i} = 1$.
Context vector: $c_t = \sum_i \alpha_{t,i} h_i$; fed to the decoder along with the previous output $y_{t-1}$.
Complexity: $O(T_x T_y)$ alignment computations per sequence pair, where $T_x$ and $T_y$ are the source and target lengths.
Relation to self-attention: in encoder-decoder (cross-)attention the query comes from the decoder and the keys/values come from the encoder; in self-attention all three come from the same sequence (see transformer).
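The steps above can be sketched in a few lines. This is a minimal pure-Python illustration of one decoder step of dot-product attention, not a reference implementation; the function name, list-based vector representation, and toy shapes are assumptions for the example.

```python
import math

def attention_step(s_prev, H):
    """One decoder step of dot-product attention (illustrative sketch).

    s_prev: previous decoder state, a list of d floats
    H:      encoder states, a list of T_x vectors of length d
    Returns (context vector c_t, attention weights alpha_t).
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Alignment scores: e_{t,i} = s_{t-1}^T h_i (dot-product score)
    e = [dot(s_prev, h) for h in H]
    # Softmax over encoder positions (shift by max for numerical stability)
    m = max(e)
    exps = [math.exp(x - m) for x in e]
    Z = sum(exps)
    alpha = [x / Z for x in exps]            # weights sum to 1
    # Context vector: c_t = sum_i alpha_{t,i} h_i
    d = len(s_prev)
    c = [sum(alpha[i] * H[i][j] for i in range(len(H))) for j in range(d)]
    return c, alpha

# Toy example: two encoder states; the decoder state is aligned with the first,
# so most of the attention mass lands on H[0].
c, alpha = attention_step([2.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Swapping in the additive or multiplicative score from above only changes the line that computes `e`; the softmax and the weighted sum are the same for all score functions.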
Applications
Machine translation (Bahdanau et al., 2015), image captioning (attend to spatial CNN features), speech recognition; the attention concept generalizes to Transformers (self-attention).
Trade-offs
The pairwise alignment computation scales with source length times target length and can be expensive; attention is not a silver bullet for very long sequences; the attention weights provide human-interpretable alignments but can be noisy.