Sequence-to-Sequence Models

Definition

Encoder-decoder architectures that map a variable-length input sequence to a variable-length output sequence; combined with beam search for decoding and BLEU for evaluation.

Intuition

The encoder summarizes the input sequence into a context representation; the decoder generates the output sequence one token at a time, conditioned on the context; beam search avoids greedy decoding by maintaining multiple candidate sequences.

Formal Description

Seq2seq: encoder RNN reads x_1, ..., x_T, producing final hidden state (context vector) c = h_T; decoder RNN generates y_1, ..., y_T', initialized with c; P(y_t | y_1, ..., y_{t-1}, x) is modeled by the decoder at each step; unlike a plain language model, the output is conditioned on the input.
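
The encode-then-decode loop can be sketched with a toy numpy RNN. All sizes, weights, and the zero-vector "start token" embedding are invented for illustration, and decoding here is greedy (one argmax per step) rather than beam search:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_H, VOCAB = 4, 8, 10   # toy dimensions (assumptions, not from the text)

def rnn_step(x, h, Wx, Wh, b):
    # One vanilla RNN step: h' = tanh(Wx @ x + Wh @ h + b)
    return np.tanh(Wx @ x + Wh @ h + b)

# Encoder: read the input sequence; only the final hidden state survives
# as the fixed-size context vector c = h_T.
Wx_e, Wh_e, b_e = rng.normal(size=(D_H, D_IN)), rng.normal(size=(D_H, D_H)), np.zeros(D_H)
xs = [rng.normal(size=D_IN) for _ in range(5)]
h = np.zeros(D_H)
for x in xs:
    h = rnn_step(x, h, Wx_e, Wh_e, b_e)
c = h  # context vector

# Decoder: initialized with c, emits one token per step, feeding the
# embedding of its own previous output back in (autoregressive).
embed = rng.normal(size=(VOCAB, D_IN))
Wx_d, Wh_d, b_d = rng.normal(size=(D_H, D_IN)), rng.normal(size=(D_H, D_H)), np.zeros(D_H)
Wo = rng.normal(size=(VOCAB, D_H))

h, y = c, np.zeros(D_IN)  # zero vector stands in for a start-token embedding
tokens = []
for _ in range(4):
    h = rnn_step(y, h, Wx_d, Wh_d, b_d)
    tok = int(np.argmax(Wo @ h))  # greedy pick from P(y_t | y_<t, x)
    tokens.append(tok)
    y = embed[tok]
```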

Machine translation as conditional LM: P(y | x) = prod_t P(y_t | y_1, ..., y_{t-1}, x); training: teacher forcing (feed ground-truth previous tokens); inference: autoregressive decoding (feed the model's own predictions).
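
A minimal sketch of the teacher-forced training loss, assuming the model's conditional probabilities are given as a lookup table; the tokens and probability values are made up:

```python
import math

def teacher_forcing_nll(cond_prob, target):
    """Sum of -log P(y_t | y_{t-1}, x), always conditioning on the
    ground-truth previous token (teacher forcing), never on a sample."""
    nll, prev = 0.0, "<s>"
    for y in target:
        nll -= math.log(cond_prob[(prev, y)])
        prev = y  # feed the ground truth, not the model's own prediction
    return nll

# Toy table standing in for the decoder's P(next | prev, x).
cond_prob = {("<s>", "bonjour"): 0.8, ("bonjour", "</s>"): 0.9}
loss = teacher_forcing_nll(cond_prob, ["bonjour", "</s>"])
```

At inference time the `prev = y` line is what changes: the decoder must feed back its own sampled or argmax token instead, which is why training and inference behavior can diverge (exposure bias).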

Beam search: at each decoding step, keep the top k partial sequences (by cumulative log-prob); at step t, expand each of the k beams by all vocab words, score, keep the top k; length normalization: divide by T^alpha with 0 < alpha < 1 (commonly alpha ~ 0.6-0.7) to prevent preference for short sequences.
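
A self-contained sketch of this procedure; the toy conditional distributions, the beam width k = 2, and alpha = 0.7 are illustrative assumptions:

```python
import math

def beam_search(step_logprobs, k, max_len, alpha=0.7, eos="</s>"):
    """Keep the top-k partial sequences by cumulative log-prob at each step;
    finished hypotheses are re-ranked with length normalization (divide by
    len^alpha) so short outputs are not favored by default.
    step_logprobs(seq) -> {token: log P(token | seq, x)} is assumed."""
    beams, done = [((), 0.0)], []
    for _ in range(max_len):
        cands = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                cands.append((seq + (tok,), score + lp))
        cands.sort(key=lambda p: p[1], reverse=True)
        beams = []
        for seq, score in cands:
            # Sequences that emitted EOS are finished; others stay on the beam.
            (done if seq[-1] == eos else beams).append((seq, score))
            if len(beams) == k:
                break
        if not beams:
            break
    return max(done + beams, key=lambda p: p[1] / len(p[0]) ** alpha)

# Toy model (made up): the decoder's conditional distributions per prefix.
TABLE = {
    (): {"a": 0.4, "b": 0.6},
    ("a",): {"</s>": 0.9, "b": 0.1},
    ("b",): {"</s>": 0.3, "a": 0.7},
    ("b", "a"): {"</s>": 1.0},
    ("a", "b"): {"</s>": 1.0},
}
lp = lambda seq: {t: math.log(p) for t, p in TABLE[seq].items()}
best, score = beam_search(lp, k=2, max_len=3)
```

With these numbers the length-normalized winner is the three-token hypothesis ("b", "a", "</s>"), even though the shorter ("a", "</s>") has higher per-token probability at step two.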

BLEU score (Bilingual Evaluation Understudy): measures n-gram precision of machine translation against reference translations; BLEU = BP * exp(sum_{n=1..N} w_n log p_n), where p_n = modified n-gram precision, BP = brevity penalty (punishes short outputs), w_n = 1/N typically; ranges 0-1 (higher is better); BLEU-4 (up to 4-grams, N = 4) is most common.
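
The formula can be implemented directly for the single-reference case; the example sentences below are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: BP * exp(sum_{n=1..N} (1/N) log p_n)."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Modified precision: clip each n-gram's count by its reference count.
        match = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if match == 0 or total == 0:
            return 0.0  # some precision is zero (or candidate too short)
        log_p += math.log(match / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_p)

ref = "the cat sat on the mat".split()
```

For example, the truncated candidate "the cat sat on" matches the reference perfectly on every n-gram it contains (all p_n = 1), so its score is purely the brevity penalty exp(1 - 6/4).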

Applications

Machine translation, text summarization, question answering (generative), code generation, speech-to-text (with encoder over audio features).

Trade-offs

Vanilla seq2seq bottlenecks all information through a fixed-size context vector (the “information bottleneck”); attention mechanism (see attention_mechanism) resolves this; beam search with larger beam width k is more accurate but slower; BLEU is imperfect (no semantic understanding, language-dependent tokenization).