Sequence-to-Sequence Models

Definition

Encoder-decoder architectures that map a variable-length input sequence to a variable-length output sequence; combined with beam search for decoding and BLEU for evaluation.

Intuition

The encoder summarizes the input sequence into a context representation; the decoder generates the output sequence one token at a time, conditioned on the context; beam search avoids greedy decoding by maintaining multiple candidate sequences.

Formal Description

Seq2seq: encoder RNN reads x_1, ..., x_T, producing final hidden state (context vector) c = h_T; decoder RNN generates y_1, ..., y_T', initialized with c; P(y_t | y_1, ..., y_{t-1}, x) is modeled by the decoder at each step; unlike a plain language model, the output is conditioned on the input.
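
The encode-then-decode loop can be sketched with a toy numpy RNN. All sizes, weights, and the zero-vector "start token" embedding are invented for illustration, and decoding here is greedy (one argmax per step) rather than beam search:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_H, VOCAB = 4, 8, 10   # toy dimensions (assumptions, not from the text)

def rnn_step(x, h, Wx, Wh, b):
    # One vanilla RNN step: h' = tanh(Wx @ x + Wh @ h + b)
    return np.tanh(Wx @ x + Wh @ h + b)

# Encoder: read the input sequence; only the final hidden state survives
# as the fixed-size context vector c = h_T.
Wx_e, Wh_e, b_e = rng.normal(size=(D_H, D_IN)), rng.normal(size=(D_H, D_H)), np.zeros(D_H)
xs = [rng.normal(size=D_IN) for _ in range(5)]
h = np.zeros(D_H)
for x in xs:
    h = rnn_step(x, h, Wx_e, Wh_e, b_e)
c = h  # context vector

# Decoder: initialized with c, emits one token per step, feeding the
# embedding of its own previous output back in (autoregressive).
embed = rng.normal(size=(VOCAB, D_IN))
Wx_d, Wh_d, b_d = rng.normal(size=(D_H, D_IN)), rng.normal(size=(D_H, D_H)), np.zeros(D_H)
Wo = rng.normal(size=(VOCAB, D_H))

h, y = c, np.zeros(D_IN)  # zero vector stands in for a start-token embedding
tokens = []
for _ in range(4):
    h = rnn_step(y, h, Wx_d, Wh_d, b_d)
    tok = int(np.argmax(Wo @ h))  # greedy pick from P(y_t | y_<t, x)
    tokens.append(tok)
    y = embed[tok]
```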

Machine translation as conditional LM: P(y | x) = prod_t P(y_t | y_1, ..., y_{t-1}, x); training: teacher forcing (feed ground-truth previous tokens); inference: autoregressive decoding (feed the model's own predictions).
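
A minimal sketch of the teacher-forced training loss, assuming the model's conditional probabilities are given as a lookup table; the tokens and probability values are made up:

```python
import math

def teacher_forcing_nll(cond_prob, target):
    """Sum of -log P(y_t | y_{t-1}, x), always conditioning on the
    ground-truth previous token (teacher forcing), never on a sample."""
    nll, prev = 0.0, "<s>"
    for y in target:
        nll -= math.log(cond_prob[(prev, y)])
        prev = y  # feed the ground truth, not the model's own prediction
    return nll

# Toy table standing in for the decoder's P(next | prev, x).
cond_prob = {("<s>", "bonjour"): 0.8, ("bonjour", "</s>"): 0.9}
loss = teacher_forcing_nll(cond_prob, ["bonjour", "</s>"])
```

At inference time the `prev = y` line is what changes: the decoder must feed back its own sampled or argmax token instead, which is why training and inference behavior can diverge (exposure bias).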

Beam search: at each decoding step, keep the top k partial sequences (by cumulative log-prob); at step t, expand each of the k beams by all vocab words, score, keep the top k; length normalization: divide by T^alpha with 0 < alpha < 1 (commonly alpha ~ 0.6-0.7) to prevent preference for short sequences.
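
A self-contained sketch of this procedure; the toy conditional distributions, the beam width k = 2, and alpha = 0.7 are illustrative assumptions:

```python
import math

def beam_search(step_logprobs, k, max_len, alpha=0.7, eos="</s>"):
    """Keep the top-k partial sequences by cumulative log-prob at each step;
    finished hypotheses are re-ranked with length normalization (divide by
    len^alpha) so short outputs are not favored by default.
    step_logprobs(seq) -> {token: log P(token | seq, x)} is assumed."""
    beams, done = [((), 0.0)], []
    for _ in range(max_len):
        cands = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                cands.append((seq + (tok,), score + lp))
        cands.sort(key=lambda p: p[1], reverse=True)
        beams = []
        for seq, score in cands:
            # Sequences that emitted EOS are finished; others stay on the beam.
            (done if seq[-1] == eos else beams).append((seq, score))
            if len(beams) == k:
                break
        if not beams:
            break
    return max(done + beams, key=lambda p: p[1] / len(p[0]) ** alpha)

# Toy model (made up): the decoder's conditional distributions per prefix.
TABLE = {
    (): {"a": 0.4, "b": 0.6},
    ("a",): {"</s>": 0.9, "b": 0.1},
    ("b",): {"</s>": 0.3, "a": 0.7},
    ("b", "a"): {"</s>": 1.0},
    ("a", "b"): {"</s>": 1.0},
}
lp = lambda seq: {t: math.log(p) for t, p in TABLE[seq].items()}
best, score = beam_search(lp, k=2, max_len=3)
```

With these numbers the length-normalized winner is the three-token hypothesis ("b", "a", "</s>"), even though the shorter ("a", "</s>") has higher per-token probability at step two.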

BLEU score (Bilingual Evaluation Understudy): measures n-gram precision of machine translation against reference translations; BLEU = BP * exp(sum_{n=1..N} w_n log p_n), where p_n = modified n-gram precision, BP = brevity penalty (punishes short outputs), w_n = 1/N typically; ranges 0-1 (higher is better); BLEU-4 (up to 4-grams, N = 4) is most common.
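
The formula can be implemented directly for the single-reference case; the example sentences below are invented:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: BP * exp(sum_{n=1..N} (1/N) log p_n)."""
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Modified precision: clip each n-gram's count by its reference count.
        match = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if match == 0 or total == 0:
            return 0.0  # some precision is zero (or candidate too short)
        log_p += math.log(match / total) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_p)

ref = "the cat sat on the mat".split()
```

For example, the truncated candidate "the cat sat on" matches the reference perfectly on every n-gram it contains (all p_n = 1), so its score is purely the brevity penalty exp(1 - 6/4).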

Applications

Machine translation, text summarization, question answering (generative), code generation, speech-to-text (with encoder over audio features).

Trade-offs

Vanilla seq2seq bottlenecks all information through a fixed-size context vector (the “information bottleneck”); attention mechanism (see attention_mechanism) resolves this; beam search with larger beam width k is more accurate but slower; BLEU is imperfect (no semantic understanding, language-dependent tokenization).