Multimodal Models

Core Idea

Multimodal models jointly encode and align information from two or more data modalities — most commonly images and text. The central challenge is building a shared representation space where semantically related content from different modalities is nearby, enabling cross-modal retrieval, visual question answering, and image captioning.

Mathematical Formulation

Contrastive Vision-Language Pre-training (CLIP)

CLIP (Radford et al., 2021) trains an image encoder $f_I$ and a text encoder $f_T$ on a batch of $N$ (image, text) pairs using a symmetric InfoNCE loss:

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)} \right]$$

where $s_{ij} = \cos\big(f_I(x_i), f_T(t_j)\big)$ is the cosine similarity between image $i$ and text $j$, and $\tau$ is a learnable temperature.
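The symmetric InfoNCE objective can be sketched in NumPy; this is a minimal illustration with a fixed temperature standing in for the learnable $\tau$:

```python
import numpy as np

def clip_infonce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over N matched (image, text) pairs.

    img_emb, txt_emb: (N, d) arrays; row i of each is a matched pair.
    temperature: stands in for the learnable temperature tau.
    """
    # L2-normalise so dot products are cosine similarities s_ij.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))           # matched pairs on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly aligned embeddings (identical matched rows, orthogonal otherwise) the loss approaches zero.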

After pre-training, zero-shot classification is performed by computing cosine similarity between the image embedding and text embeddings for each class label.
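The zero-shot procedure reduces to an argmax over cosine similarities; a toy sketch with hand-made embeddings (the prompt strings in the comments are illustrative, not real CLIP outputs):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding has the highest cosine
    similarity with the image embedding (CLIP-style zero-shot)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img                        # cosine similarity per class
    return int(np.argmax(sims)), sims

# Toy embeddings: the second class's text vector aligns with the image.
image = np.array([0.0, 1.0, 0.0])
prompts = np.array([[1.0, 0.1, 0.0],      # "a photo of a dog"
                    [0.1, 1.0, 0.0],      # "a photo of a cat"
                    [0.0, 0.1, 1.0]])     # "a photo of a car"
pred, sims = zero_shot_classify(image, prompts)  # pred == 1
```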

Fusion Strategies

Early fusion: concatenate raw inputs or low-level features before processing.

Late fusion: process each modality independently and combine final representations:

$$z = g(z_I, z_T)$$

where $g$ is a learned combiner (e.g., MLP, cross-attention).
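A minimal late-fusion sketch, assuming a hypothetical one-layer MLP as the combiner $g$ (the weights here are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def late_fusion(z_img, z_txt, W, b):
    """Combine per-modality embeddings with a one-layer MLP g:
    ReLU(W [z_I; z_T] + b). W and b are hypothetical learned weights."""
    z = np.concatenate([z_img, z_txt])    # concatenate final representations
    return np.maximum(0.0, W @ z + b)     # fused representation

z_img = rng.standard_normal(4)            # image embedding, d = 4
z_txt = rng.standard_normal(4)            # text embedding, d = 4
W = rng.standard_normal((8, 8))           # fused dimension 8
b = np.zeros(8)
fused = late_fusion(z_img, z_txt, W, b)
```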

Cross-attention fusion: one modality attends to the other:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

with $Q$ from one modality and $K, V$ from the other. Used in Visual Question Answering (VQA) models.
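Cross-attention fusion can be sketched as standard scaled dot-product attention, with queries from one modality and keys/values from the other; the shapes below (5 question tokens attending over a 7x7 image feature grid) are illustrative:

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention: queries from one modality
    (e.g. question tokens), keys/values from another (image regions)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_kv)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
    return weights @ V                                    # (n_q, d_v)

rng = np.random.default_rng(1)
question_tokens = rng.standard_normal((5, 16))   # 5 text tokens, d = 16
image_regions = rng.standard_normal((49, 16))    # 7x7 feature grid
attended = cross_attention(question_tokens, image_regions, image_regions)
```

Each output row is a convex combination of image-region features, weighted by relevance to the corresponding question token.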

Visual Question Answering (VQA)

Given image $I$ and question $q$ (text), predict the answer $\hat{a} = \arg\max_a p(a \mid I, q)$:

  1. Extract visual features: $V = f_I(I)$ (feature map grid or object detections).
  2. Encode question: $h = f_T(q)$.
  3. Cross-attend $h$ over $V$, then classify over the answer vocabulary.
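The three steps above can be sketched as a minimal VQA head; every weight here is a random stand-in for a learned parameter, and the 1000-answer vocabulary is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def vqa_forward(visual_feats, question_emb, W_cls):
    """Minimal VQA head: the question attends over image regions,
    and the attended summary is classified over an answer vocabulary."""
    d = question_emb.shape[-1]
    scores = visual_feats @ question_emb / np.sqrt(d)   # (n_regions,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                # attention over regions
    context = alpha @ visual_feats                      # (d,) weighted sum
    logits = W_cls @ np.concatenate([context, question_emb])
    return int(np.argmax(logits))                       # predicted answer index

visual = rng.standard_normal((49, 32))    # step 1: 7x7 region features
question = rng.standard_normal(32)        # step 2: pooled question embedding
W_cls = rng.standard_normal((1000, 64))   # classifier over 1000 answers
answer_id = vqa_forward(visual, question, W_cls)
```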

Inductive Bias

  • Contrastive alignment: assumes semantically matching image-text pairs should have similar embeddings; relies on large datasets of image-caption pairs for pre-training.
  • Cross-attention fusion: assumes relevant visual regions can be identified from the query modality; spatially explicit.

Training Objective

Pre-training typically uses one or more of:

  • Contrastive (InfoNCE): aligns matching pairs, pushes away non-matching.
  • Masked language modelling (MLM) conditioned on image features.
  • Image-text matching (ITM): binary classification of whether image and text correspond.

Strengths

  • Zero-shot transfer: CLIP can classify images into novel categories by describing them in text.
  • Generalises across vision tasks by leveraging language supervision.
  • Enables multimodal retrieval (image → text, text → image).

Weaknesses

  • Requires very large-scale data (400M+ image-text pairs for CLIP).
  • Contrastive objectives may align surface-level patterns rather than deep semantic content.
  • Fusion is non-trivial: naively combining modalities often underperforms single-modality baselines.
  • Compositional understanding is a known weakness (e.g., “a blue circle above a red square” vs “a red circle above a blue square”).

Variants

  • BLIP-2: adds a lightweight Querying Transformer (Q-Former) bridging frozen image encoder and LLM.
  • Flamingo: interleaves cross-attention layers into a frozen LLM to inject visual conditioning.
  • LLaVA: instruction-tuned vision-language model using CLIP image encoder + LLaMA/Vicuna LLM.
  • ImageBind: aligns six modalities (image, text, audio, depth, thermal, IMU) in a single embedding space.

References

  • Radford, A. et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” ICML (CLIP).
  • Li, J. et al. (2023). “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.” ICML.
  • Alayrac, J-B. et al. (2022). “Flamingo: a Visual Language Model for Few-Shot Learning.” NeurIPS.