Multimodal Models

Core Idea

Multimodal models jointly encode and align information from two or more data modalities — most commonly images and text. The central challenge is building a shared representation space where semantically related content from different modalities is nearby, enabling cross-modal retrieval, visual question answering, and image captioning.

Mathematical Formulation

Contrastive Vision-Language Pre-training (CLIP)

CLIP (Radford et al., 2021) trains an image encoder $f_I$ and a text encoder $f_T$ on a batch of $N$ (image, text) pairs using a symmetric InfoNCE loss:

$$\mathcal{L} = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ij}/\tau)} + \log \frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N} \exp(s_{ji}/\tau)} \right]$$

where $s_{ij} = \cos\big(f_I(x_i), f_T(t_j)\big)$ is the cosine similarity between image $i$ and text $j$, and $\tau$ is a learnable temperature.
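The symmetric InfoNCE objective can be sketched in NumPy; this is a minimal illustration with a fixed temperature standing in for the learnable $\tau$:

```python
import numpy as np

def clip_infonce_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over N matched (image, text) pairs.

    img_emb, txt_emb: (N, d) arrays; row i of each is a matched pair.
    temperature: stands in for the learnable temperature tau.
    """
    # L2-normalise so dot products are cosine similarities s_ij.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))           # matched pairs on the diagonal

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

With perfectly aligned embeddings (identical matched rows, orthogonal otherwise) the loss approaches zero.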

After pre-training, zero-shot classification is performed by computing cosine similarity between the image embedding and text embeddings for each class label.
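The zero-shot procedure reduces to an argmax over cosine similarities; a toy sketch with hand-made embeddings (the prompt strings in the comments are illustrative, not real CLIP outputs):

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs):
    """Pick the class whose text embedding has the highest cosine
    similarity with the image embedding (CLIP-style zero-shot)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img                        # cosine similarity per class
    return int(np.argmax(sims)), sims

# Toy embeddings: the second class's text vector aligns with the image.
image = np.array([0.0, 1.0, 0.0])
prompts = np.array([[1.0, 0.1, 0.0],      # "a photo of a dog"
                    [0.1, 1.0, 0.0],      # "a photo of a cat"
                    [0.0, 0.1, 1.0]])     # "a photo of a car"
pred, sims = zero_shot_classify(image, prompts)  # pred == 1
```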

Fusion Strategies

Early fusion: concatenate raw inputs or low-level features before processing.

Late fusion: process each modality independently and combine final representations:

$$z = g(z_I, z_T)$$

where $g$ is a learned combiner (e.g., MLP, cross-attention).
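A minimal late-fusion sketch, assuming a hypothetical one-layer MLP as the combiner $g$ (the weights here are random stand-ins for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def late_fusion(z_img, z_txt, W, b):
    """Combine per-modality embeddings with a one-layer MLP g:
    ReLU(W [z_I; z_T] + b). W and b are hypothetical learned weights."""
    z = np.concatenate([z_img, z_txt])    # concatenate final representations
    return np.maximum(0.0, W @ z + b)     # fused representation

z_img = rng.standard_normal(4)            # image embedding, d = 4
z_txt = rng.standard_normal(4)            # text embedding, d = 4
W = rng.standard_normal((8, 8))           # fused dimension 8
b = np.zeros(8)
fused = late_fusion(z_img, z_txt, W, b)
```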

Cross-attention fusion: one modality attends to the other:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

with $Q$ from one modality and $K, V$ from the other. Used in Visual Question Answering (VQA) models.
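Cross-attention fusion can be sketched as standard scaled dot-product attention, with queries from one modality and keys/values from the other; the shapes below (5 question tokens attending over a 7x7 image feature grid) are illustrative:

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention: queries from one modality
    (e.g. question tokens), keys/values from another (image regions)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_kv)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)         # row-wise softmax
    return weights @ V                                    # (n_q, d_v)

rng = np.random.default_rng(1)
question_tokens = rng.standard_normal((5, 16))   # 5 text tokens, d = 16
image_regions = rng.standard_normal((49, 16))    # 7x7 feature grid
attended = cross_attention(question_tokens, image_regions, image_regions)
```

Each output row is a convex combination of image-region features, weighted by relevance to the corresponding question token.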

Visual Question Answering (VQA)

Given image $I$ and question $q$ (text), predict the answer $\hat{a} = \arg\max_a p(a \mid I, q)$:

  1. Extract visual features: $V = f_I(I)$ (feature map grid or object detections).
  2. Encode question: $h = f_T(q)$.
  3. Cross-attend $h$ over $V$, then classify over the answer vocabulary.
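The three steps above can be sketched as a minimal VQA head; every weight here is a random stand-in for a learned parameter, and the 1000-answer vocabulary is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def vqa_forward(visual_feats, question_emb, W_cls):
    """Minimal VQA head: the question attends over image regions,
    and the attended summary is classified over an answer vocabulary."""
    d = question_emb.shape[-1]
    scores = visual_feats @ question_emb / np.sqrt(d)   # (n_regions,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                # attention over regions
    context = alpha @ visual_feats                      # (d,) weighted sum
    logits = W_cls @ np.concatenate([context, question_emb])
    return int(np.argmax(logits))                       # predicted answer index

visual = rng.standard_normal((49, 32))    # step 1: 7x7 region features
question = rng.standard_normal(32)        # step 2: pooled question embedding
W_cls = rng.standard_normal((1000, 64))   # classifier over 1000 answers
answer_id = vqa_forward(visual, question, W_cls)
```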

Inductive Bias

  • Contrastive alignment: assumes semantically matching image-text pairs should have similar embeddings; relies on large datasets of image-caption pairs for pre-training.
  • Cross-attention fusion: assumes relevant visual regions can be identified from the query modality; spatially explicit.

Training Objective

Pre-training typically uses one or more of:

  • Contrastive (InfoNCE): aligns matching pairs, pushes away non-matching.
  • Masked language modelling (MLM) conditioned on image features.
  • Image-text matching (ITM): binary classification of whether image and text correspond.

Strengths

  • Zero-shot transfer: CLIP can classify images into novel categories by describing them in text.
  • Generalises across vision tasks by leveraging language supervision.
  • Enables multimodal retrieval (image → text, text → image).

Weaknesses

  • Requires very large-scale data (400M+ image-text pairs for CLIP).
  • Contrastive objectives may align surface-level patterns rather than deep semantic content.
  • Fusion is non-trivial: naively combining modalities often underperforms single-modality baselines.
  • Compositional understanding is a known weakness (e.g., “a blue circle above a red square” vs “a red circle above a blue square”).

Variants

  • BLIP-2: adds a lightweight Querying Transformer (Q-Former) bridging frozen image encoder and LLM.
  • Flamingo: interleaves cross-attention layers into a frozen LLM to inject visual conditioning.
  • LLaVA: instruction-tuned vision-language model using CLIP image encoder + LLaMA/Vicuna LLM.
  • ImageBind: aligns six modalities (image, text, audio, depth, thermal, IMU) in a single embedding space.

References

  • Radford, A. et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” ICML (CLIP).
  • Li, J. et al. (2023). “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.” ICML.
  • Alayrac, J-B. et al. (2022). “Flamingo: a Visual Language Model for Few-Shot Learning.” NeurIPS.