Multimodal Models
Core Idea
Multimodal models jointly encode and align information from two or more data modalities — most commonly images and text. The central challenge is building a shared representation space where semantically related content from different modalities is nearby, enabling cross-modal retrieval, visual question answering, and image captioning.
Mathematical Formulation
Contrastive Vision-Language Pre-training (CLIP)
CLIP (Radford et al., 2021) trains an image encoder and a text encoder on (image, text) pairs using InfoNCE loss:
$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(I_i, T_j)/\tau)} + \log\frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(I_j, T_i)/\tau)}\right]$$
where $\mathrm{sim}(I, T) = \frac{I \cdot T}{\|I\|\,\|T\|}$ is cosine similarity and $\tau$ is a learnable temperature.
After pre-training, zero-shot classification is performed by computing cosine similarity between the image embedding and text embeddings for each class label.
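The symmetric InfoNCE objective above can be sketched in a few lines of numpy. This is an illustrative toy, not CLIP's actual implementation; the function name and the default temperature of 0.07 are assumptions for the example:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_infonce_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched (image, text) pairs.

    Matching pairs sit on the diagonal of the N x N similarity matrix;
    the loss is cross-entropy in both directions (image->text, text->image).
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # cosine similarities / tau
    n = logits.shape[0]

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)      # numerical stability
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[np.arange(n), np.arange(n)].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Perfectly aligned embeddings drive the loss towards zero, while mismatched pairs are penalised heavily, which is what pulls matching pairs together in the shared space.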
Fusion Strategies
Early fusion: concatenate raw inputs or low-level features before processing.
Late fusion: process each modality independently and combine final representations: $z = g(f_{\text{img}}(x_{\text{img}}), f_{\text{txt}}(x_{\text{txt}}))$,
where $g$ is a learned combiner (e.g., MLP, cross-attention).
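A minimal numpy sketch of late fusion, with randomly initialised toy weights standing in for a trained MLP combiner $g$ (the function name and shapes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def late_fusion(img_feat, txt_feat, W, b):
    """g(.): concatenate the two unimodal representations and apply a
    single ReLU layer -- a one-layer stand-in for a trained MLP combiner."""
    z = np.concatenate([img_feat, txt_feat], axis=-1)
    return np.maximum(z @ W + b, 0.0)

# Toy per-modality representations (outputs of two separate encoders).
img = rng.normal(size=(2, 8))
txt = rng.normal(size=(2, 8))
W = rng.normal(size=(16, 4))   # hypothetical combiner weights
b = np.zeros(4)
fused = late_fusion(img, txt, W, b)   # joint representation, shape (2, 4)
```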
Cross-attention fusion: one modality attends to the other: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top/\sqrt{d_k}\right)V$,
with $Q$ from one modality and $K, V$ from the other. Used in Visual Question Answering (VQA) models.
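The cross-attention form above can be written directly in numpy; here `Q` would come from, say, text tokens and `K`, `V` from image patches (the function name is our own):

```python
import numpy as np

def cross_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q from the query modality
    and K, V from the other modality."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V   # each output row is a convex combination of V's rows
```

Because the attention weights sum to one, every output vector lies inside the convex hull of the value vectors from the other modality.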
Visual Question Answering (VQA)
Given image $I$ and question $q$ (text), predict answer $a$: $\hat{a} = \arg\max_{a} p(a \mid I, q)$.
- Extract visual features: $V = f_{\text{img}}(I)$ (a feature-map grid or object detections).
- Encode question: $h_q = f_{\text{txt}}(q)$.
- Cross-attend $h_q$ over $V$, then classify over the answer vocabulary.
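The three steps above can be sketched end-to-end with random toy tensors standing in for trained components; the shapes (a 7×7 visual grid, a 1000-way answer vocabulary) and all weights are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Stand-ins for trained encoders: a vision backbone would produce the
# 7x7 grid of visual features, a text encoder the pooled question vector.
visual_feats = rng.normal(size=(49, 32))   # V: flattened 7x7 feature map
question_vec = rng.normal(size=(1, 32))    # h_q: pooled question embedding

# The question attends over the visual grid ...
attn = softmax(question_vec @ visual_feats.T / np.sqrt(32))
attended = attn @ visual_feats             # question-conditioned visual summary

# ... and a linear head (toy weights) scores a fixed answer vocabulary.
W_ans = rng.normal(size=(64, 1000))
logits = np.concatenate([attended, question_vec], axis=-1) @ W_ans
answer_id = int(logits.argmax())           # index into the answer vocabulary
```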
Inductive Bias
- Contrastive alignment: assumes semantically matching image-text pairs should have similar embeddings; relies on large datasets of image-caption pairs for pre-training.
- Cross-attention fusion: assumes relevant visual regions can be identified from the query modality; spatially explicit.
Training Objective
Pre-training typically uses one or more of:
- Contrastive (InfoNCE): aligns matching pairs, pushes away non-matching.
- Masked language modelling (MLM) conditioned on image features.
- Image-text matching (ITM): binary classification of whether image and text correspond.
Strengths
- Zero-shot transfer: CLIP can classify images into novel categories by describing them in text.
- Generalises across vision tasks by leveraging language supervision.
- Enables multimodal retrieval (image → text, text → image).
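Once both modalities live in one embedding space, retrieval in either direction reduces to a cosine-similarity ranking; a minimal sketch (function name ours, embeddings assumed pre-computed by the two encoders):

```python
import numpy as np

def retrieve_top_k(query_emb, gallery_embs, k=3):
    """Rank gallery items by cosine similarity to the query in the shared
    embedding space (works for text->image or image->text retrieval)."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    sims = g @ q                      # cosine similarity to each gallery item
    return np.argsort(-sims)[:k]      # indices of the k nearest items
```

A text query embedding ranked against image embeddings gives text→image search; swapping the roles gives image→text.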
Weaknesses
- Requires very large-scale data (400M+ image-text pairs for CLIP).
- Contrastive objectives may align surface-level patterns rather than deep semantic content.
- Fusion is non-trivial: naively combining modalities often underperforms single-modality baselines.
- Compositional understanding is a known weakness (e.g., “a blue circle above a red square” vs “a red circle above a blue square”).
Variants
- BLIP-2: adds a lightweight Querying Transformer (Q-Former) bridging frozen image encoder and LLM.
- Flamingo: interleaves cross-attention layers into a frozen LLM to inject visual conditioning.
- LLaVA: instruction-tuned vision-language model using CLIP image encoder + LLaMA/Vicuna LLM.
- ImageBind: aligns six modalities (image, text, audio, depth, thermal, IMU) in a single embedding space.
References
- Radford, A. et al. (2021). “Learning Transferable Visual Models From Natural Language Supervision.” ICML (CLIP).
- Li, J. et al. (2023). “BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.” ICML.
- Alayrac, J-B. et al. (2022). “Flamingo: a Visual Language Model for Few-Shot Learning.” NeurIPS.