Part 2 - Week 3
Mrinmaya Sachan
Multimodal Models
Motivation
- Language-only limitations:
- Even with massive corpora, purely textual pretraining may be insufficient to build a robust world model.
- Many real-world concepts are grounded in perception and action; language alone may not capture these.
- Bender & Koller, 2020 (“Climbing towards NLU”) argue that meaning cannot be learned from linguistic form alone; without grounding, models risk becoming “stochastic parrots” (Bender et al., 2021).
- Practical drivers:
- Internet content is inherently multimodal (images, audio, video, text).
- Non-text modalities (vision, audio) have higher information bandwidth than text.
- Multimodal prompts enable richer conditioning for downstream tasks.
Vision–Language Models (VLMs)
Task Categories
- Image–Text tasks:
- Visual Question Answering (VQA) and visual reasoning.
- Image captioning.
- Image–text retrieval (both directions).
- Visual grounding (localizing objects in images from text queries).
- Text-to-image generation.
- Computer Vision tasks with language supervision:
- Object detection, segmentation, classification — improved via language-aligned supervision (e.g., CLIP).
- Video–Text tasks:
- Video captioning, video QA, temporal grounding — require modeling temporal dynamics.
Core Architectural Components
- Text encoder:
- Produces \(N\) textual feature vectors (one per token).
- Often a Transformer encoder (BERT, RoBERTa, etc.).
- Image encoder:
- Produces \(M\) visual feature vectors (e.g., one per patch or region proposal).
- Architectures:
- Object detection–based: R-CNN, Faster R-CNN — extract region features + bounding boxes.
- CNN-based: ResNet, EfficientNet — global or patch-level features.
- Vision Transformers (ViT, Dosovitskiy et al., 2020):
- Split the image into fixed-size patches → flatten each patch → linear projection → add positional encodings → Transformer encoder (see the code sketch after this list).
- Treats patches as “visual tokens.”
- Multimodal fusion module:
- Learns cross-modal interactions.
- Decoder:
- Generates output text (for generative tasks) or classification logits (for discriminative tasks).
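A minimal PyTorch sketch of the ViT patch-embedding step described above; the class name and hyperparameters are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding vector."""
    def __init__(self, image_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # A strided convolution is equivalent to "flatten each patch + linear projection".
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images):                    # images: (B, 3, H, W)
        x = self.proj(images)                     # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)          # (B, M, D) sequence of "visual tokens"
        return x + self.pos_embed                 # add learned positional encodings
```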
Multimodal Fusion Techniques
- Merged attention (single-stream):
- Concatenate text and image tokens → feed into a single Transformer stack.
- All self-attention layers operate jointly over both modalities.
- Example: VisualBERT (Li et al., 2019).
- Co-attention (dual-stream):
- Separate Transformer stacks for each modality.
- Cross-attention layers exchange information between the two streams (both fusion styles are sketched in code after this list).
- Examples:
- LXMERT (Tan & Bansal, 2019).
- Flamingo (Alayrac et al., DeepMind 2022): interleaves gated cross-attention layers into a frozen pretrained LM so that language layers can attend to visual features.
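A schematic PyTorch sketch contrasting the two fusion styles; module and tensor names are illustrative, and real models stack many such layers with modality-specific parameters.

```python
import torch
import torch.nn as nn

D = 768
text_tokens  = torch.randn(2, 32, D)   # (batch, N text tokens, dim)
image_tokens = torch.randn(2, 49, D)   # (batch, M visual tokens, dim)

# Merged attention (single-stream): concatenate, then joint self-attention over both modalities.
single_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=12, batch_first=True), num_layers=2)
fused = single_stream(torch.cat([text_tokens, image_tokens], dim=1))    # (2, N + M, D)

# Co-attention (dual-stream): each stream keeps its own layers and exchanges information
# via cross-attention (a single shared module is used here purely for brevity).
cross_attn = nn.MultiheadAttention(embed_dim=D, num_heads=12, batch_first=True)
text_to_image, _ = cross_attn(query=text_tokens, key=image_tokens, value=image_tokens)
image_to_text, _ = cross_attn(query=image_tokens, key=text_tokens, value=text_tokens)
```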
Pretraining Strategies
Alignment Objectives
Language modeling with visual context:
- Predict next token (causal LM) or masked tokens (MLM) given both text and image features.
- Extends BERT/GPT-style objectives to multimodal inputs (a minimal sketch of the causal variant follows).
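A minimal sketch of the causal-LM variant, in which projected visual features are prepended as a prefix to the text embeddings. It assumes a Hugging Face-style causal LM that accepts `inputs_embeds`; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def visual_lm_loss(lm, visual_tokens, text_embeds, text_ids):
    """visual_tokens: (B, M, D) image features projected to the LM width,
    text_embeds: (B, N, D) text token embeddings, text_ids: (B, N) target token ids."""
    inputs = torch.cat([visual_tokens, text_embeds], dim=1)   # (B, M + N, D) visual prefix + text
    logits = lm(inputs_embeds=inputs).logits                  # causal LM over the full sequence
    M = visual_tokens.size(1)
    text_logits = logits[:, M - 1:-1, :]                      # positions whose next token is a text token
    return F.cross_entropy(text_logits.reshape(-1, text_logits.size(-1)), text_ids.reshape(-1))
```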
Image–Text Matching (ITM):
- Binary classification: does this image match this caption?
- Use the [CLS] token representation → binary classifier (a minimal head is sketched below).
- Often trained with hard negatives (mismatched but semantically similar pairs).
Image–Text Contrastive Learning (ITC):
- Learn a joint embedding space for images and text.
- Given \(N\) image–caption pairs in a batch, treat the \(N\) correct pairs as positives and the \(N^2 - N\) others as negatives.
- Optimize a symmetric InfoNCE loss; the image→text direction is \[ \mathcal{L}_{i2t} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^N \exp(\text{sim}(v_i, t_j)/\tau)}, \] where \(\text{sim}\) is cosine similarity and \(\tau\) is a temperature parameter; the text→image term swaps the roles of images and captions, and the two terms are averaged (see the code sketch below).
- CLIP (Radford et al., OpenAI 2021) is the canonical example.
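A compact sketch of the symmetric loss above, assuming the batch of image and text embeddings has already been projected and L2-normalized.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_embeds, text_embeds, temperature=0.07):
    """image_embeds, text_embeds: (N, D), L2-normalized, so the dot product is cosine similarity."""
    logits = image_embeds @ text_embeds.t() / temperature    # (N, N) scaled similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)              # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)          # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```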
Masked Image Modeling (MIM):
- Mask random image patches and predict them from remaining patches + text context.
- Analogous to MLM in language.
- Implementation depends on the visual encoder (e.g., ViT with MAE-style reconstruction); a patch-masking sketch follows.
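A minimal sketch of random patch masking in the MAE style; the mask ratio and names are illustrative, and the decoder that reconstructs masked patches from visible patches plus text context is omitted.

```python
import torch

def mask_patches(patch_tokens, mask_ratio=0.75):
    """patch_tokens: (B, M, D). Returns the visible tokens and the indices of masked patches."""
    B, M, D = patch_tokens.shape
    num_masked = int(M * mask_ratio)
    scores = torch.rand(B, M, device=patch_tokens.device)    # one random score per patch
    order = scores.argsort(dim=1)
    masked_idx, keep_idx = order[:, :num_masked], order[:, num_masked:]
    visible = torch.gather(patch_tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, masked_idx   # a decoder predicts the masked patches from `visible` (+ text)
```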
Notable Models
- VisualBERT (Li et al., 2019):
- Single-stream Transformer.
- Pretrained with MLM + ITM.
- LXMERT (Tan & Bansal, 2019):
- Dual-stream with cross-attention.
- Pretrained with MLM, ITM, and visual feature prediction.
- CLIP (Radford et al., 2021):
- Dual encoders (image + text) trained with ITC.
- Projects both modalities into a shared embedding space.
- Enables zero-shot transfer to many vision tasks via text prompt engineering (see the sketch after this list).
- ALIGN (Jia et al., 2021):
- Similar to CLIP but trained on larger noisy web data.
- Flamingo (Alayrac et al., 2022):
- Frozen pretrained LM (e.g., Chinchilla) + vision encoder + cross-attention layers.
- Supports interleaved image–text sequences.
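A sketch of zero-shot image classification with CLIP's dual encoders, using the Hugging Face `transformers` CLIP checkpoint; the label set, prompt template, and image path are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["cat", "dog", "car"]
prompts = [f"a photo of a {label}" for label in labels]   # simple prompt engineering
image = Image.open("example.jpg")                         # placeholder image path

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)          # image-text similarities -> class probabilities
print(dict(zip(labels, probs[0].tolist())))
```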
Benefits of Multimodal Pretraining
- Improves performance on multimodal tasks and can transfer to unimodal (pure text) tasks.
- Provides grounding for language models, potentially improving factuality and reasoning about the physical world.
- Enables zero-shot and few-shot capabilities in vision tasks via text prompts.