Part 2 - Week 3

Mrinmaya Sachan
Published: Tuesday, April 15, 2025

Multimodal Models

Motivation

  • Language-only limitations:
    • Even with massive corpora, purely textual pretraining may be insufficient to build a robust world model.
    • Many real-world concepts are grounded in perception and action; language alone may not capture these.
    • Bender & Koller, 2020 (“Climbing towards NLU”) argue that meaning cannot be learned from linguistic form alone; without grounding, models only mimic understanding.
  • Practical drivers:
    • Internet content is inherently multimodal (images, audio, video, text).
    • Non-text modalities (vision, audio) have higher information bandwidth than text.
    • Multimodal prompts enable richer conditioning for downstream tasks.

Vision–Language Models (VLMs)

Task Categories

  • Image–Text tasks:
    • Visual Question Answering (VQA) and visual reasoning.
    • Image captioning.
    • Image–text retrieval (both directions).
    • Visual grounding (localizing objects in images from text queries).
    • Text-to-image generation.
  • Computer Vision tasks with language supervision:
    • Object detection, segmentation, classification — improved via language-aligned supervision (e.g., CLIP).
  • Video–Text tasks:
    • Video captioning, video QA, temporal grounding — require modeling temporal dynamics.

Core Architectural Components

  1. Text encoder:
    • Produces \(N\) textual feature vectors (tokens).
    • Often a Transformer encoder (BERT, RoBERTa, etc.).
  2. Image encoder:
    • Produces \(M\) visual feature vectors (patches, region proposals).
    • Architectures:
      • Object detection–based: R-CNN, Faster R-CNN — extract region features + bounding boxes.
      • CNN-based: ResNet, EfficientNet — global or patch-level features.
      • Vision Transformers (ViT, Dosovitskiy et al., 2020):
        • Split the image into fixed-size patches → flatten each patch → linear projection → add positional embeddings → Transformer encoder.
        • Treats patches as “visual tokens” (see the patch-embedding sketch after this list).
  3. Multimodal fusion module:
    • Learns cross-modal interactions.
  4. Decoder:
    • Generates output text (for generative tasks) or classification logits (for discriminative tasks).
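
A minimal sketch of the ViT-style patch embedding used by the image encoder above, in PyTorch; the class and parameter names (`PatchEmbed`, `patch_size`, etc.) are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into fixed-size patches and project each one to an embedding.

    Illustrative sketch of the ViT front end: a Conv2d with stride equal to the
    patch size is equivalent to flattening non-overlapping patches and applying
    a shared linear projection.
    """
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # One learned positional embedding per patch ("visual token").
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, images):              # images: (B, 3, H, W)
        x = self.proj(images)               # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)    # (B, M, D): M visual tokens
        return x + self.pos_embed

# The resulting sequence of M visual tokens is fed to a standard Transformer
# encoder, just like a sequence of text tokens.
visual_tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))   # -> (2, 196, 768)
```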

Multimodal Fusion Techniques

  • Merged attention (single-stream):
    • Concatenate text and image tokens → feed into a single Transformer stack.
    • All self-attention layers operate jointly over both modalities.
    • Example: VisualBERT (Li et al., 2019).
  • Co-attention (dual-stream):
    • Separate Transformer stacks for each modality.
    • Cross-attention layers exchange information between modalities (see the fusion-layer sketch after this list).
    • Examples:
      • LXMERT (Tan & Bansal, 2019).
      • Flamingo (Alayrac et al., DeepMind 2022) — interleaves frozen pretrained LM with cross-attention “gated” vision layers.
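
A minimal sketch of a dual-stream fusion layer in PyTorch: text tokens attend to visual tokens through cross-attention, loosely in the spirit of LXMERT and Flamingo's gated cross-attention. The class name, the zero-initialised tanh gate, and the dimensions are illustrative assumptions, not the exact recipe of either model.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One co-attention block: text queries attend over visual keys/values."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Gate initialised at zero so that, with a frozen pretrained LM, the
        # block starts as an identity mapping (Flamingo-style gating, sketched).
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_tokens, visual_tokens):
        # text_tokens: (B, N, D), visual_tokens: (B, M, D)
        attended, _ = self.cross_attn(query=text_tokens,
                                      key=visual_tokens,
                                      value=visual_tokens)
        return self.norm(text_tokens + torch.tanh(self.gate) * attended)

# Merged attention (single-stream) would instead concatenate the two token
# sequences and run ordinary self-attention over the joint sequence:
#   joint = torch.cat([text_tokens, visual_tokens], dim=1)
fused = CrossAttentionFusion()(torch.randn(2, 12, 768), torch.randn(2, 196, 768))
```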

Pretraining Strategies

Alignment Objectives

  1. Language modeling with visual context:

    • Predict next token (causal LM) or masked tokens (MLM) given both text and image features.
    • Extends BERT/GPT objectives to multimodal inputs.
  2. Image–Text Matching (ITM):

    • Binary classification: does this image match this caption?
    • Use [CLS] token representation → binary classifier.
    • Often trained with hard negatives (mismatched but semantically similar pairs); see the ITM sketch after this list.
  3. Image–Text Contrastive Learning (ITC):

    • Learn a joint embedding space for images and text.
    • Given \(N\) image–caption pairs in a batch, treat the \(N\) correct pairs as positives and the \(N^2 - N\) others as negatives.
    • Optimize a symmetric InfoNCE loss: \[ \mathcal{L} = -\frac{1}{2N} \sum_{i=1}^N \left[ \log \frac{\exp(\text{sim}(v_i, t_i)/\tau)}{\sum_{j=1}^N \exp(\text{sim}(v_i, t_j)/\tau)} + \log \frac{\exp(\text{sim}(t_i, v_i)/\tau)}{\sum_{j=1}^N \exp(\text{sim}(t_i, v_j)/\tau)} \right] \] where \(v_i, t_i\) are the image and text embeddings, \(\text{sim}\) is cosine similarity, and \(\tau\) is a temperature parameter; the two terms are the image-to-text and text-to-image directions.
    • CLIP (Radford et al., OpenAI 2021) is the canonical example (see the contrastive-loss sketch after this list).
  4. Masked Image Modeling (MIM):

    • Mask random image patches and predict them from remaining patches + text context.
    • Analogous to MLM in language.
    • Implementation depends on visual encoder (e.g., ViT with MAE-style reconstruction).
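
A minimal sketch of an ITM head (objective 2), assuming a fused multimodal encoder whose [CLS] output is already computed; `ITMHead` and `fused_cls` are illustrative names, not from any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    """Binary classifier over the fused [CLS] representation: match vs. mismatch."""
    def __init__(self, d_model=768):
        super().__init__()
        self.classifier = nn.Linear(d_model, 2)   # logits for {mismatch, match}

    def forward(self, fused_cls):                 # fused_cls: (B, D) [CLS] vectors
        return self.classifier(fused_cls)

# Labels: 1 for an original image-caption pair, 0 for a negative pair; hard
# negatives re-pair the image with a similar but incorrect caption.
logits = ITMHead()(torch.randn(8, 768))
loss = F.cross_entropy(logits, torch.randint(0, 2, (8,)))
```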
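
A minimal sketch of the symmetric image–text contrastive (ITC) loss (objective 3), assuming a batch of \(N\) matched image and text embeddings; this mirrors the CLIP-style formulation above but is not the reference implementation.

```python
import torch
import torch.nn.functional as F

def itc_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched image-caption pairs.

    image_emb, text_emb: (N, D) tensors; the i-th rows form the positive pair,
    and every other pairing in the batch serves as a negative.
    """
    v = F.normalize(image_emb, dim=-1)             # unit norm -> dot product = cosine sim
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (N, N) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_i2t = F.cross_entropy(logits, targets)    # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

loss = itc_loss(torch.randn(32, 512), torch.randn(32, 512))
```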

Notable Models

  • VisualBERT (Li et al., 2019):
    • Single-stream Transformer.
    • Pretrained with MLM + ITM.
  • LXMERT (Tan & Bansal, 2019):
    • Dual-stream with cross-attention.
    • Pretrained with MLM, ITM, and visual feature prediction.
  • CLIP (Radford et al., 2021):
    • Dual encoders (image + text) trained with ITC.
    • Projects both modalities into a shared embedding space.
    • Enables zero-shot transfer to many vision tasks via text prompt engineering (see the sketch after this list).
  • ALIGN (Jia et al., 2021):
    • Similar to CLIP but trained on a larger, noisier corpus of web image–alt-text pairs.
  • Flamingo (Alayrac et al., 2022):
    • Frozen pretrained LM (e.g., Chinchilla) + vision encoder + cross-attention layers.
    • Supports interleaved image–text sequences.
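
A sketch of how a CLIP-style dual encoder performs zero-shot classification via text prompts; the `text_encoder` callable and the prompt template stand in for the model's text tower and are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_emb, class_names, text_encoder,
                       template="a photo of a {}"):
    """Score one image embedding against text prompts built from class names.

    image_emb: (D,) embedding from the image encoder.
    text_encoder: callable mapping a list of strings to a (C, D) tensor
    (hypothetical stand-in for a CLIP-style text tower).
    """
    prompts = [template.format(name) for name in class_names]   # e.g. "a photo of a dog"
    text_emb = text_encoder(prompts)                             # (C, D)
    sims = F.normalize(image_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    probs = sims.softmax(dim=-1)                                 # distribution over classes
    return class_names[probs.argmax().item()], probs

# Usage with dummy encoders standing in for the pretrained towers:
label, probs = zero_shot_classify(torch.randn(512),
                                  ["dog", "cat", "car"],
                                  lambda ps: torch.randn(len(ps), 512))
```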

Benefits of Multimodal Pretraining

  • Improves performance on multimodal tasks and can transfer to unimodal (pure text) tasks.
  • Provides grounding for language models, potentially improving factuality and reasoning about the physical world.
  • Enables zero-shot and few-shot capabilities in vision tasks via text prompts.