Week 9

Published

Wednesday, April 16, 2025

Diffusion Models

  • Definition: Diffusion models are explicit generative models that gradually add noise to data and learn to reverse this process to generate new samples.

  • Core Processes:

    • Forward Process: Systematically adds Gaussian noise to the original data over \(T\) steps.
    • Reverse Process: Learns to denoise step-by-step, generating new samples from noise.
  • Training Objective: The model is trained to predict the noise added at each step, not the denoised data directly.

  • Advantages Over Other Generative Approaches

    • Normalizing Flows: Require invertibility and tractable Jacobians; diffusion models do not have this constraint.
    • Variational Autoencoders (VAEs) and Autoregressive Models: Often produce lower-quality samples than diffusion models.
    • Generative Adversarial Networks (GANs): GANs are difficult to train and prone to mode collapse; diffusion models offer more stable training and higher sample diversity.
    • Hybridization: Diffusion models can be combined with GANs to enhance realism while maintaining stability.
  • Training Methodology

    • Self-supervised Paradigm: The process of adding controlled noise acts as a natural supervision signal.
    • Noise Scheduling:
      • Progressive Approach: Start with minimal noise and gradually increase it.
      • Schedulers: Linear and cosine schedules are common (see the sketch after this list). A linear schedule can turn the image into near-pure noise too quickly, so the later, high-noise steps carry little information and are hard to learn from; a cosine schedule degrades the signal more gradually.
    • Computational Characteristics:
      • Generation requires multiple passes (one per denoising step), similar to autoregressive models.
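
As a sketch of the two schedulers mentioned above, here is a minimal NumPy comparison of a linear and a cosine \(\beta_t\) schedule. The cosine form follows the common \(\bar{\alpha}_t \propto \cos^2\!\big(\tfrac{t/T+s}{1+s}\cdot\tfrac{\pi}{2}\big)\) parameterization; the exact constants are illustrative defaults, not prescribed by the notes.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly increasing noise levels beta_1..beta_T."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Cosine schedule: define alpha_bar(t) via a squared cosine, then recover the betas."""
    t = np.arange(T + 1) / T
    alpha_bar = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]   # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return np.clip(betas, 0.0, max_beta)         # avoid beta_t -> 1 at the very end

T = 1000
for name, betas in [("linear", linear_beta_schedule(T)), ("cosine", cosine_beta_schedule(T))]:
    alpha_bar = np.cumprod(1 - betas)
    # alpha_bar_t is the fraction of signal remaining at step t; the linear schedule
    # drives it toward zero much earlier than the cosine schedule.
    print(name, alpha_bar[T // 2], alpha_bar[-1])
```
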
  • Architecture

    • Model Core: U-Net architecture is used, shared across all timesteps.
    • Time Representation: The timestep is encoded (e.g., sinusoidal encoding) and input to the model.
    • Conditioning Approaches:
      • Classifier Guidance:
        • Adjusts noise prediction using a classifier trained on noisy images.
        • Can be added post-training.
        • Limitations:
          • Requires a classifier trained for all noise levels.
          • Classifier gradients on noisy inputs can themselves be noisy, degrading sample quality.
          • Limited to predefined class sets.
      • Direct Conditioning:
        • Embeds class labels or text directly into the model.
        • More flexible and effective than classifier guidance.
        • Must be implemented during training.
      • Classifier-Free Guidance (CFG):
        • Conditioning signal (e.g., text embedding) is applied only a fraction (e.g., 80%) of the time during training.
        • Model learns both conditional and unconditional denoising.
        • Increases output diversity.
        • Typically uses a CLIP embedding (Contrastive Language-Image Pre-training).
        • A guidance-scale hyperparameter controls how strongly the conditioning is enforced (see the sketch after this list).
        • Drawback: Slower generation, as both conditional and unconditional predictions are needed at each step.
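
To make the guidance-scale hyperparameter concrete, the sketch below shows how classifier-free guidance is typically combined at sampling time: the network is queried once with the conditioning embedding and once with the "null" embedding used when conditioning was dropped during training, and the two noise predictions are mixed. The function and argument names (`eps_model`, `null_cond`) are placeholders, not a specific library's API.

```python
import torch

def cfg_noise_prediction(eps_model, x_t, t, cond, null_cond, guidance_scale=7.5):
    """Classifier-free guidance: eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_cond = eps_model(x_t, t, cond)          # prediction with the conditioning signal
    eps_uncond = eps_model(x_t, t, null_cond)   # prediction with the "empty" conditioning
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Setting the scale to 1 recovers the purely conditional prediction; larger values push samples toward the conditioning at the cost of diversity. The two forward passes per step are exactly the slowdown noted above.
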
  • Efficiency Enhancements

    • Latent Diffusion Models (LDMs):
      • Diffusion is performed in a compressed latent space, not pixel space.
      • Pipeline: Encoder \(\rightarrow\) Diffusion in latent space \(\rightarrow\) Decoder.
      • Focuses on semantic content, not pixel-level details.
      • Key Idea: A Variational Autoencoder (VAE) is pre-trained to map images to a low-dimensional latent space. The VAE is frozen during diffusion model training.
      • Benefits: Operating in latent space reduces dimensionality and computational cost, improving efficiency.
      • Applications: Powers state-of-the-art systems such as Stable Diffusion; diffusion-based generation also underlies DALL-E and, reportedly, Sora (a minimal latent-space training sketch follows below).
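
As a minimal sketch of the frozen-VAE idea (the `vae.encode`/`vae.decode` and `unet(z_t, t)` interfaces are assumed placeholders, not a particular library's API), a latent-diffusion training step differs from pixel-space training only in that the image is first encoded; `alphas_bar` is a precomputed \((T,)\) tensor of \(\bar{\alpha}_t\) values.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encode_to_latent(vae, x):
    """The VAE is frozen: no gradients flow into it during diffusion training."""
    return vae.encode(x)                          # e.g., a 256x256x3 image -> a 32x32x4 latent

def ldm_training_step(unet, vae, x0, alphas_bar, T):
    z0 = encode_to_latent(vae, x0)                # diffusion operates on latents, not pixels
    t = torch.randint(1, T + 1, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alphas_bar[t - 1].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z0 + (1 - ab).sqrt() * eps  # closed-form forward process in latent space
    return F.mse_loss(unet(z_t, t), eps)          # same noise-prediction loss as in pixel space

# At sampling time, run the reverse process entirely on latents,
# then decode once at the end: x0_hat = vae.decode(z0_hat).
```
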
  • Advanced Applications

    • Image-Conditioned Generation:
      • Conditioning on images (e.g., sketches, edge maps, depth maps, poses).
      • ControlNet:
        • Feature modulation approach.
        • Base diffusion model is frozen.
        • Parallel adaptation block with “zero convolution” layers (1x1 conv, zero-initialized).
        • Zero initialization ensures gradual adaptation without disrupting the base model (a minimal sketch follows this list).
      • Photo Relighting: Modifies lighting while preserving content.
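
To illustrate the zero-convolution idea (a simplified sketch, not ControlNet's full architecture, which copies the U-Net encoder and injects features at several resolutions): a zero-initialized 1x1 convolution added residually contributes exactly nothing at initialization, so training starts from the unmodified frozen base model.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weights and bias start at exactly zero."""
    def __init__(self, in_channels, out_channels):
        super().__init__(in_channels, out_channels, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlBranch(nn.Module):
    """Trainable side branch added to a frozen base block."""
    def __init__(self, adapter, channels):
        super().__init__()
        self.adapter = adapter                     # trainable adaptation block (any nn.Module)
        self.zero_out = ZeroConv2d(channels, channels)

    def forward(self, base_features, control_input):
        # At initialization zero_out returns zeros, so base_features pass through unchanged;
        # the control signal blends in gradually as the zero conv's weights move off zero.
        return base_features + self.zero_out(self.adapter(control_input))
```
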
  • Mathematical Foundation

    • Forward Diffusion Process

      • Modeled as a Markov process with \(T\) steps (typically \(T \approx 1000\)).
      • At each step \(t\), Gaussian noise is added: \[ q(x_t|x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I}) \] where \(\beta_t\) is the noise schedule parameter.
      • Define \(\alpha_t = 1 - \beta_t\) and \(\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i\).
      • Using the reparameterization trick, sample \(x_t\) given \(x_0\): \[ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I}) \] \[ q(x_t | x_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) \mathbf{I}) \]
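
The closed-form expression for \(q(x_t|x_0)\) is what makes training cheap: any timestep can be sampled in one shot, without simulating the chain. A minimal sketch (shapes assume image batches; `alphas_bar` holds the precomputed \(\bar{\alpha}_1,\ldots,\bar{\alpha}_T\)):

```python
import torch

def q_sample(x0, t, alphas_bar):
    """Draw x_t ~ q(x_t | x_0) directly via the reparameterization trick.

    x0         -- clean data, shape (B, C, H, W)
    t          -- integer timesteps in {1, ..., T}, shape (B,)
    alphas_bar -- tensor of cumulative products alpha_bar_1..alpha_bar_T, shape (T,)
    """
    eps = torch.randn_like(x0)                     # epsilon ~ N(0, I)
    ab = alphas_bar[t - 1].view(-1, 1, 1, 1)       # broadcast alpha_bar_t over pixels
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # sqrt(ab) * x0 + sqrt(1 - ab) * eps
    return x_t, eps                                # eps becomes the regression target
```
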
    • Reverse Diffusion Process

      • The goal is to learn \(p_\theta(x_{t-1}|x_t)\) to approximate the true reverse process.
      • By Bayes’ rule: \[ q(x_{t-1}|x_t,x_0) = \frac{q(x_t|x_{t-1},x_0)q(x_{t-1}|x_0)}{q(x_t|x_0)} \]
      • All terms are Gaussian, so \(q(x_{t-1}|x_t,x_0)\) is Gaussian with:
        • Mean: \[ \tilde{\mu}_t(x_t,x_0) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 \]
        • Variance: \[ \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t \]
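
A useful intermediate step, since it connects this posterior mean to the sampling formula used later: solving the closed-form forward equation for \(x_0\) and substituting into \(\tilde{\mu}_t\) expresses the mean purely in terms of \(x_t\) and the noise \(\epsilon\),

\[ x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon\right) \quad\Longrightarrow\quad \tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon\right), \]

which is why it suffices for the network to predict \(\epsilon\).
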
  • Training Objective

    • Model \(p_\theta(x_{t-1}|x_t)\) as a Gaussian with:

      • Fixed variance \(\tilde{\beta}_t\) (from the forward process).
      • Learned mean \(\mu_\theta(x_t,t)\) (predicted by a neural network).
    • ELBO Derivation:

      • Maximize the log-likelihood \(\log p_\theta(x_0)\) via the Evidence Lower Bound (ELBO): \[ \log p_\theta(x_0) \geq \mathbb{E}_{q(x_{1:T}|x_0)}\left[\log\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] = \mathcal{L}_\text{ELBO} \]
      • Expand the joint distributions: \[ p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t) \]
      • Rewrite the forward process using Bayes’ rule:
        • Start with the chain rule factorization: \[ q(x_{1:T}|x_0) = q(x_T|x_0) \cdot q(x_{T-1}|x_T, x_0) \cdot q(x_{T-2}|x_{T-1}, x_T, x_0) \cdots q(x_1|x_2, \ldots, x_T, x_0) \]
        • Apply the Markov property: given \(x_{t+1}\) and \(x_0\), \(x_t\) is independent of future states: \[ q(x_t|x_{t+1}, x_{t+2}, \ldots, x_T, x_0) = q(x_t|x_{t+1}, x_0) \]
        • This gives us the key factorization: \[ q(x_{1:T}|x_0) = q(x_T|x_0) \prod_{t=1}^{T-1} q(x_t|x_{t+1}, x_0) \]
      • Substitute and rearrange: \[ \mathcal{L}_\text{ELBO} = \mathbb{E}_q\left[\log\frac{p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1}|x_t)}{q(x_T|x_0) \prod_{t=1}^{T-1} q(x_t|x_{t+1}, x_0)}\right] \] \[ = \mathbb{E}_q\left[\log\frac{p(x_T)}{q(x_T|x_0)} + \log p_\theta(x_0|x_1) + \sum_{t=1}^{T-1}\log\frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t+1}, x_0)}\right] \]
      • Decomposed into three terms: \[ \mathcal{L}_\text{ELBO} = \mathcal{L}_T + \mathcal{L}_0 + \mathcal{L}_{1:T-1} \]
        • Prior Matching Term (\(\mathcal{L}_T\)): \[ \mathcal{L}_T = \mathbb{E}_q\left[\log \frac{p(x_T)}{q(x_T|x_0)}\right] = -D_\text{KL}(q(x_T|x_0) \| p(x_T)) \]
          • For large \(T\), \(q(x_T|x_0) \approx \mathcal{N}(0,\mathbf{I})\) and \(p(x_T) = \mathcal{N}(0,\mathbf{I})\), so this term is negligible.
        • Reconstruction Term (\(\mathcal{L}_0\)): \[ \mathcal{L}_0 = \mathbb{E}_{q(x_1|x_0)}[\log p_\theta(x_0|x_1)] \]
          • Measures how well the final denoising step reconstructs the original data.
        • Denoising Matching Terms (\(\mathcal{L}_{1:T-1}\)): \[ \mathcal{L}_{1:T-1} = \sum_{t=1}^{T-1}\mathbb{E}_q\left[\log \frac{p_\theta(x_t|x_{t+1})}{q(x_t|x_{t+1}, x_0)}\right] = -\sum_{t=1}^{T-1}D_\text{KL}(q(x_t|x_{t+1}, x_0) \| p_\theta(x_t|x_{t+1})) \]
          • Measures how well the learned reverse process matches the true denoising transitions.
    • Simplified Training Objective:

      • If \(p_\theta(x_{t-1}|x_t)\) is Gaussian with the same variance as \(q(x_{t-1}|x_t,x_0)\), the KL divergence in \(\mathcal{L}_{1:T-1}\) reduces to the squared error between means.
      • The training loss becomes: \[ \mathcal{L}_\text{simple} = \mathbb{E}_{t,x_0,\epsilon} \left[ \left\| \epsilon - \epsilon_\theta\!\left( \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t \right) \right\|^2 \right] \] where the network's first argument is exactly \(x_t\) written via the closed-form forward process.
      • \(\epsilon_\theta(x_t, t)\) is the network’s prediction of the noise component.
    • Training Algorithm

      \[ \begin{aligned} &\text{Algorithm: Diffusion Model Training} \\ &\text{while not converged do} \\ &\quad \mathbf{x}_0 \sim q(\mathbf{x}_0) \\ &\quad t \sim \mathcal{U}(\{1, \ldots, T\}) \\ &\quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &\quad \mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t} \epsilon \\ &\quad \text{Take gradient step on } \nabla_{\Theta}\left\|\epsilon-\epsilon_{\Theta}\left(\mathbf{x}_t, t\right)\right\|^2 \\ &\text{end while} \end{aligned} \]
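
A minimal PyTorch rendering of the training algorithm above; the data loader (assumed to yield batches of images), the noise-prediction network `model(x_t, t)`, and the optimizer are placeholders, and `alphas_bar` is a \((T,)\) tensor of \(\bar{\alpha}_t\) values on the same device.

```python
import torch
import torch.nn.functional as F

def train_epoch(model, dataloader, alphas_bar, T, optimizer, device="cuda"):
    model.train()
    for x0 in dataloader:                                            # x0 ~ q(x0)
        x0 = x0.to(device)
        t = torch.randint(1, T + 1, (x0.shape[0],), device=device)   # t ~ Uniform{1, ..., T}
        eps = torch.randn_like(x0)                                   # eps ~ N(0, I)
        ab = alphas_bar[t - 1].view(-1, 1, 1, 1)
        x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                 # closed-form forward step
        loss = F.mse_loss(model(x_t, t), eps)                        # ||eps - eps_theta(x_t, t)||^2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
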

    • Sampling Algorithm

      • Start from \(x_T \sim \mathcal{N}(0, \mathbf{I})\).
      • For \(t = T\) down to \(1\):
        • Predict noise: \(\hat{\epsilon} = \epsilon_\theta(x_t, t)\).
        • Compute mean: \[ \mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\hat{\epsilon}\right) \]
        • If \(t > 1\), sample: \[ x_{t-1} = \mu_\theta(x_t, t) + \sqrt{\tilde{\beta}_t}z, \quad z \sim \mathcal{N}(0, \mathbf{I}) \]
        • Else, set \(x_0 = \mu_\theta(x_1, 1)\).
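
A matching sketch of the sampling loop, assuming the same noise-prediction network `model(x_t, t)` and precomputed tensors `betas`, `alphas`, `alphas_bar`, and `post_var` for \(\beta_t\), \(\alpha_t\), \(\bar{\alpha}_t\), and \(\tilde{\beta}_t\):

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alphas_bar, post_var, device="cuda"):
    T = len(betas)
    x = torch.randn(shape, device=device)                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        t_batch = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps_hat = model(x, t_batch)                              # predict the noise component
        coef = betas[t - 1] / (1 - alphas_bar[t - 1]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t - 1].sqrt()       # mu_theta(x_t, t)
        if t > 1:
            x = mean + post_var[t - 1].sqrt() * torch.randn_like(x)   # add noise scaled by beta_tilde_t
        else:
            x = mean                                             # last step: return the mean as x_0
    return x
```
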
  • Additional Resources

    • What are diffusion models?
    • Tutorial on vision diffusion models
    • Understanding diffusion models

Foundation Models

  • General Concept

    • Train a large network to perform well on a wide range of tasks, often with minimal or no task-specific fine-tuning.
  • Historical Progression

    • First Generation: ELMo, BERT, ERNIE
      • Model is pre-trained, then a task-specific decoder is added.
    • Second Generation: GPT-3
      • Generalized model, fine-tuned for specific tasks.
    • Third Generation: ChatGPT, DeepSeek
      • Zero-shot and few-shot learning; models are instruction-tuned and can generalize to new tasks with little or no additional training.
  • Vision Foundation Models

    • Tokenization: Images are split into patches, which are linearly projected and fed into a transformer.
    • Vision Transformer (ViT, 2020):
      • Linear projection of flattened patches.
      • Uses a transformer encoder.
      • Includes a learnable class embedding token (CLS token) to represent the whole image.
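
A minimal sketch of the patch-tokenization step described above (the patch size, dimensions, and the strided-convolution trick are illustrative; positional embeddings and the transformer encoder itself are omitted):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into patches, linearly project each patch, and prepend a CLS token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x):                                  # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, 196, dim) patch tokens
        cls = self.cls_token.expand(x.shape[0], -1, -1)    # learnable class token
        return torch.cat([cls, tokens], dim=1)             # (B, 197, dim), fed to the encoder
```
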
    • Masked Autoencoder (MAE) Pretraining:
      • Randomly masks patches and trains the model to reconstruct them.
      • Enables joint prediction, depth estimation, and image inpainting. [Verification Needed: “joint prediction” is ambiguous.]
    • SAM (Segment Anything Model, Meta):
      • Promptable segmentation model.
      • Requires large-scale labeled data, often refined with human-in-the-loop annotation.
    • CLIP:
      • Shared embedding space for image and text.
      • Trained with contrastive learning on image-text pairs.
      • Enables dot product similarity for retrieval and zero-shot classification.
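
A sketch of how the shared embedding space is used for zero-shot classification; `image_encoder` and `text_encoder` are placeholders standing in for CLIP's two towers, and the key operation is the dot product between normalized embeddings.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_encoder, text_encoder, image, class_prompts):
    """Score an image against prompts like 'a photo of a dog' via cosine similarity."""
    img = F.normalize(image_encoder(image), dim=-1)           # (1, d), unit-norm image embedding
    txt = F.normalize(text_encoder(class_prompts), dim=-1)    # (num_classes, d), unit-norm text embeddings
    logits = img @ txt.T                                      # dot products = cosine similarities
    return logits.softmax(dim=-1)                             # distribution over the class prompts
```
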
    • DINO:
      • Self-distillation without labels.
      • Student matches the output of a teacher (an EMA of the student).
      • Teacher sees the whole image; student sees a crop.
      • The teacher receives no gradient updates; it is updated only as an exponential moving average of the student, which stabilizes training (see the sketch after this list).
      • Enables extraction of general features with zero supervision.
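
A sketch of the teacher update described above: the teacher's parameters are an exponential moving average (EMA) of the student's and receive no gradients of their own (the momentum value is illustrative).

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.996):
    """teacher <- m * teacher + (1 - m) * student; the teacher is never trained by backprop."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1 - momentum)
```
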
  • Multimodal Models

    • 4M (EPFL) - Massively Multimodal Masked Modeling:
      • Any-to-any model: can handle image-to-text, normals, depth, 3D, etc.
      • All modalities are tokenized, randomly masked, and the model is trained to predict all modalities.
      • Requires knowledge of labels for each modality.
      • Pseudo-labeling can be used, leveraging state-of-the-art models for each task.
    • DreamBooth:
      • Customized text-to-image generation.
      • Fine-tunes a diffusion model on a few examples of a specific subject, associating it with a unique token.
    • Zero1to3:
      • Fine-tunes a diffusion model with 3D data.
      • From a single image, generates multiple views (novel view synthesis).