Week 8

Published

Wednesday, April 9, 2025

Normalizing Flows (NFs)

Core Concepts

  • Definition

    • Normalizing Flows (NFs) are generative models that represent a complex data distribution \(p(x)\) by transforming a simple base distribution \(p_Z(z)\) (e.g., Gaussian) through an invertible and differentiable function \(f: \mathcal{Z} \to \mathcal{X}\), where \(x = f(z)\).
  • Tractable Likelihood

    • NFs allow exact likelihood computation using the change of variables formula. Given \(z = f^{-1}(x)\): \[ p_X(x) = p_Z(z) \left| \det \left( \frac{\partial z}{\partial x} \right) \right| = p_Z(f^{-1}(x)) \left| \det J_{f^{-1}}(x) \right| \]
    • Alternatively, by the inverse function theorem (\(J_f(z) = (J_{f^{-1}}(x))^{-1}\)): \[ p_X(x) = p_Z(z) \left| \det J_f(z) \right|^{-1} \]
  • Requirements for Transformation \(f\)

    • Invertible: \(f^{-1}\) must exist.
    • Differentiable: Both \(f\) and \(f^{-1}\) must be differentiable (i.e., \(f\) is a diffeomorphism).
    • Efficient Jacobian Determinant: The determinant of the Jacobian matrix (\(\det J_f\) or \(\det J_{f^{-1}}\)) must be computationally efficient (ideally \(O(D)\) or \(O(D^2)\) for \(D\) dimensions, not \(O(D^3)\) as in the general case). This is often achieved by using triangular or structured Jacobians.
    • Dimensionality Preservation: The input and output dimensions must match, i.e., \(\dim(x) = \dim(z)\).
  • Composition

    • Complex transformations are constructed by composing multiple simpler invertible layers: \(f = f_L \circ \dots \circ f_1\).
    • The overall log-determinant is the sum of the log-determinants of each layer: \[ \log p_X(x) = \log p_Z(z) + \sum_{i=1}^L \log \left| \det J_{f_i^{-1}}(x_i) \right| \] where \(x_i\) is the intermediate value fed into \(f_i^{-1}\) during the inverse pass (so \(x_L = x\)).
  • Inference vs. Sampling

    • Likelihood Evaluation (Inference): Compute \(z = f^{-1}(x)\) and evaluate \(p_X(x)\) using the above formula.
    • Sampling: Sample \(z \sim p_Z(z)\) and compute \(x = f(z)\).
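
The two directions above can be checked on a tiny example. The sketch below is an added illustration (not from the lecture): it assumes PyTorch, uses a single affine map \(f(z) = e^{a} z + b\) in one dimension, samples with \(x = f(z)\), and evaluates \(\log p_X(x)\) with the change of variables formula.

```python
import torch
from torch.distributions import Normal

# Base distribution p_Z and a single invertible map f(z) = exp(a) * z + b.
base = Normal(loc=torch.tensor(0.0), scale=torch.tensor(1.0))
a, b = torch.tensor(0.5), torch.tensor(-1.0)

def f(z):            # sampling direction: z -> x
    return torch.exp(a) * z + b

def f_inv(x):        # inference direction: x -> z
    return (x - b) * torch.exp(-a)

# Sampling: draw z from the base distribution and push it through f.
z = base.sample((5,))
x = f(z)

# Likelihood evaluation: log p_X(x) = log p_Z(f^{-1}(x)) + log|det J_{f^{-1}}(x)|.
# Here J_{f^{-1}}(x) = exp(-a), so the log-determinant term is simply -a.
log_px = base.log_prob(f_inv(x)) - a

# Sanity check against the analytic answer: x ~ N(b, exp(a)^2).
print(torch.allclose(log_px, Normal(b, torch.exp(a)).log_prob(x)))  # True
```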

Key Building Block: Coupling Layers

  • Concept

    • Partition input dimensions \(x\) into two parts, \(x = (x_a, x_b)\). Transform one part (\(x_a\)) based on the other (\(x_b\)), while leaving \(x_b\) unchanged: \[ y_a = h(x_a; \theta(x_b)) \] \[ y_b = x_b \]
    • \(h\) must be invertible with respect to \(x_a\).
    • \(\theta(x_b)\) can be arbitrarily complex and does not have to be invertible.
  • Example: Affine Coupling Layer

    • \(h(x_a; s, t) = x_a \odot s + t\), where \((s, t) = \theta(x_b)\) (a minimal implementation sketch follows this list).
  • Forward Pass

    • \(y_a = h(x_a; \theta(x_b))\)
    • \(y_b = x_b\)
  • Backward Pass

    • \(x_a = h^{-1}(y_a; \theta(y_b))\)
    • \(x_b = y_b\)
  • Jacobian Matrix

    • The Jacobian is block upper triangular, since \(y_b\) does not depend on \(x_a\): \[ J = \frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial y_a}{\partial x_a} & \frac{\partial y_a}{\partial x_b} \\ 0 & I \end{pmatrix} \]
  • Determinant

    • The determinant is efficient to compute: \[ \det(J) = \det \left( \frac{\partial y_a}{\partial x_a} \right) = \det \left( \frac{\partial h(x_a; \theta(x_b))}{\partial x_a} \right) \]
    • For affine coupling layers, \(\frac{\partial y_a}{\partial x_a} = \mathrm{diag}(s)\), so \(\det(J) = \prod_i s_i\).
  • Conditioning

    • To make the flow conditional on external information \(c\), modify the parameter network: \((s, t) = \theta(x_b, c)\). The Jacobian structure remains unchanged.
  • Transformation Composition

    • The overall transformation is a composition of all the transformations, and the determinants multiply (log-determinants sum).
    • Because each layer leaves one part of the input unchanged, successive layers should swap the roles of the two parts (or use a different feature mask at each step) so that every dimension is eventually transformed.
  • Training

    • Training maximizes the exact log-likelihood of the i.i.d. training samples, which is possible because the model density is explicitly defined.
  • Inference

    • To sample, draw \(z\) from the base distribution and pass it through the transformations.
    • To compute the density of a sample, use invertibility to map \(x\) back to \(z\) and evaluate the change of variables formula.
  • Continuous Normalizing Flows

    • Model arbitrarily complex distributions by defining the transformation as the solution of a continuous-time ODE whose dynamics are parameterized by a neural network.
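
A minimal sketch of the affine coupling layer described above, assuming PyTorch (layer sizes and the parameter network \(\theta\) are illustrative choices, not the lecture's reference implementation). Using \(s = \exp(\cdot)\) keeps the scale positive, so the layer is invertible for any output of \(\theta\).

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Affine coupling: y_a = x_a * exp(s(x_b)) + t(x_b), y_b = x_b."""

    def __init__(self, dim_a: int, dim_b: int, hidden: int = 64):
        super().__init__()
        # theta(x_b) outputs (s, t); it can be arbitrarily complex and need not be invertible.
        self.theta = nn.Sequential(
            nn.Linear(dim_b, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * dim_a),
        )

    def forward(self, x_a, x_b):
        s, t = self.theta(x_b).chunk(2, dim=-1)
        y_a = x_a * torch.exp(s) + t
        log_det = s.sum(dim=-1)             # log|det J| = sum_i s_i (diagonal Jacobian block)
        return y_a, x_b, log_det

    def inverse(self, y_a, y_b):
        s, t = self.theta(y_b).chunk(2, dim=-1)
        x_a = (y_a - t) * torch.exp(-s)     # exact inverse, since theta(y_b) = theta(x_b)
        return x_a, y_b
```

Stacking several such layers while swapping the roles of \(x_a\) and \(x_b\) (or changing the mask) between layers ensures every dimension is transformed; a conditional variant would simply concatenate \(c\) to the input of the parameter network.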

Architectural Patterns (e.g., RealNVP, GLOW)

  • General Patterns

    • The main differences in architectures are in the choice of \(h\) and how variables are split.
  • NICE Model

    • An early flow-based model that uses an additive coupling layer, splitting the variables as \(x_{1:d}\) and \(x_{d+1:D}\), where \(D\) is the input dimension. The additive coupling is volume preserving (its Jacobian determinant is 1), so NICE adds a final diagonal scaling layer (a worked Jacobian follows this list).
  • Multi-Scale Architecture

    • Employs sequences of blocks operating at different spatial resolutions.
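
For contrast with the affine case, the additive coupling used in NICE has an especially simple Jacobian. Written in the notation above (this derivation is an added illustration):

\[ y_{1:d} = x_{1:d}, \qquad y_{d+1:D} = x_{d+1:D} + m(x_{1:d}) \]

\[ J = \frac{\partial y}{\partial x} = \begin{pmatrix} I & 0 \\ \frac{\partial m}{\partial x_{1:d}} & I \end{pmatrix}, \qquad \det(J) = 1 \]

The layer therefore contributes nothing to the log-determinant, which is why NICE relies on a final diagonal scaling layer to adjust volume.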

RealNVP

  • Utilizes a combination of spatial checkerboard pattern, channel-wise masking, and affine mapping.
  • The split operation partitions the input data using a spatial checkerboard pattern, followed by a squeeze operation and channel-wise masking.
  • The squeeze operation reduces the spatial dimensions while increasing the number of channels. For an input tensor of size \(W \times H \times C\), the squeeze operation transforms it to \(W/2 \times H/2 \times 4C\) by dividing the input into \(2 \times 2\) subsquares and assigning elements to different channels in a clockwise rotation (a reshape-based sketch follows this list).
  • The coupling layer uses an affine mapping: \[ y_A = x_A \odot \exp(s(x_B)) + t(x_B) \] \[ y_B = x_B \] where \(s\) and \(t\) can be arbitrarily complex (e.g., neural networks), \(\odot\) denotes element-wise product, and \(y_A\), \(y_B\) are the resulting partitions.
  • The Jacobian is triangular, so its log-determinant is efficiently computed as the sum of the scale outputs, \(\sum_j s(x_B)_j\).
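
One way to realize the squeeze operation is a space-to-depth reshape, sketched below (an added illustration assuming PyTorch and channels-first tensors; the exact channel ordering of the \(2 \times 2\) elements is an implementation detail and may differ from RealNVP's "clockwise" assignment).

```python
import torch

def squeeze2x2(x: torch.Tensor) -> torch.Tensor:
    """Squeeze (B, C, H, W) -> (B, 4C, H/2, W/2) by folding each 2x2 spatial block into channels."""
    B, C, H, W = x.shape
    x = x.view(B, C, H // 2, 2, W // 2, 2)
    x = x.permute(0, 1, 3, 5, 2, 4)              # (B, C, 2, 2, H/2, W/2)
    return x.reshape(B, 4 * C, H // 2, W // 2)

def unsqueeze2x2(y: torch.Tensor) -> torch.Tensor:
    """Exact inverse of squeeze2x2."""
    B, C4, H2, W2 = y.shape
    y = y.reshape(B, C4 // 4, 2, 2, H2, W2)
    y = y.permute(0, 1, 4, 2, 5, 3)              # (B, C, H/2, 2, W/2, 2)
    return y.reshape(B, C4 // 4, 2 * H2, 2 * W2)

x = torch.randn(1, 3, 8, 8)
print(torch.equal(unsqueeze2x2(squeeze2x2(x)), x))   # True: the operation is invertible
```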

GLOW

  • Introduced invertible \(1 \times 1\) convolutions and affine coupling layers.

  • Each flow step consists of the following (sketches of the first two components appear after this list):

    1. Activation Normalization (ActNorm)
      • Normalizes each input channel by learning a per-channel scale \(s\) and bias \(b\).
      • Forward: \(y_{i,j} = s \odot x_{i,j} + b\)
      • Reverse: \(x_{i,j} = (y_{i,j} - b)/s\)
      • Log-determinant: \(H \cdot W \cdot \sum \log(|s|)\)
    2. Invertible \(1 \times 1\) Convolution
      • Generalizes permutation in the channel dimension.
      • The convolution weight matrix \(W \in \mathbb{R}^{C \times C}\) is initialized as a random rotation matrix with \(\det(W) = 1\).
      • To compute the determinant efficiently, \(W\) is parameterized using LU decomposition: \[ W = P \cdot L \cdot (U + \mathrm{diag}(s)) \] where \(P\) is a fixed permutation matrix, \(L\) is lower triangular with ones on the diagonal, \(U\) is upper triangular, and \(s\) is a vector.
      • Log-determinant: \(H \cdot W \cdot \sum \log |s|\)
    3. (Conditional) Coupling Layer
      • As in RealNVP, but with channel-wise splitting.
      • \(x_A, x_B = \mathrm{split}(x)\)
      • \((\log s, t) = \mathrm{NN}(x_B)\)
      • \(s = \exp(\log s)\)
      • \(y_A = s \odot x_A + t\)
      • \(y_B = x_B\)
      • \(y = \mathrm{concat}(y_A, y_B)\)
  • Split Operation

    • After a block, a fraction of dimensions (channels) can be split off and passed directly to the latent space \(z\), reducing computational cost in subsequent layers.
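
Below is a compact sketch of the first two flow-step components, ActNorm and the LU-parameterized invertible \(1 \times 1\) convolution (an added illustration assuming a recent PyTorch with `torch.linalg.lu`; GLOW's data-dependent ActNorm initialization is omitted, and names are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActNorm(nn.Module):
    """Per-channel scale and bias; log|det J| = H * W * sum(log|s|)."""
    def __init__(self, num_channels: int):
        super().__init__()
        self.log_s = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        B, C, H, W = x.shape
        y = x * torch.exp(self.log_s) + self.bias
        return y, (H * W * self.log_s.sum()).expand(B)

class InvConv1x1(nn.Module):
    """Invertible 1x1 convolution with W = P L (U + diag(s)), stored via its LU factors."""
    def __init__(self, num_channels: int):
        super().__init__()
        # Initialize from a random orthogonal (rotation-like) matrix and factor it once.
        w0 = torch.linalg.qr(torch.randn(num_channels, num_channels)).Q
        P, L, U = torch.linalg.lu(w0)
        s = torch.diagonal(U)
        self.register_buffer("P", P)                        # fixed permutation matrix
        self.register_buffer("sign_s", torch.sign(s))       # fixed signs of the diagonal
        self.register_buffer("eye", torch.eye(num_channels))
        self.L = nn.Parameter(torch.tril(L, -1))             # strictly lower triangular part
        self.U = nn.Parameter(torch.triu(U, 1))              # strictly upper triangular part
        self.log_s = nn.Parameter(torch.log(torch.abs(s)))   # log-magnitudes of diag(s)

    def forward(self, x):
        B, C, H, W = x.shape
        L = torch.tril(self.L, -1) + self.eye
        U = torch.triu(self.U, 1) + torch.diag(self.sign_s * torch.exp(self.log_s))
        weight = self.P @ L @ U                               # reassemble W
        y = F.conv2d(x, weight.view(C, C, 1, 1))
        return y, (H * W * self.log_s.sum()).expand(B)        # H * W * sum(log|s|)
```

The coupling layer that completes the step is the affine coupling sketched earlier, applied with a channel-wise split.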

Notable NF Models

  • GLOW

    • Introduced invertible \(1 \times 1\) convolutions, achieving high-quality image generation for NFs at the time.
  • SRFlow

    • Uses a conditional NF for super-resolution, modeling \(p(x_{HR} | x_{LR})\).
  • StyleFlow

    • Enables controlled, disentangled editing of images by applying a conditional NF to the latent space (\(\mathcal{W}\) or \(\mathcal{W}+\)) of a pre-trained StyleGAN generator.
    • Learns \(w' = f(w; a_{target})\) where \(w\) is the original StyleGAN latent, \(a_{target}\) are desired attributes, and \(w'\) is the edited latent.
    • Training uses triplets \(\{w, G(w), A(G(w))\}\), where \(G\) is the StyleGAN generator and \(A\) is an attribute predictor.
    • The “forward” pass \(w \to w'\) performs the edit. The “reverse” pass \(w = f^{-1}(w'; a_{target})\) is also possible.

Conditional and Multimodal Flows

  • Conditional NFs

    • Model \(p(x|c)\) by making transformations dependent on conditioning variable \(c\). Used in SRFlow, StyleFlow, etc.
  • C-Flows (Conditional Flows for Cross-Domain Generation)

    • Link multiple data modalities (e.g., images \(x_1\), point clouds \(x_2\)) through a shared latent space \(z\).
    • Learn flows \(f_1: \mathcal{Z} \to \mathcal{X}_1\) and \(f_2: \mathcal{Z} \to \mathcal{X}_2\), each mapping the shared latent space to one modality (consistent with \(x = f(z)\)).
    • Allows generating data in one modality conditioned on another, e.g., generate \(x_2\) from \(x_1\) by computing \(z = f_1^{-1}(x_1)\) and then sampling \(x_2 = f_2(z)\).
    • Training can be joint or conditional.
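
A schematic of the cross-domain generation step described above, assuming two already-trained flow objects `f1` and `f2` (hypothetical names) that expose the sampling direction `forward` (\(z \to x\)) and its inverse `inverse` (\(x \to z\)), matching the convention \(x = f(z)\).

```python
import torch

def translate(x1: torch.Tensor, f1, f2) -> torch.Tensor:
    """Generate a sample in modality 2 conditioned on a sample from modality 1
    by passing through the shared latent space: z = f1^{-1}(x1), x2 = f2(z)."""
    z = f1.inverse(x1)    # encode x1 into the shared latent space
    x2 = f2.forward(z)    # decode the latent with the other modality's flow
    return x2
```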

Limitations

  • Dimensionality Preservation

    • Input and output dimensions must match, which can be computationally demanding for high-dimensional data, although techniques like splitting and multi-scale architectures mitigate this.
  • Sample Quality

    • While offering tractable likelihoods, NFs sometimes lag behind Generative Adversarial Networks (GANs) and Diffusion Models in terms of photorealism for complex image generation tasks.
  • Topology

    • Basic NFs cannot change the topology of the space; they are diffeomorphisms. [May not be a practical limitation for many tasks].

Applications in Computer Vision

  • Super-Resolution

    • Learns a distribution over plausible high-resolution images consistent with a given low-resolution input (as in SRFlow).
  • Disentanglement

    • Enables disentangled representation learning.
  • Multimodal Modeling

    • Models joint or conditional distributions over multiple modalities.
  • 3D Shape Modeling

    • Models complex 3D shapes.
  • 3D Pose Estimation

    • Models distributions over 3D poses.
  • Regularization

    • Provides explicit log-likelihoods for regularization in other models.