Part 1 - Week 1

Ryan Cotterell
Published: Tuesday, February 18, 2025

Measure Theory Recap

  • It is not trivial to define a language model over an uncountable set of sequences.
  • We need some measure-theoretic tools to avoid paradoxes.
  • Most technical papers aren’t rigorous enough and assume things will work out.
  • A probability space is a triple \((\Omega, \mathcal{F}, \mathbb{P})\), where \(\Omega\) is the sample space (the set of outcomes), \(\mathcal{F}\) is a sigma-algebra over \(\Omega\) (the measurable sets, i.e., the events), and \(\mathbb{P}\) is a probability measure.
    • Definition: Sigma-algebra. \(\mathcal{F} \subseteq 2^\Omega\) is a sigma-algebra if:
      • \(\emptyset \in \mathcal{F}\).
      • \(A \in \mathcal{F} \implies A^c \in \mathcal{F}\).
      • \(A_1, A_2, \ldots\) is a countable collection in \(\mathcal{F} \implies \bigcup_{i=1}^\infty A_i \in \mathcal{F}\).
    • Examples: The trivial sigma-algebra \(\{\emptyset, \Omega\}\) or the whole power set (discrete sigma-algebra).
  • Definition: Probability measure over a measurable space \((\Omega, \mathcal{F})\) is a function \(\mathbb{P}: \mathcal{F} \to [0, 1]\) such that:
    • \(\mathbb{P}(\Omega) = 1\).
    • If \(A_1, A_2, \ldots\) is a countable collection of pairwise disjoint sets in \(\mathcal{F}\), then \(\mathbb{P}(\bigcup_{i=1}^\infty A_i) = \sum_{i=1}^\infty \mathbb{P}(A_i)\); this property is called sigma-additivity (countable additivity).
  • When \(\Omega\) is uncountable, we usually cannot take \(\mathcal{F} = 2^\Omega\), because it is then provably impossible (assuming the axiom of choice) to define a probability measure on every subset that has the desired properties.
  • Definition: Algebra. Like a sigma-algebra, but closed only under finite unions (it still contains \(\emptyset\) and is closed under complements).
  • Definition: Probability pre-measure. Like a probability measure, but defined on an algebra, so sigma-additivity is required only for those countable unions of disjoint sets that happen to lie in the algebra.
  • Carathéodory’s extension theorem: It suffices to construct an algebra over the set together with a probability pre-measure; the pre-measure then extends to a probability measure on the sigma-algebra generated by the algebra, and this extension is unique.
  • Definition: Random variable. A function \(X: \Omega \to \mathbb{R}\) between two measurable spaces \((\Omega, \mathcal{F})\) and \((\mathbb{R}, \mathcal{B})\) (Borel sigma-algebra) is a random variable or a measurable mapping if for all sets \(B \in \mathcal{B}\), we have \(X^{-1}(B) \in \mathcal{F}\).
  • Each random variable defines a new pushforward probability measure \(\mathbb{P}_X\) on \((\mathbb{R}, \mathcal{B})\) such that \(\mathbb{P}_X(B) = \mathbb{P}(X^{-1}(B))\) for all \(B \in \mathcal{B}\).
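
Below is a minimal sketch of these definitions on a finite sample space, where we may safely take \(\mathcal{F} = 2^\Omega\): a probability measure, sigma-additivity on disjoint events, and the pushforward measure induced by a random variable. The names (`Omega`, `X`, the die example) are illustrative, not from the lecture.

```python
# Finite sample space: a fair six-sided die. Since Omega is finite,
# we may take the sigma-algebra to be the full power set 2^Omega.
Omega = {1, 2, 3, 4, 5, 6}
P = {omega: 1 / 6 for omega in Omega}      # probability of each outcome

def measure(event):
    """P(A) for an event A (here: any subset of Omega)."""
    return sum(P[omega] for omega in event)

# A random variable X: Omega -> R, here the parity of the roll.
def X(omega):
    return omega % 2

def pushforward(B):
    """Pushforward measure P_X(B) = P(X^{-1}(B)) for a set B of values of X."""
    preimage = {omega for omega in Omega if X(omega) in B}
    return measure(preimage)

# Sigma-additivity on disjoint events: P({1,2}) + P({3,4}) = P({1,2,3,4}).
assert abs(measure({1, 2}) + measure({3, 4}) - measure({1, 2, 3, 4})) < 1e-12
print(pushforward({0}), pushforward({1}))  # 0.5 0.5
```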

Language Models: Distributions over Strings

  • We take inspiration from formal language theory, which works with sets of structures; the simplest such structure is a string.
  • Alphabet: Finite non-empty set of symbols \(\Sigma\).
  • A string over an alphabet \(\Sigma\) is a finite sequence of symbols from \(\Sigma\), denoted as \(s = s_1 s_2 \ldots s_n\) where \(s_i \in \Sigma\).
  • A string is also called a word.
  • \(|s|\) is the length of the string.
  • Only one string of length 0, the empty string \(\epsilon\).
  • New strings are formed by concatenation: the concatenation of \(s_1\) and \(s_2\) is the string \(s_1 s_2\) of length \(|s_1| + |s_2|\); concatenation is associative.
  • Concatenation with \(\epsilon\) is the identity, i.e., \(\epsilon\) is the unit; together with associativity, this makes \(\Sigma^*\) with concatenation a monoid.
  • Kleene star: \(\Sigma^* = \bigcup_{n=0}^\infty \Sigma^n\), where \(\Sigma^n = \Sigma \times \Sigma \times \ldots \times \Sigma\) (\(n\) times) and \(\Sigma^0 = \{\epsilon\}\).
  • We also define \(\Sigma^+ = \bigcup_{n=1}^\infty \Sigma^n\), which is the set of all non-empty strings over \(\Sigma\).
  • We also define the set of all infinite strings over \(\Sigma\) as \(\Sigma^\infty = \Sigma \times \Sigma \times \ldots\) (infinitely many times).
  • In computer science (CS), strings are usually taken to be finite, though not in every context.
  • The Kleene star contains all finite strings, including the empty string, and is countably infinite (see the enumeration sketch after this list).
  • The set of infinite strings is uncountable for alphabets with more than one symbol.
  • Definition: Language. Let \(\Sigma\) be an alphabet, then a language over \(\Sigma\) is a subset of \(\Sigma^*\).
  • If not specified, we usually assume the language is the whole Kleene star.
  • Subsequence: A string \(s\) is a subsequence of a string \(t\) if we can obtain \(s\) by deleting some symbols from \(t\) without changing the order of the remaining symbols.
  • Substring: A contiguous subsequence.
  • Prefix and suffix are special cases of substrings that share the start or end of the string.
  • Prefix notation: \(y_{<n}\) denotes the prefix \(y_1 y_2 \ldots y_{n-1}\) of a string \(y\), i.e., its prefix of length \(n-1\).
  • We write \(y \triangleleft y'\) to denote that \(y\) is a suffix of \(y'\).
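
To make the countability of \(\Sigma^*\) concrete (as referenced above), here is a short sketch that enumerates \(\Sigma^*\) in length-lexicographic order; the toy alphabet and names are illustrative assumptions, not from the notes.

```python
from itertools import count, product

SIGMA = ("a", "b")                # toy alphabet; any finite non-empty set works

def kleene_star(alphabet):
    """Yield every string in Sigma* exactly once, in length-lexicographic order."""
    for n in count(0):            # lengths 0, 1, 2, ...  (Sigma^0 = {epsilon})
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)

gen = kleene_star(SIGMA)
print([next(gen) for _ in range(7)])   # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
```

Every finite string appears at some finite index of this enumeration, which is exactly what makes \(\Sigma^*\) countably infinite; the infinite strings in \(\Sigma^\infty\) never appear.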

Definition of a Language Model

  • Now we can write the definition of a language model.
  • Definition: Let \(\Sigma\) be an alphabet; a language model is a discrete probability distribution \(p_\mathrm{LM}\) over \(\Sigma^*\) (a concrete example follows this list).
  • Definition: The weighted language of \(p_\mathrm{LM}\) is \(L(p_\mathrm{LM}) = \{(s, p_\mathrm{LM}(s)) \mid s \in \Sigma^* \}\).
  • We haven’t yet said how to represent such a distribution.
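
Even before discussing general representations, here is a concrete example (an illustrative construction, not from the notes) of a language model over \(\Sigma = \{a, b\}\): pick a length \(n\) with probability \(2^{-(n+1)}\), then pick a string uniformly from \(\Sigma^n\), i.e., \(p_\mathrm{LM}(\mathbf{y}) = 2^{-(|\mathbf{y}|+1)} \, |\Sigma|^{-|\mathbf{y}|}\). The sketch below checks numerically that the truncated mass approaches 1.

```python
from itertools import product

SIGMA = ("a", "b")

def p_lm(y):
    # p_LM(y) = 2^{-(|y|+1)} * |Sigma|^{-|y|}; summing over all strings of a
    # given length n gives 2^{-(n+1)}, and sum_n 2^{-(n+1)} = 1, so this is a
    # proper distribution over Sigma*.
    return 2.0 ** -(len(y) + 1) * len(SIGMA) ** -len(y)

# Sum the probability mass of all strings up to length N.
N = 15
mass = sum(p_lm("".join(t)) for n in range(N + 1) for t in product(SIGMA, repeat=n))
print(mass)    # 1 - 2^{-(N+1)} ~ 0.99998, approaching 1 as N grows
```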

Global and Local Normalization

  • By definition, a language model is always tight.
  • So when we speak of non-tight language models, which appear often in practice, we mean distributions whose total mass over \(\Sigma^*\) is less than one; technically such a model is not a language model at all, and “non-tight” acts as a non-intersective adjective.
  • Such non-tight models do arise in practice, for example from recurrent neural networks (RNNs) trained on data.
  • Depending on whether we model the probability of the whole string or the next symbol, we can have global or local normalization.
  • The BOS token is a special token that indicates the start of a sequence; we set \(y_0 = \mathrm{BOS}\).
  • Augmented alphabet: \(\bar{\Sigma} = \Sigma \cup \{\mathrm{EOS}\}\).
  • The Kleene star of this augmented alphabet is \(\bar{\Sigma}^*\).
  • Models that leak probability mass to infinite sequences will be called sequence models (a numeric illustration of such leakage follows this list).
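
To illustrate what “leaking probability mass to infinite sequences” means, consider a hypothetical sequence model over \(\Sigma = \{a\}\) whose EOS probability shrinks with the length of the history, \(p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) = 2^{-(|\mathbf{y}|+2)}\); this particular model is an assumption chosen only for illustration.

```python
# EOS probability after a history of length t (illustrative choice).
def p_eos(t):
    return 2.0 ** -(t + 2)

# Total probability assigned to the finite strings a^n EOS, for n = 0..N.
N = 60
mass, survive = 0.0, 1.0          # survive = P(no EOS emitted in the first t steps)
for t in range(N + 1):
    mass += survive * p_eos(t)    # terminate exactly at step t
    survive *= 1.0 - p_eos(t)     # keep generating "a"

print(mass)   # ~0.42: the remaining ~0.58 leaks to the infinite string aaa...
```

Every conditional distribution here is perfectly normalized, yet the total mass over \(\Sigma^*\) stays below 1, so this is a sequence model but not a (tight) language model.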

Globally Normalized Language Model or Energy Model

  • Definition: An energy function is a function \(\hat{p}: \Sigma^* \to \mathbb{R}\).
  • Inspired by statistical physics, an energy function can be used to define a probability distribution by exponentiating its negative value and normalizing.
  • Definition: Globally normalized language model is a function \(p_\mathrm{LM}: \Sigma^* \to [0, 1]\) such that: \[p_\mathrm{LM}(s) = \frac{e^{-\hat{p}(s)}}{\sum_{s' \in \Sigma^*} e^{-\hat{p}(s')}}.\]
  • They are easy to define, but the normalization constant is hard to compute (a truncated-sum sketch follows this list).
  • Definition: An energy function is normalizable if its normalizer (the sum in the denominator above) is finite; otherwise the induced model is ill-defined.
  • Any normalizable energy function induces a language model over \(\Sigma^*\).
  • In general, it is undecidable whether an energy function is normalizable; even when it is, computing the normalizer can be computationally infeasible, and sampling from the distribution is also hard.
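
Here is a minimal sketch of a globally normalized model with a toy energy function (an illustrative assumption, not from the lecture); since the exact normalizer is an infinite sum over \(\Sigma^*\), the sketch approximates it by truncating at a maximum string length.

```python
import math
from itertools import product

SIGMA = ("a", "b")

def energy(s):
    # Toy energy: longer strings and strings with more "b"s get higher energy.
    # The induced model is normalizable because the per-length mass decays
    # geometrically: (e^{-1.5} * (1 + e^{-0.5}))^n with ratio < 1.
    return 1.5 * len(s) + 0.5 * s.count("b")

def truncated_normalizer(max_len):
    """Approximate Z = sum over Sigma* of exp(-energy) by summing up to max_len."""
    return sum(
        math.exp(-energy("".join(t)))
        for n in range(max_len + 1)
        for t in product(SIGMA, repeat=n)
    )

Z = truncated_normalizer(15)
p = lambda s: math.exp(-energy(s)) / Z    # approximate p_LM(s)
print(p(""), p("a"), p("ab"))
```

For energy functions given by an arbitrary program, we cannot in general even decide whether such a sum converges, which is the undecidability issue mentioned above.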

Locally Normalized Models

  • We need to know when a string terminates, so we work with the augmented alphabet \(\bar{\Sigma} = \Sigma \cup \{\mathrm{EOS}\}\).
  • Definition: A (locally normalized) sequence model \(p_\mathrm{SM}\) over an alphabet \(\Sigma\) is a collection of conditional probability distributions \(p_\mathrm{SM}(y \mid \mathbf{y})\), where \(\mathbf{y} \in \Sigma^*\) is the history and \(y \in \Sigma\) is the next symbol.
  • A sequence model can also be defined over \(\bar{\Sigma}\).
  • Definition: Locally normalized language model is a sequence model over \(\bar{\Sigma}\) such that: \[p_\mathrm{LM}(\mathbf{y}) = p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) \prod_{i=1}^{T} p_\mathrm{SM}(y_i \mid y_{<i}),\] for all \(\mathbf{y} \in \Sigma^*\), with \(|\mathbf{y}| = T\).
  • The model is tight if \(1 = \sum_{\mathbf{y} \in \Sigma^*} p_\mathrm{LM}(\mathbf{y})\).
  • So again, a locally normalized (LN) language model is not necessarily a language model; local normalization is about how the distribution is written, not a guarantee that it is tight.
  • When can an LM be locally normalized?
  • Answer: Every language model can be locally normalized.
  • Definition: Prefix probability. \(\pi(\mathbf{y}) = \sum_{\mathbf{y}' \in \Sigma^*} p_\mathrm{LM}(\mathbf{y} \mathbf{y}')\).
  • This is the probability that \(\mathbf{y}\) is a prefix of a string drawn from the model, i.e., the cumulative probability of all strings that start with \(\mathbf{y}\) (including \(\mathbf{y}\) itself, since \(\mathbf{y}' = \epsilon\) is allowed).
  • Theorem: Let \(p_\mathrm{LM}\) be a language model. Then there exists a locally normalized model such that for all \(\mathbf{y} \in \Sigma^*\) with \(|\mathbf{y}| = T\), \(p_\mathrm{LM}(\mathbf{y}) = p_\mathrm{LN}(\mathbf{y}) = p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) \prod_{i=1}^{T} p_\mathrm{SM}(y_i \mid y_{<i})\).
  • Sketch of proof: For \(\pi(\mathbf{y}) > 0\), we define \(p_\mathrm{SM}(y \mid \mathbf{y}) = \frac{\pi(\mathbf{y} y)}{\pi(\mathbf{y})}\) and \(p_\mathrm{SM}(\mathrm{EOS} \mid \mathbf{y}) = \frac{p_\mathrm{LM}(\mathbf{y})}{\pi(\mathbf{y})}\) (see the numeric sketch after this list).
  • Writing out \(p_\mathrm{LN}(\mathbf{y})\) for \(\pi(\mathbf{y}) > 0\), the prefix probabilities telescope and we are left with \(\frac{p_\mathrm{LM}(\mathbf{y})}{\pi(\epsilon)} = p_\mathrm{LM}(\mathbf{y})\), since \(\pi(\epsilon) = 1\).
  • And for \(\pi(\mathbf{y}) = 0\), we also have \(p_\mathrm{LM}(\mathbf{y}) = 0\) and \(p_\mathrm{LN}(\mathbf{y}) = 0\), so the construction is well-defined.
  • Locally normalized models (LNMs) are very common in natural language processing (NLP), even though they are not necessarily language models.
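
The following minimal sketch works through the construction in the proof on a toy language model with finite support (the distribution itself is an illustrative assumption): it computes prefix probabilities \(\pi\), derives the conditionals \(p_\mathrm{SM}\), and checks that the product recovers \(p_\mathrm{LM}\) exactly.

```python
# Toy language model with finite support over Sigma = {"a", "b"}.
p_lm = {"": 0.1, "a": 0.3, "ab": 0.2, "b": 0.25, "ba": 0.15}
EOS = "<eos>"

def prefix_prob(y):
    """pi(y): total probability of all strings that have y as a prefix."""
    return sum(p for s, p in p_lm.items() if s.startswith(y))

def p_sm(symbol, history):
    """Conditionals of the induced locally normalized model."""
    pi = prefix_prob(history)
    if pi == 0.0:
        return 0.0                              # arbitrary choice; never reached
    if symbol == EOS:
        return p_lm.get(history, 0.0) / pi      # p_SM(EOS | y) = p_LM(y) / pi(y)
    return prefix_prob(history + symbol) / pi   # p_SM(y' | y) = pi(y y') / pi(y)

def p_ln(y):
    """p_LN(y) = p_SM(EOS | y) * prod_i p_SM(y_i | y_{<i})."""
    prob = 1.0
    for i, sym in enumerate(y):
        prob *= p_sm(sym, y[:i])
    return prob * p_sm(EOS, y)

# The locally normalized model reproduces the original language model.
for s, prob in p_lm.items():
    assert abs(p_ln(s) - prob) < 1e-9
print("p_LN matches p_LM on all strings in the support")
```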