Part 1 - Week 2

Ryan Cotterell
Published

Tuesday, February 25, 2025

Tight Language Models

  • Locally normalized language models can be non-tight, leaking probability mass to infinite sequences.
  • The sum over all sequences of the LN model may not sum to one, but each conditional of \(p_\mathrm{SM}\) is normalized.
  • Tight is sometimes called consistent or proper.

  • In the example (a) this happens since the only sequences that end with EOS are those that don’t produce \(b\) (description of non-tight 2-gram model where sum = 1/3 <1).
  • It is related to the fact that the probability of EOS is 0 for some sequences, but making \(p(\mathrm{EOS} \mid y_{<i}) > 0\) isn’t sufficient nor necessary.

Defining the Probability Measure of an LNM

  • We want to investigate what \(p_\mathrm{LN}\) is when it is not an LM.
  • We have to be careful since the set of infinite sequences is uncountable.
  • We can define them as probability measures over the union of the set of finite and infinite sequences.
  • Definition: Sequence model is a probability space over \(\Sigma^* \cup \Sigma^\infty\).
  • Definition: Language model is a sequence model over \(\Sigma^*\) or an SM which \(\mathbb{P}(\Sigma^\infty) = 0\).

Defining an Algebra over \(\bar{\Sigma}^\infty\) (Step 1)

  • An LNM produces conditionals over the augmented alphabet \(\bar{\Sigma}\).
  • We will construct a sigma-algebra over \(\bar{\Sigma}^\infty\) (includes both finite and infinite sequences with EOS multiple times).
  • Then, based on \(p_\mathrm{LN}\), we will find its pre-measure that is consistent with \(p_\mathrm{LN}\).
  • Definition: Cylinder set of rank \(k\), set of infinite strings that share their \(k\)-prefix where \(\mathcal{H} \subseteq \bar{\Sigma}^k\). \[\bar{C}(\mathcal{H}) \stackrel{\mathrm{def}}{=} \{\mathbf{y} \boldsymbol{\omega} : \mathbf{y} \in \mathcal{H}, \boldsymbol{\omega} \in \bar{\Sigma}^\infty\}.\]
  • Collection of all rank \(k\) cylinder sets is denoted as \(\mathcal{C}_k = \{\bar{C}(\mathcal{H}) \mid \mathcal{H} \in \mathcal{P}(\bar{\Sigma}^k)\}\).
  • This is a set of sets, so the elements of this set are cylinder sets.
  • This sequence is increasing, so \(\mathcal{C}_1 \subseteq \mathcal{C}_2 \subseteq \ldots\).
  • Definition: The collection of all cylinder sets on \(\Omega\) is \(\mathcal{C} = \bigcup_{k=1}^{\infty} \mathcal{C}_k\).
  • This is an algebra over \(\bar{\Sigma}^\infty\).

Defining a Pre-measure (Step 2)

  • We are constructing a pre-measure \(\mathbb{P}_0\) for the cylinder set \(\bar{C}\) given an LNM \(p_\mathrm{LN}\), and any cylinder \(\bar{C}(\mathcal{H}) \in \mathcal{C}\). \[\mathbb{P}_0(\bar{C}(\mathcal{H})) \stackrel{\mathrm{def}}{=} \sum_{\bar{\mathbf{y}} \in \mathcal{H}} p_\mathrm{LN}(\bar{\mathbf{y}}).\]
  • We measure this cylinder by summing over all the finite sequences that generate it.
  • Where \(p_\mathrm{LN}(\bar{\mathbf{y}}) = \prod_{i=1}^{T} p_\mathrm{LN}(\bar{y}_i \mid \bar{\mathbf{y}}_{<i})\).
  • The same cylinder set can have different generator sets, so we have to prove that the function will have the same value for all of them.
  • The proofs are skipped; this is just an outline.

Extending Pre-measure into a Measure (Step 3)

  • For this, we will use the Carathéodory extension theorem.
  • Theorem: Given an algebra \(\mathcal{A}\) over a set \(\Omega\), if \(\mathbb{P}_0\) is a pre-measure on \(\mathcal{A}\), then there exists a probability space \((\Omega, \mathcal{F}, \mathbb{P})\) such that \(\mathcal{A} \subseteq \mathcal{F}\) and \(\mathbb{P}|_{\mathcal{A}} = \mathbb{P}_0\). Furthermore, the sigma-algebra \(\mathcal{F}\) depends only on \(\mathcal{A}\) and is minimal and unique, which we denote \(\sigma(\mathcal{A})\), and the probability measure is unique.
  • Using this, we get the space \((\bar{\Sigma}^\infty, \sigma(\mathcal{C}), \mathbb{P})\).

Defining a Sequence Model (Step 4)

  • We have created this measure on the extended alphabet, which allows us to measure sequences that contain EOS multiple times.
  • We can define a random variable (RV) to map into EOS-free sequences.
  • RV is a measurable function between two sigma-algebras.
  • We want to map to outcome space \(\Sigma^\infty \cup \Sigma^*\).
  • We can do this in the same way as before using cylinder sets where we use prefixes from \(\Sigma^*\). \[C(\mathcal{H}) \stackrel{\mathrm{def}}{=} \{\mathbf{y} \boldsymbol{\omega} : \mathbf{y} \in \mathcal{H}, \boldsymbol{\omega} \in \Sigma^* \cup \Sigma^{\infty}\}.\]
  • Notice that now the elements of the cylinder can be also finite.
  • We can define the RV \[\mathrm{x}: (\bar{\Sigma}^{\infty}, \sigma(\bar{\mathcal{C}})) \rightarrow (\Sigma^* \cup \Sigma^{\infty}, \sigma(\mathcal{C})),\] which is a measurable mapping \[\mathrm{x}(\boldsymbol{\omega})= \begin{cases} \boldsymbol{\omega}_{<k} & \text{if $k$ is the first EOS in $\boldsymbol{\omega}$}, \\ \boldsymbol{\omega} & \text{otherwise (if EOS $\notin \boldsymbol{\omega}$)}. \end{cases}\]
  • So this mapping takes an infinite sequence containing EOS and cuts it before the first occurrence, or returns the whole sequence if EOS is not present.
  • So this defines a probability space \((\Sigma^* \cup \Sigma^\infty, \sigma(\mathcal{C}), \mathbb{P}_*)\).
  • We have done it!! Given any LNM, we can construct an associated SM, a probability space over \(\Sigma^* \cup \Sigma^\infty\) where probabilities assigned to finite sequences are the same as in the LNM.

Interpreting the Constructed Probability Space

  • For finite \(\mathbf{y} \in \Sigma^*\), \(\mathbb{P}_*(\mathbf{y}) = p_\mathrm{LN}(\mathrm{EOS} \mid \mathbf{y}) p_\mathrm{LN}(\mathbf{y})\).
  • \(\mathbb{P}_*(\Sigma^\infty)\) is the probability of generating an infinite string.
  • Singletons \(\{\mathbf{y}\}\) and \(\Sigma^\infty\) are measurable.
  • Proposition: A sequence model is tight iff \(\sum_{\mathbf{y} \in \Sigma^*} \mathbb{P}_*(\{\mathbf{y}\}) = 1\).
  • EOS probability emerges as conditional probability of exact string given prefix.

\[ \begin{aligned} p_{\mathrm{LN}}(\mathrm{EOS} \mid \boldsymbol{y}) & =\frac{\mathbb{P}_*(\mathrm{x}=\boldsymbol{y})}{p_{\mathrm{LN}}(\boldsymbol{y})} \\ & =\frac{\mathbb{P}_*(\mathrm{x}=\boldsymbol{y})}{\mathbb{P}_*(\mathrm{x} \in \mathcal{C}(\boldsymbol{y}))} \\ & =\mathbb{P}_*(\mathrm{x}=\boldsymbol{y} \mid \mathrm{x} \in \mathcal{C}(\boldsymbol{y})) \end{aligned} \]

Characterizing Tightness

  • Event \(A_k = \{\boldsymbol{\omega} \in \bar{\Sigma}^\infty : \omega_k = \mathrm{EOS}\}\).

  • Infinite strings: \(\bigcap_{k=1}^\infty A_k^c\).

  • Tight if \(\mathbb{P}(\bigcap_{k=1}^\infty A_k^c) = 0\).

  • Define \(\tilde{p}_\mathrm{EOS}(t) = \mathbb{P}(A_t \mid \bigcap_{m=1}^{t-1} A_m^c) = \frac{\sum_{\boldsymbol{\omega} \in \Sigma^{t-1}} p_\mathrm{LN}(\boldsymbol{\omega}) p_\mathrm{LN}(\mathrm{EOS} \mid \boldsymbol{\omega})}{\sum_{\boldsymbol{\omega} \in \Sigma^{t-1}} p_\mathrm{LN}(\boldsymbol{\omega})}\).

  • Theorem (sufficient and necessary condition for tightness): An LNM is tight iff \(\tilde{p}_\mathrm{EOS}(t) = 1\) for some \(t\) or \(\sum_{t=1}^\infty \tilde{p}_\mathrm{EOS}(t) = \infty\).

    • Proof: Uses product of conditionals equaling intersection probability; applies Corollary from Borel-Cantelli for series divergence.
  • Borel-Cantelli Lemmas:

    • I: If \(\sum \mathbb{P}(A_n) < \infty\), then \(\mathbb{P}(A_n\) i.o.) = 0.
    • II: If independent and \(\sum \mathbb{P}(A_n) = \infty\), then \(\mathbb{P}(A_n\) i.o.) = 1.
    • where i.o. means infinitely often which is the same as \(\bigcap_{n=1}^\infty \bigcup_{k=n}^\infty A_k\).
  • Corollary: \(\prod_{n=1}^\infty (1 - p_n) = 0 \iff \sum_{n=1}^\infty p_n = \infty\).

  • Theorem (sufficient condition for tightness): If \(p_\mathrm{LN}(\mathrm{EOS} \mid \mathbf{y}) \geq f(t)\) for all \(\mathbf{y} \in \Sigma^t\) and \(\sum_{t=1}^\infty f(t) = \infty\), then tight.

  • Lower bound via conditional probabilities summing to infinity implies zero mass on infinite intersection. In other words, if the probability of EOS given a prefix is bounded from below by a function that diverges, then the model is tight.

  • \(\tilde{p}_\mathrm{EOS}(t)\) often intractable due to partition function and therefore the lower bound is used in practice.