Part 1 - Week 4
Ryan Cotterell
Classical Language Models
This chapter introduces classical language modeling frameworks, focusing on finite-state language models (a generalization of n-gram models). Thanks to their simplicity, these older approaches distill key concepts and serve as baselines for the modern neural architectures discussed in Chapter 5. Key questions: How can we tractably represent the conditional distributions \(p_{SM}(y \mid \mathbf{y})\)? How can we represent hierarchical structure in language?
Finite-State Language Models
- Finite-state language models generalize n-gram models and are based on finite-state automata (FSA), which distinguish a finite number of contexts when modeling conditional distributions \(p_M(y \mid \mathbf{y})\).
- Informal Definition: A language model \(p_{LM}\) is finite-state if it defines only finitely many unique conditional distributions \(p_{LM}(y \mid \mathbf{y})\).
- This bounds the number of distributions to learn but is insufficient for human language; still useful as baselines and theoretical tools.
Finite-State Automata (FSA)
- Definition: An FSA is a 5-tuple \((\Sigma, Q, I, F, \delta)\), where:
- \(\Sigma\) is the alphabet.
- \(Q\) is a finite set of states.
- \(I \subseteq Q\) is the set of initial states.
- \(F \subseteq Q\) is the set of final states.
- \(\delta \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times Q\) is a finite multiset of transitions (notation: \(q_1 \xrightarrow{a} q_2\) for \((q_1, a, q_2) \in \delta\); multiset allows duplicates).
- Graphically: Labeled directed multigraph; vertices = states; edges = transitions; initial states marked by incoming arrow; final states by double circle.
- FSA reads input string \(y \in \Sigma^*\) sequentially, transitioning per \(\delta\); starts in \(I\); \(\epsilon\)-transitions consume no input.
- Non-determinism: Multiple transitions possible; take all in parallel.
- Deterministic FSA: No \(\epsilon\)-transitions; at most one transition per \((q, a) \in Q \times \Sigma\); single initial state.
- Deterministic and non-deterministic FSAs are equivalent (can convert between them).
- Accepts \(y\) if ends in \(F\) after reading \(y\).
- Language: \(L(\mathcal{A}) = \{ y \in \Sigma^* \mid y \text{ is recognized by } \mathcal{A} \}\).
- Regular Language: \(L \subseteq \Sigma^*\) is regular iff recognized by some FSA.
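To make the acceptance procedure concrete, here is a minimal sketch in Python that simulates all runs of a (possibly non-deterministic) FSA in parallel, with `""` standing in for \(\epsilon\); the encoding of the automaton as plain Python sets is an illustrative choice, not a fixed API.

```python
# Minimal non-deterministic FSA acceptance check (a sketch).
def eps_closure(states, delta):
    """All states reachable from `states` via epsilon-transitions."""
    stack, closure = list(states), set(states)
    while stack:
        q = stack.pop()
        for (q1, a, q2) in delta:
            if q1 == q and a == "" and q2 not in closure:
                closure.add(q2)
                stack.append(q2)
    return closure

def accepts(y, delta, I, F):
    """Simulate all runs in parallel: y is accepted iff some run ends in F."""
    current = eps_closure(I, delta)
    for a in y:
        nxt = {q2 for (q1, b, q2) in delta if q1 in current and b == a}
        current = eps_closure(nxt, delta)
    return bool(current & F)

# Toy FSA over {a, b} accepting exactly the strings that end in "ab".
delta = {(0, "a", 0), (0, "b", 0), (0, "a", 1), (1, "b", 2)}
print(accepts("aab", delta, I={0}, F={2}))  # True
print(accepts("aba", delta, I={0}, F={2}))  # False
```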
Weighted Finite-State Automata (WFSA)
- Augment FSA with real-valued weights.
- Definition: WFSA \(\mathcal{A} = (\Sigma, Q, \delta, \lambda, \rho)\), where:
- \(\delta \subseteq Q \times (\Sigma \cup \{\epsilon\}) \times \mathbb{R} \times Q\) (notation: \(q \xrightarrow{a/w} q'\)).
- \(\lambda: Q \to \mathbb{R}\) (initial weights).
- \(\rho: Q \to \mathbb{R}\) (final weights).
- Implicit initials: \(\{q \mid \lambda(q) \neq 0\}\); finals: \(\{q \mid \rho(q) \neq 0\}\).
- Transition Matrix: \(T(a)_{ij} =\) sum of weights from \(i\) to \(j\) labeled \(a\); \(T = \sum_{a \in \Sigma} T(a)\).
- Path: Sequence of consecutive transitions; \(|\pi| =\) length; \(p(\pi), n(\pi) =\) origin/destination; \(s(\pi) =\) yield.
- Path sets: \(\Pi(\mathcal{A}) =\) all paths; \(\Pi(\mathcal{A}, y) =\) yielding \(y\); \(\Pi(\mathcal{A}, q, q') =\) from \(q\) to \(q'\).
- Path Weight: Inner weight \(w_I(\pi) = \prod_{n=1}^N w_n\) for \(\pi = q_1 \xrightarrow{a_1/w_1} q_2 \xrightarrow{a_2/w_2} \cdots \xrightarrow{a_N/w_N} q_{N+1}\); full weight \(w(\pi) = \lambda(p(\pi)) \cdot w_I(\pi) \cdot \rho(n(\pi))\).
- Accepting: \(w(\pi) \neq 0\).
- Stringsum: \(\mathcal{A}(y) = \sum_{\pi \in \Pi(\mathcal{A}, y)} w(\pi)\).
- Weighted Language: \(L(\mathcal{A}) = \{(y, \mathcal{A}(y)) \mid y \in \Sigma^*\}\); weighted regular if from WFSA.
- State Allsum (backward value \(\beta(q)\)): \(Z(\mathcal{A}, q) = \sum_{\pi: p(\pi)=q} w_I(\pi) \cdot \rho(n(\pi))\).
- Allsum: \(Z(\mathcal{A}) = \sum_y \mathcal{A}(y) = \sum_\pi w(\pi)\); normalizable if finite.
- Accessibility: \(q\) accessible if non-zero path from initial; co-accessible if to final; useful if both.
- Trim: Remove useless states (preserves non-zero weights).
- Probabilistic WFSA (PFSA): \(\lambda(q) \geq 0\) with \(\sum_q \lambda(q) = 1\); all transition and final weights \(\geq 0\); for each state \(q\), the outgoing and final weights sum to one: \(\rho(q) + \sum_{q \xrightarrow{a/w} q'} w = 1\).
- Final weights replace EOS; corresponds to locally normalized models.
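As a concrete illustration of the definitions above, the following sketch computes the stringsum of an \(\epsilon\)-free WFSA via its transition matrices, \(\mathcal{A}(y) = \vec{\lambda}\, T(y_1) \cdots T(y_N)\, \vec{\rho}\); the toy weights are arbitrary.

```python
import numpy as np

# Stringsum of an epsilon-free WFSA via its transition matrices (a sketch):
# A(y) = lam T(y_1) ... T(y_N) rho, following the definitions above.
lam = np.array([1.0, 0.0])              # initial weights lambda
rho = np.array([0.0, 0.5])              # final weights rho
T = {                                   # T[a][i, j] = weight of i --a/w--> j
    "a": np.array([[0.4, 0.6], [0.0, 0.0]]),
    "b": np.array([[0.0, 0.0], [0.0, 0.5]]),
}

def stringsum(y):
    v = lam.copy()
    for a in y:
        v = v @ T[a]                    # accumulate inner path weights
    return float(v @ rho)               # multiply in the final weights

print(stringsum("ab"))   # 0.6 * 0.5 * 0.5 = 0.15
print(stringsum("abb"))  # 0.6 * 0.5 * 0.5 * 0.5 = 0.075
```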
Finite-State Language Models (FSLM)
- Definition: \(p_{LM}\) finite-state if \(L(p_{LM}) = L(\mathcal{A})\) for some WFSA (weighted regular language).
- Path/String Prob in PFSA: Path prob = weight; string prob \(p_{\mathcal{A}}(y) = \mathcal{A}(y)\) (local; may not be tight).
- String Prob in WFSA: \(p_{\mathcal{A}}(y) = \mathcal{A}(y) / Z(\mathcal{A})\) (global, tight if normalizable, non-negative).
- Induced LM: \(p_{LM^{\mathcal{A}}}(y) = p_{\mathcal{A}}(y)\).
Normalizing Finite-State Language Models
- Pairwise pathsums: \(M_{ij} = \sum_{\pi: i \to j} w_I(\pi)\); \(Z(\mathcal{A}) = \vec{\lambda} M \vec{\rho}\).
- Proof: Group paths by origin/destination; factor \(\lambda(i), \rho(j)\).
- \((T^d)_{ij} = \sum_{\pi: i \to j,\ |\pi| = d} w_I(\pi)\), i.e., the sum of inner weights over length-\(d\) paths from \(i\) to \(j\) (proof by induction on \(d\)).
- \(T^* = \sum_{d=0}^\infty T^d = (I - T)^{-1}\) if converges; \(M = T^*\).
- Runtime \(O(|Q|^3)\) via matrix inversion; this is a special case of Lehmann’s algorithm.
- Converges iff spectral radius \(\rho_s(T) < 1\) (eigenvalues \(<1\) in magnitude).
- Speed-ups: for acyclic WFSAs the allsum is computable in time linear in the number of transitions (Viterbi-style dynamic programming); for sparse WFSAs, decompose into strongly connected components and combine both algorithms.
- The allsum is differentiable in the weights, enabling gradient-based training (a numerical sketch follows this list).
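A minimal numerical sketch of the allsum identity \(Z(\mathcal{A}) = \vec{\lambda} (I - T)^{-1} \vec{\rho}\), reusing the toy WFSA from the previous sketch and assuming \(\rho_s(T) < 1\):

```python
import numpy as np

# Allsum of a WFSA via Z(A) = lam (I - T)^{-1} rho, valid whenever the
# spectral radius of the label-summed transition matrix T is < 1 (a sketch).
lam = np.array([1.0, 0.0])
rho = np.array([0.0, 0.5])
T_a = np.array([[0.4, 0.6], [0.0, 0.0]])
T_b = np.array([[0.0, 0.0], [0.0, 0.5]])
T = T_a + T_b                            # T = sum over labels of T(a)

assert max(abs(np.linalg.eigvals(T))) < 1, "geometric series does not converge"
Tstar = np.linalg.inv(np.eye(len(lam)) - T)   # T* = sum_d T^d
Z = lam @ Tstar @ rho
print(Z)
```

Here \(Z(\mathcal{A}) = 1\), because this particular toy WFSA happens to already be probabilistic.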
Locally Normalizing a Globally Normalized Finite-State Language Model
- An instance of weight pushing: compute the state allsums, then reweight the transitions (see the sketch after this list).
- Theorem: Normalizable non-negative WFSA and tight PFSA equally expressive.
- Proof (\(\Leftarrow\)): Trivial (tight PFSA is WFSA with \(Z=1\)).
- (\(\Rightarrow\)): For trim \(\mathcal{A}_G\), construct \(\mathcal{A}_L\) with same structure; reweight initials \(\lambda_L(q) = \lambda(q) Z(\mathcal{A}, q) / Z(\mathcal{A})\); finals \(\rho_L(q) = \rho(q) / Z(\mathcal{A}, q)\); transitions \(w_L = w \cdot Z(\mathcal{A}, q') / Z(\mathcal{A}, q)\).
- Non-negative; sum to 1 (by allsum recursion lemma).
- Path probs match: the telescoping product of internal allsums cancels, leaving \(w(\pi)/Z(\mathcal{A})\).
- Lemma (Allsum Recursion): \(Z(\mathcal{A}, q) = \sum_{q \xrightarrow{a/w} q'} w \cdot Z(\mathcal{A}, q') + \rho(q)\).
- Proof: Partition paths by first transition; recurse on tails.
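The construction in the proof can be carried out numerically; the following is a sketch of weight pushing on a small toy WFSA (the parameter values are arbitrary):

```python
import numpy as np

# Local normalization by weight pushing (a sketch of the construction above):
# beta(q) = Z(A, q) solves (I - T) beta = rho; then reweight initial,
# transition, and final weights so that the result is a PFSA.
lam = np.array([0.5, 0.0])
rho = np.array([0.0, 1.0])
T_sym = {"a": np.array([[0.2, 0.6], [0.0, 0.0]]),
         "b": np.array([[0.0, 0.0], [0.0, 0.3]])}
T = sum(T_sym.values())

beta = np.linalg.solve(np.eye(2) - T, rho)      # state allsums Z(A, q)
Z = lam @ beta                                  # allsum Z(A)

lam_L = lam * beta / Z                          # lambda_L(q) = lambda(q) Z(A,q) / Z(A)
rho_L = rho / beta                              # rho_L(q)    = rho(q) / Z(A,q)
T_L = {a: (M * beta[None, :]) / beta[:, None]   # w_L = w * Z(A,q') / Z(A,q)
       for a, M in T_sym.items()}

# Sanity check of local normalization: outgoing plus final weights sum to 1.
print(sum(T_L.values()).sum(axis=1) + rho_L)    # -> [1. 1.]
print(lam_L.sum())                              # -> 1.0
```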
Defining a Parametrized Globally Normalized Language Model
- Parametrize transitions via score \(f_\theta^\delta: Q \times \Sigma \times Q \to \mathbb{R}\); initials \(f_\theta^\lambda: Q \to \mathbb{R}\); finals \(f_\theta^\rho: Q \to \mathbb{R}\).
- Defines \(\mathcal{A}_\theta\); prob \(p(y) = \mathcal{A}_\theta(y) / Z(\mathcal{A}_\theta)\); energy \(E(y) = -\log \mathcal{A}_\theta(y)\) for GN form.
- \(f_\theta\) can be neural net/lookup; states encode context (e.g., n-grams).
- Trainable via differentiable allsum/stringsum.
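A toy sketch of such a parametrization: transition scores live in a lookup table \(\theta\) and are exponentiated to obtain positive weights (the rescaling below exists only to keep this toy example normalizable); with an autodiff framework the same computation yields gradients for training. All names and sizes are illustrative.

```python
import numpy as np

# Toy globally normalized finite-state LM (a sketch): p(y) = A_theta(y) / Z(A_theta).
rng = np.random.default_rng(0)
Q, alphabet = 2, ["a", "b"]
theta = {a: rng.normal(size=(Q, Q)) for a in alphabet}   # transition scores f_theta^delta
theta_lam, theta_rho = rng.normal(size=Q), rng.normal(size=Q)

lam, rho = np.exp(theta_lam), np.exp(theta_rho)          # positive initial/final weights
weights = {a: np.exp(theta[a]) for a in alphabet}
scale = 0.5 / sum(weights.values()).sum(axis=1).max()    # keep rho_s(T) < 1 in this toy
T = {a: scale * w for a, w in weights.items()}

def stringsum(y):
    v = lam.copy()
    for a in y:
        v = v @ T[a]
    return float(v @ rho)

Tsum = sum(T.values())
Z = lam @ np.linalg.inv(np.eye(Q) - Tsum) @ rho          # allsum Z(A_theta)
print(stringsum("ab") / Z)                               # p("ab") under the GN model
```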
Tightness of Finite-State Models
- Globally normalized finite-state language models are tight by definition, as probabilities sum to 1 over all finite strings.
- Focus on locally normalized finite-state models, which are exactly probabilistic weighted finite-state automata (PFSA).
- Theorem: A PFSA is tight if and only if all accessible states are also co-accessible.
- Proof (\(\Rightarrow\)): Assume tightness. Let \(q \in Q\) be accessible (reachable with positive probability). By tightness, there must be a positive-probability path from \(q\) to termination; otherwise the model would fail to terminate with positive probability after reaching \(q\). Thus \(q\) is co-accessible.
- (\(\Leftarrow\)): Assume all accessible states are co-accessible. Consider the Markov chain over the accessible states \(Q_A \subseteq Q\) (all other states have probability 0). EOS is absorbing and reachable from every state in \(Q_A\) (by co-accessibility). Suppose there were another absorbing set distinct from EOS; its states could not reach EOS, contradicting co-accessibility. Thus EOS is the unique absorbing set, the process is absorbed with probability 1, and the model is tight.
- Trimming a PFSA may break the sum-to-one condition, leaving only \(\rho(q) + \sum_{q \xrightarrow{a/w} q'} w \leq 1\); such automata are substochastic WFSAs.
- Definition (Substochastic WFSA): WFSA where \(\lambda(q) \geq 0\), \(\rho(q) \geq 0\), \(w \geq 0\), and for all \(q\), \(\rho(q) + \sum w \leq 1\).
- Theorem: Let \(T'\) be transition matrix of trimmed substochastic WFSA. Then \(I - T'\) invertible; termination probability \(p(\Sigma^*) = \vec{\lambda}' (I - T')^{-1} \vec{\rho}' \leq 1\).
- Spectral Radius: \(\rho_s(M) = \max \{|\lambda_i|\}\) over eigenvalues \(\lambda_i\).
- Lemma: \(\rho_s(T') < 1\).
- Proof: Bring \(T'\) into Frobenius normal form, i.e., block upper triangular with irreducible blocks \(T'_k\). Each block \(T'_k\) is substochastic, and strictly so: in a trimmed automaton, a fully stochastic block would correspond to states that can never reach a final weight, contradicting co-accessibility. By Proposition 4.1.1, an irreducible substochastic block with at least one row sum \(< 1\) satisfies \(\rho_s(T'_k) < 1\). Thus \(\rho_s(T') = \max_k \rho_s(T'_k) < 1\).
- Proof (of the theorem): \(\rho_s(T') < 1\) implies \(I - T'\) is invertible and the Neumann series \((I - T')^{-1} = \sum_{k=0}^\infty (T')^k\) converges. Then \(p(\Sigma^*) = \sum_{k=0}^\infty p(\Sigma^k) = \sum_{k=0}^\infty \vec{\lambda}' (T')^k \vec{\rho}' = \vec{\lambda}' (I - T')^{-1} \vec{\rho}' \leq 1\) (by substochasticity). A numerical check is sketched below.
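A numerical sketch of both criteria on a two-state PFSA in which state 1 is an absorbing sink with zero final weight (the encoding is an illustrative choice):

```python
import numpy as np

# Tightness check for a PFSA (a sketch): per the theorem above, the model is
# tight iff every accessible state is also co-accessible.  The termination
# probability is lam' (I - T')^{-1} rho' computed on the trimmed automaton.
lam = np.array([1.0, 0.0])
rho = np.array([0.2, 0.0])
# One-symbol alphabet for brevity; state 1 is an absorbing sink with rho = 0.
T = np.array([[0.3, 0.5],
              [0.0, 1.0]])

def reachable(start, edges):
    seen, stack = set(start), list(start)
    while stack:
        i = stack.pop()
        for j in edges[i]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

n = len(lam)
fwd = {i: [j for j in range(n) if T[i, j] > 0] for i in range(n)}
bwd = {j: [i for i in range(n) if T[i, j] > 0] for j in range(n)}
accessible = reachable([i for i in range(n) if lam[i] > 0], fwd)
co_accessible = reachable([i for i in range(n) if rho[i] > 0], bwd)
print("tight:", accessible <= co_accessible)          # False: state 1 is useless

useful = sorted(accessible & co_accessible)            # trim
sub = np.ix_(useful, useful)
p_halt = lam[useful] @ np.linalg.inv(np.eye(len(useful)) - T[sub]) @ rho[useful]
print("p(Sigma*):", p_halt)                            # 0.2 / 0.7 < 1, not tight
```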
The n-gram Assumption and Subregularity
- n-gram models are a special case of finite-state language models; all results apply.
- Modeling all conditional distributions \(p_{SM}(y_t \mid \mathbf{y}_{<t})\) is intractable (infinite histories as \(t \to \infty\)).
- Assumption (n-gram): Probability of \(y_t\) depends only on previous \(n-1\) symbols: \(p_{SM}(y_t \mid \mathbf{y}_{<t}) = p_{SM}(y_t \mid \mathbf{y}_{t-n+1}^{t-1})\).
- Alias: \((n-1)\)th-order Markov assumption in language modeling.
- Handle edges (\(t < n\)): Pad with BOS symbols, transforming \(\mathbf{y}_{1:t} \to\) BOS repeated \(n-1-t\) times + \(\mathbf{y}_{1:t}\); assume \(p_{SM}\) ignores extra BOS.
- Limits unique distributions to \(O(|\Sigma|^{n-1})\); dependencies span at most \(n\) tokens; parallelizable (e.g., like transformers).
- Historical significance: Shannon (1948), Baker (1975), Jelinek (1976), etc.; baselines for neural models.
- WFSA Representation of n-gram Models: n-gram LMs are finite-state (only finitely many distinct histories matter).
- Alphabet \(\Sigma\) (exclude EOS; use finals).
- States \(Q = \bigcup_{t=0}^{n-1} \{\text{BOS}\}^{n-1-t} \times \Sigma^t\).
- Transitions: From history \(\mathbf{y}_{t-n+1:t-1}\) to \(\mathbf{y}_{t-n+2:t}\) on symbol \(y_t\), with weight \(p_{SM}(y_t \mid \mathbf{y}_{t-n+1:t-1})\).
- Initial: \(\lambda(\text{BOS}^{n-1}) = 1\), others 0.
- Final: \(\rho(\mathbf{y}) = p_{SM}(\text{EOS} \mid \mathbf{y})\).
- Proof of equivalence is left to the reader; the WFSA models strings up to the first EOS (a bigram construction is sketched at the end of this section).
- Parametrized WFSA for n-gram: Parametrize scores for “suitability” of n-grams; globally normalize; fit via training (e.g., §3.2.3).
- Subregularity: subregular languages are those recognizable by FSAs or even weaker devices; they form hierarchies within the regular languages.
- n-gram models correspond to the strictly local (SL) languages, the simplest subregular class beyond the finite languages: membership is decided by which blocks of consecutive symbols may occur.
- Definition (Strictly Local): A language \(L\) is \(SL_n\) if for every \(\mathbf{y}\) with \(|\mathbf{y}| = n-1\) and all \(\mathbf{x}_1, \mathbf{x}_2, \mathbf{z}_1, \mathbf{z}_2\): if \(\mathbf{x}_1 \mathbf{y} \mathbf{z}_1 \in L\) and \(\mathbf{x}_2 \mathbf{y} \mathbf{z}_2 \in L\), then \(\mathbf{x}_1 \mathbf{y} \mathbf{z}_2 \in L\) (and symmetrically \(\mathbf{x}_2 \mathbf{y} \mathbf{z}_1 \in L\)). \(L\) is SL if it is \(SL_n\) for some \(n\).
- This matches the n-gram assumption: context beyond the previous \(n-1\) symbols is irrelevant.
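Putting the construction together, the following sketch estimates bigram (\(n = 2\)) conditionals by MLE counts from a toy corpus and lays them out as the WFSA described above: states are single-symbol histories (plus BOS), BOS is the sole initial state, and EOS is folded into the final weights. The corpus, symbol names, and data layout are made up for illustration.

```python
from collections import Counter

# Bigram (n = 2) WFSA built from MLE counts (a sketch of the construction
# above): transition y' --y / p(y | y')--> y, final weight rho(y') = p(EOS | y').
corpus = ["ab", "aab", "abb"]
BOS, EOS = "<s>", "</s>"

counts = Counter()
for y in corpus:
    padded = [BOS] + list(y) + [EOS]
    for prev, cur in zip(padded, padded[1:]):
        counts[(prev, cur)] += 1
context_totals = Counter()
for (prev, _), c in counts.items():
    context_totals[prev] += c

def p(cur, prev):                      # MLE: C(prev cur) / C(prev)
    return counts[(prev, cur)] / context_totals[prev]

states = {BOS} | {a for y in corpus for a in y}
lam = {q: 1.0 if q == BOS else 0.0 for q in states}            # start in BOS
rho = {q: p(EOS, q) if context_totals[q] else 0.0 for q in states}
delta = [(q, a, p(a, q), a)                                     # q --a/w--> a
         for q in states for a in states - {BOS}
         if context_totals[q] and counts[(q, a)]]

for q, a, w, r in delta:
    print(f"{q} --{a}/{w:.2f}--> {r}")
print("final weights:", rho)
```

In this toy example each state’s outgoing weights plus its final weight sum to 1, so the constructed WFSA is a PFSA.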
Representation-Based n-gram Models
- Simple: Parametrize \(\theta = \{\zeta_{y'|\mathbf{y}} = p_{SM}(y' \mid \mathbf{y}) \mid |\mathbf{y}|=n-1, y' \in \Sigma\}\), non-negative, sum to 1 per context.
- MLE: \(\zeta_{y_n | \mathbf{y}_{<n}} = C(\mathbf{y}) / C(\mathbf{y}_{<n})\) if defined (\(C=\) corpus counts).
- Proof: Log-likelihood \(\ell(\mathcal{D}) = \sum_{|\mathbf{y}|=n} C(\mathbf{y}) \log \zeta_{y_n | \mathbf{y}_{<n}}\) (token-to-type switch); KKT yields solution (independent per context).
- Issues: No word similarity (e.g., “cat” unseen but similar to “kitten”, “puppy”); no generalization across paraphrases.
- Neural n-gram (Bengio et al., 2003): Use embeddings \(E \in \mathbb{R}^{|\Sigma| \times d}\); encoder \(\operatorname{enc}(\mathbf{y}_{t-n+1}^{t-1}) = b + W x + U \tanh(d + H x)\), where \(x = \operatorname{concat}(E[y_{t-1}], \dots, E[y_{t-n+1}])\) concatenates the context embeddings (\(b, d\) are bias vectors, \(W, U, H\) weight matrices).
- \(p_{SM}(y_t \mid \mathbf{y}_{<t}) = \operatorname{softmax}(E \operatorname{enc}(\mathbf{y}_{t-n+1}^{t-1}) + b)_{y_t}\); a locally normalized model.
- Shares parameters via embeddings/encoder; generalizes semantically; parameters linear in \(n\) (vs. \(|\Sigma|^n\)).
- Still an n-gram model (finite context); representable as a WFSA with neurally parametrized weights (a minimal forward pass is sketched below).
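A minimal numpy sketch of the forward pass in the spirit of Bengio et al. (2003); the dimensions, initialization, and variable names are illustrative choices, not the canonical implementation.

```python
import numpy as np

# Minimal neural n-gram forward pass (a sketch).
rng = np.random.default_rng(0)
V, d, h, n = 5, 8, 16, 3                 # vocab size, embed dim, hidden dim, n

E = rng.normal(scale=0.1, size=(V, d))               # symbol embeddings
H = rng.normal(scale=0.1, size=(h, (n - 1) * d))     # x -> hidden
U = rng.normal(scale=0.1, size=(d, h))               # hidden -> enc
W = rng.normal(scale=0.1, size=(d, (n - 1) * d))     # direct (skip) connection
b_enc, d_bias = np.zeros(d), np.zeros(h)             # biases (b and d in the text)
b_out = np.zeros(V)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def p_next(context):                      # context: ids of the n-1 previous symbols
    x = np.concatenate([E[i] for i in context])          # concatenated embeddings
    enc = b_enc + W @ x + U @ np.tanh(d_bias + H @ x)    # encoder
    return softmax(E @ enc + b_out)                      # distribution over V

probs = p_next([2, 4])                    # a history of length n - 1 = 2
print(probs, probs.sum())                 # a proper distribution (sums to 1)
```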
Pushdown Language Models
- Human language exhibits, in principle, unbounded recursion (e.g., center embeddings: "The cat the dog the mouse… likes to cuddle"), which cannot be tracked with finitely many states; such languages are not regular.
- Context-free languages (CFL) capture hierarchies; generated by context-free grammars (CFG), recognized by pushdown automata (PDA).
- Competence vs. Performance: Theoretical grammar allows arbitrary depth; actual use limits nesting (rarely >3; Miller and Chomsky, 1963).
Context-Free Grammars (CFG)
- Definition: 4-tuple \(G = (\Sigma, N, S, P)\); \(\Sigma\) terminals; \(N\) non-terminals (\(N \cap \Sigma = \emptyset\)); \(S \in N\) start; \(P \subseteq N \times (N \cup \Sigma)^*\) productions (\(X \to \alpha\)).
- Rule Application: Replace \(X\) in \(\beta = \gamma X \delta\) with \(\alpha\) to get \(\gamma \alpha \delta\) if \(X \to \alpha \in P\).
- Derivation: Sequence \(\alpha_1, \dots, \alpha_M\) with \(\alpha_1 \in N\), \(\alpha_M \in \Sigma^*\), each from applying rule to previous.
- Derives: \(X \implies_G \beta\) if direct; \(X \overset{*}{\implies}_G \beta\) if transitive closure.
- Language: \(L(G) = \{ y \in \Sigma^* \mid S \overset{*}{\implies}_G y \}\).
- Parse Tree: Tree with root \(S\), internal nodes non-terminals, children per production RHS; yield \(s(d) =\) leaf concatenation.
- Derivation Set: \(D_G(y) = \{d \mid s(d)=y\}\); \(D_G =\) union over \(y \in L(G)\); \(D_G(Y) =\) subtrees rooted at \(Y\) (\(D_G(a)=\emptyset\) for terminal \(a\)).
- Ambiguity: Grammar ambiguous if some \(y\) has \(|D_G(y)| >1\); unambiguous if =1.
- Example: the grammar with productions \(S \to a S b \mid \epsilon\) generates \(\{a^n b^n \mid n \geq 0\}\); every string has a unique parse tree (see the sketch after this list).
- Dyck Languages \(D(k)\): Well-nested brackets of \(k\) types; grammar with productions \(S \to \epsilon \mid SS \mid \langle_i S \rangle_i\) for \(i = 1, \dots, k\).
- Reachable/Generating: A symbol \(X\) is reachable if \(S \overset{*}{\implies}_G \alpha X \beta\) for some \(\alpha, \beta\); a non-terminal is generating if it derives some \(y \in \Sigma^*\).
- Pruned CFG: All non-terminals reachable and generating.
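For the \(\{a^n b^n\}\) example above, membership can be checked by mirroring the unique derivations of \(S \to a S b \mid \epsilon\); the following short sketch is specific to this grammar, not a general CFG parser.

```python
# Membership check for {a^n b^n | n >= 0}, mirroring its unambiguous CFG:
# peel one 'a' from the front and one 'b' from the back, recursing on S.
def in_anbn(y: str) -> bool:
    if y == "":                       # S -> eps
        return True
    if y.startswith("a") and y.endswith("b"):
        return in_anbn(y[1:-1])       # S -> a S b
    return False

print([w for w in ["", "ab", "aabb", "aab", "ba"] if in_anbn(w)])
# ['', 'ab', 'aabb']
```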
Weighted Context-Free Grammars (WCFG)
- Definition: 5-tuple \((\Sigma, N, S, P, W)\); \(W: P \to \mathbb{R}\).
- Tree Weight: \(w(d) = \prod_{(X \to \alpha) \in d} W(X \to \alpha)\).
- Stringsum: \(G(y) = \sum_{d \in D_G(y)} w(d)\).
- Allsum (Non-terminal \(Y\)): \(Z(G, Y) = \sum_{d \in D_G(Y)} w(d)\); terminals \(Z(a)=1\); \(Z(G)=Z(G,S)\); normalizable if finite.
- Probabilistic CFG (PCFG): Weights \(\geq 0\); \(\sum_{X \to \alpha} W(X \to \alpha)=1\) per \(X\).
- Context-Free LM: \(p_{LM}\) context-free if \(L(p_{LM}) = L(G)\) for some WCFG.
- Tree/String Prob in PCFG: Tree prob = weight; string prob \(p_G(y) = G(y)\).
- String Prob in WCFG: \(p_G(y) = G(y)/Z(G)\).
- Induced LM: \(p_{LM^G}(y) = p_G(y)\).
Tightness of Context-Free Language Models
- Globally normalized context-free LMs are always tight (by construction).
- Generation Level: \(\gamma_0 = S\); \(\gamma_l\) is obtained by applying productions to all non-terminals in \(\gamma_{l-1}\).
- Production Generating Function: For \(X_n\), \(g_n(s_1,\dots,s_N) = \sum_{X_n \to \alpha} W(X_n \to \alpha) s_1^{r_1(\alpha)} \cdots s_N^{r_N(\alpha)}\) (\(r_m(\alpha)=\) occurrences of \(X_m\) in \(\alpha\)).
- Generating Function: \(G_0(s_1,\dots,s_N)=s_1\) (for \(S=X_1\)); \(G_l(s_1,\dots,s_N) = G_{l-1}(g_1(s),\dots,g_N(s))\).
- Decompose \(G_l(s) = D_l(s) + C_l\), where \(D_l\) has no constant term and the constant \(C_l\) is the probability of terminating within \(l\) levels.
- Lemma: PCFG tight iff \(\lim_{l \to \infty} C_l =1\).
- Proof: The limit is the total probability of generating a finite string; if it is \(< 1\), the remaining mass lies on infinite derivations.
- First-Moment Matrix: \(E_{nm} = \partial g_n / \partial s_m |_{s=1} = \sum_{X_n \to \alpha} W(X_n \to \alpha) r_m(\alpha)\) (expected occurrences of \(X_m\) from \(X_n\)).
- Theorem: PCFG tight if \(|\lambda_{\max}(E)| <1\); non-tight if \(>1\).
- Proof sketch: The coefficient of \(s_1^{r_1} \cdots s_N^{r_N}\) in \(G_l\) is the probability of having exactly \(r_m\) copies of \(X_m\) at level \(l\). The expected counts are \(\vec{r}_l = \vec{r}_0 E^l = [1, 0, \dots, 0]\, E^l\). If \(|\lambda_{\max}(E)| < 1\), then \(E^l \to 0\), the expected number of surviving non-terminals vanishes, and the process terminates with probability 1; if \(|\lambda_{\max}(E)| > 1\), the expected counts diverge and the PCFG is non-tight.
- A WCFG whose weights are estimated by MLE from data is always tight (Chi, 1999); a spectral-radius check is sketched below.
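A sketch of the spectral-radius check on the one-non-terminal grammar \(S \to S\,S\) (weight \(p\)) \(\mid a\) (weight \(1-p\)), whose first-moment matrix is the \(1 \times 1\) matrix \([2p]\). The helper below is an illustrative construction, not a library routine.

```python
import numpy as np

# First-moment (mean offspring) matrix E of a PCFG and the tightness check above.
def first_moment(rules, nonterminals):
    """rules: list of (lhs, rhs_symbols, prob)."""
    idx = {X: i for i, X in enumerate(nonterminals)}
    E = np.zeros((len(nonterminals), len(nonterminals)))
    for lhs, rhs, prob in rules:
        for sym in rhs:
            if sym in idx:                       # count non-terminal occurrences
                E[idx[lhs], idx[sym]] += prob
    return E

for p in (0.4, 0.6):
    rules = [("S", ["S", "S"], p), ("S", ["a"], 1 - p)]
    E = first_moment(rules, ["S"])
    radius = max(abs(np.linalg.eigvals(E)))
    print(f"p = {p}: spectral radius {radius:.2f} ->",
          "tight" if radius < 1 else "not guaranteed tight")
```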
Normalizing Weighted Context-Free Grammars (WCFG)
- To use a WCFG as a language model, compute stringsum \(G(y)\) and allsum \(Z(G)\) for normalization.
- The allsum requires solving a system of nonlinear equations (beyond scope; see Advanced Formal Language Theory for details); a simple fixed-point approximation is sketched at the end of this section.
- Theorem (Smith and Johnson, 2007): Normalizable WCFG with non-negative weights and tight probabilistic context-free grammars (PCFG) are equally expressive.
- Proof (\(\Leftarrow\)): Tight PCFG is WCFG with \(Z(G)=1\); trivial.
- (\(\Rightarrow\)): Let \(G_G = (\Sigma, N, S, P, W)\) be pruned normalizable non-negative WCFG. Construct \(G_L = (\Sigma, N, S, P, W_L)\) with same structure.
- \(W_L(X \to \alpha) = W(X \to \alpha) \cdot \prod_{Y \in \alpha} Z(G, Y) / Z(G, X)\) (reweight using allsums).
- Non-negative (inherits from \(W\)).
- Sums to 1 per \(X\): By allsum definition \(Z(G, X) = \sum_{X \to \alpha} W(X \to \alpha) \prod_{Y \in \alpha} Z(G, Y)\).
- Tree probs match: in the product over a tree, the internal allsums cancel, leaving \(w(d)/Z(G)\).
- The reweighting thus preserves the (normalized) weighted language.
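As a rough illustration (not the exact allsum algorithm, which solves the nonlinear system directly), the sketch below runs a naive fixed-point iteration of the recursion \(Z(G, X) = \sum_{X \to \alpha} W(X \to \alpha) \prod_{Y \in \alpha} Z(G, Y)\) and then applies the reweighting above. The toy WCFG is \(S \to a S b\) (weight 0.3) \(\mid \epsilon\) (weight 0.6).

```python
import numpy as np

# Naive fixed-point iteration for the non-terminal allsums Z(G, X), followed by
# the local-normalization reweighting from the proof above (a sketch).
rules = [
    ("S", ["a", "S", "b"], 0.3),
    ("S", [], 0.6),                          # S -> eps, weight 0.6
]
nonterminals = {"S"}

Z = {X: 0.0 for X in nonterminals}
for _ in range(200):                         # iterate Z(X) = sum_r W(r) * prod Z(Y)
    Z = {X: sum(w * np.prod([Z[s] for s in rhs if s in nonterminals])
                for lhs, rhs, w in rules if lhs == X)
         for X in nonterminals}
print("allsums:", Z)                         # Z(S) = 0.6 / (1 - 0.3) ~ 0.857

W_L = {}
for lhs, rhs, w in rules:                    # W_L(X -> alpha) = W * prod Z(Y) / Z(X)
    inner = np.prod([Z[s] for s in rhs if s in nonterminals])
    W_L[(lhs, tuple(rhs))] = w * inner / Z[lhs]
print("locally normalized:", W_L)            # rule weights now sum to 1 per lhs
```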
Single-Stack Pushdown Automata
- CFGs generate languages; pushdown automata (PDA) recognize them (decide membership).
- PDA extend FSA with a stack for unbounded memory.
- Definition: PDA \(\mathcal{P} = (\Sigma, Q, \Gamma, \delta, (q_\iota, \gamma_\iota), (q_\varphi, \gamma_\varphi))\), where:
- \(\Sigma\): Input alphabet.
- \(Q\): Finite states.
- \(\Gamma\): Stack alphabet.
- \(\delta \subseteq Q \times \Gamma^* \times (\Sigma \cup \{\epsilon\}) \times Q \times \Gamma^*\): Transition multiset.
- \((q_\iota, \gamma_\iota)\): Initial configuration (\(q_\iota \in Q\), \(\gamma_\iota \in \Gamma^*\)).
- \((q_\varphi, \gamma_\varphi)\): Final configuration.
- Stack as string \(\gamma = X_1 \cdots X_n \in \Gamma^*\) (\(X_1\) bottom, \(X_n\) top; \(\epsilon\) denotes the empty stack).
- Configuration: Pair \((q, \gamma)\), \(q \in Q\), \(\gamma \in \Gamma^*\).
- Transition \(q \xrightarrow{a, \gamma_1 \to \gamma_2} r\): From \(q\), read \(a\) (or \(\epsilon\)), pop \(\gamma_1\) from top, push \(\gamma_2\); go to \(r\).
- Scanning: A transition scans \(a\) if it reads \(a\); it is called scanning if \(a \neq \epsilon\) and non-scanning otherwise.
- Moves: If configs \((q_1, \gamma \gamma_1)\), \((q_2, \gamma \gamma_2)\), and transition \(q_1 \xrightarrow{a, \gamma_1 \to \gamma_2} q_2\), write \((q_1, \gamma \gamma_1) \implies (q_2, \gamma \gamma_2)\).
- Run: A sequence \((q_0, \gamma_0), t_1, (q_1, \gamma_1), \dots, t_n, (q_n, \gamma_n)\) in which each configuration follows from the previous one via transition \(t_i\); a run is accepting if it starts in the initial configuration and ends in the final one.
- A run scans \(y = a_1 \cdots a_m\) if its scanning transitions read \(a_1, \dots, a_m\) in order.
- Sets: \(\Pi(\mathcal{P}, y) =\) accepting runs scanning \(y\); \(\Pi(\mathcal{P}) =\) all accepting runs.
- Recognition: PDA recognizes \(y\) if \(\Pi(\mathcal{P}, y) \neq \emptyset\).
- Language: \(L(\mathcal{P}) = \{y \mid \Pi(\mathcal{P}, y) \neq \emptyset\}\).
- Deterministic PDA: at most one applicable transition from any configuration for each input symbol, and if a non-scanning (\(\epsilon\)) transition applies in a configuration, no scanning transition may.
- Unlike in the finite-state case, non-deterministic PDAs are strictly more expressive than deterministic ones; non-deterministic PDAs recognize exactly the CFLs (Theorem 4.2.3; Sipser, 2013). A deterministic PDA for the Dyck language \(D(1)\) is sketched below.
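A sketch of a deterministic single-stack PDA for the Dyck language \(D(1)\): push on every opening bracket, pop on every closing one, and accept exactly when the run ends with an empty stack. Encoding the stack as a Python list is an illustrative choice.

```python
# A (deterministic) single-stack PDA for the Dyck language D(1), sketched.
def dyck1(y: str) -> bool:
    stack = []                         # stack as a list, top at the right
    for a in y:
        if a == "(":
            stack.append("X")          # transition: read '(', push X
        elif a == ")":
            if not stack:              # no applicable transition: reject
                return False
            stack.pop()                # transition: read ')', pop X
        else:
            return False
    return not stack                   # final configuration: empty stack

print([w for w in ["", "()", "(())()", "(()", ")("] if dyck1(w)])
# ['', '()', '(())()']
```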
Weighted Pushdown Automata
- Definition (WPDA): As PDA but \(\delta \subseteq Q \times \Gamma^* \times (\Sigma \cup \{\epsilon\}) \times Q \times \Gamma^* \times \mathbb{R}\).
- \(\delta\) is again a finite multiset of (now weighted) transitions.
- Run Weight: \(w(\pi) = \prod_{i=1}^n w_i\) over transition weights \(w_i\).
- Stringsum: \(\mathcal{P}(y) = \sum_{\pi \in \Pi(\mathcal{P}, y)} w(\pi)\).
- Recognizes \(y\) with weight \(\mathcal{P}(y)\).
- Weighted Language: \(L(\mathcal{P}) = \{(y, \mathcal{P}(y)) \mid y \in \Sigma^*\}\).
- Allsum: \(Z(\mathcal{P}) = \sum_{\pi \in \Pi(\mathcal{P})} w(\pi)\); normalizable if finite.
- Probabilistic PDA (PPDA): Weights \(\geq 0\); for each reachable configuration \((q, \gamma)\), the weights of the applicable transitions (those whose popped string matches the top of \(\gamma\)) sum to 1 (a sampling sketch follows this list).
- Theorem: CFL iff PDA-recognizable; PCFG-generated iff PPDA-recognizable (Abney et al., 1999).
- Theorem: Globally normalized WPDA can be locally normalized (to equivalent PPDA).
- Proof: Convert to WCFG, locally normalize to PCFG (Theorem 4.2.2), convert back to PPDA (Abney et al., 1999; Butoi et al., 2022).
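Because a PPDA's applicable transition weights sum to 1 at every configuration, sampling a run is straightforward. Here is a toy sketch of ancestral sampling from a PPDA whose weighted language is supported on \(\{a^n b^n \mid n \geq 0\}\); the automaton and its probabilities are made up for illustration.

```python
import random

# Ancestral sampling from a toy PPDA for {a^n b^n} (a sketch): sample one
# transition at a time, since the applicable weights sum to 1 per configuration.
def sample(p_push=0.4, seed=None):
    rng = random.Random(seed)
    stack, out = [], []
    # push phase: with prob p_push scan 'a' and push A, else switch to popping
    while rng.random() < p_push:
        stack.append("A")
        out.append("a")
    # pop phase: deterministically scan 'b' and pop until the stack is empty
    while stack:
        stack.pop()
        out.append("b")
    return "".join(out)                 # the run ends in the final configuration

print([sample(seed=s) for s in range(5)])   # strings of the form a^n b^n
```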
Pushdown Language Models
- Definition: Language model \(p_{LM}\) is pushdown if \(L(p_{LM}) = L(\mathcal{P})\) for some WPDA.
- Equivalent to context-free LM for single-stack (by PDA-CFG equivalence).
- Induced LM: \(p_{LM^{\mathcal{P}}}(y) = \mathcal{P}(y) / Z(\mathcal{P})\).
Multi-Stack Pushdown Automata
- Extend to multiple stacks; 2-stack suffices for max expressivity.
- Definition (2-PDA): \(\mathcal{P} = (\Sigma, Q, \Gamma_1, \Gamma_2, \delta, (q_\iota, \gamma_{\iota1}, \gamma_{\iota2}), (q_\varphi, \gamma_{\varphi1}, \gamma_{\varphi2}))\); \(\delta \subseteq Q \times \Gamma_1^* \times \Gamma_2^* \times (\Sigma \cup \{\epsilon\}) \times Q \times \Gamma_1^* \times \Gamma_2^*\).
- Configurations \((q, \gamma_1, \gamma_2)\); runs/acceptance analogous.
- 2-WPDA: Add \(\mathbb{R}\) to \(\delta\) for weights.
- Probabilistic 2-PDA (2-PPDA): Weights \(\geq 0\); for each configuration, the weights of the applicable transitions (those whose popped strings match the tops of the two stacks) sum to 1.
- Theorem: 2-PDA Turing complete (simulate TM tape with stacks; Theorem 8.13, Hopcroft et al., 2006).
- Thus, weighted/probabilistic 2-PDA Turing complete; recognize distributions over recursively enumerable languages.
- Theorem: Tightness of 2-PPDA undecidable.
- Proof: Tightness is equivalent to the generation process halting with probability 1 (the induced measure on \(\Sigma^* \cup \Sigma^\infty\) places all mass on finite strings). A 2-PPDA can simulate an arbitrary Turing machine, so deciding tightness would decide whether a TM halts with probability 1, a variant of the halting problem, which is undecidable.
- Example: the distribution placing Poisson-distributed mass on \(\{a^n \mid n \geq 0\}\) is not expressible by any WCFG, i.e., it is not a context-free LM (Icard, 2020).