2 - Probability recap

Author

Peter Nutter

Published

Sunday, April 21, 2024

Discrete vs Continuous Variables

  • In discrete distributions, the difference between \(\leq\) and \(<\) matters.
  • Cumulative Distribution Functions (CDFs) require an ordering of the values, so they are not defined for categorical random variables whose values form an unordered set (like “dog”, “cat”, “none”).

Joint and Marginal Distributions

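  • The expected value of \(g(X)\) is the integral over the domain (in the discrete case, a sum over the domain weighted by the probability mass function): \[ E(g(X)) = \int g(x) p(x) \, dx \]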
  • We assume the expected values are finite.

Variance and Standard Deviation

  • Variance: \[ \text{Var}(X) = E((X - \mu)^2) = E(X^2) - \mu^2 \] where \(\mu\) is fixed and not a part of the integral.
  • Standard deviation: \[ \sigma = \sqrt{\text{Var}(X)} \] It has the same units as \(X\).

Covariance and Correlation

  • Covariance: \[ \text{Cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y)) = E(XY) - \mu_X \mu_Y \] where \(\mu_X\) and \(\mu_Y\) are the marginal means.
  • If \(X = Y\), then \(\text{Cov}(X, X) = \text{Var}(X)\).
  • Correlation: \[ \rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \] with range \([-1, 1]\); the endpoints are attained when \(Y\) is an exact linear function of \(X\).
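The covariance and correlation formulas above can be checked numerically. The following is a minimal sketch using only the Python standard library; the data values and the helper names `mean` and `cov` are made up for illustration, and the data are treated as the whole distribution (divisor \(n\)).

```python
import math

# Hypothetical paired observations (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Covariance via E(AB) - E(A)E(B), treating the data as the full distribution."""
    return mean([u * v for u, v in zip(a, b)]) - mean(a) * mean(b)

var_x = cov(xs, xs)  # Cov(X, X) = Var(X)
var_y = cov(ys, ys)
rho = cov(xs, ys) / math.sqrt(var_x * var_y)

print(cov(xs, ys), var_x, var_y, rho)
```

Because the correlation divides by both standard deviations, rescaling either variable by a positive constant leaves `rho` unchanged.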

Bayes’ Rule

  • Product rule: \[ P(A \cap B) = P(A|B) P(B) = P(B|A) P(A) \] Dividing by \(P(B)\) gives Bayes’ Rule: \(P(A|B) = P(B|A) P(A) / P(B)\).
  • Conditional distribution: \[ f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)} \]

Independence

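  • \(X\) and \(Y\) are independent if and only if for all \(x\) and \(y\), \[ f_{X,Y}(x, y) = f_X(x) f_Y(y) \]
  • This is equivalent to the same factorization of the CDFs: \[ F_{X,Y}(x, y) = F_X(x) F_Y(y) \]
  • For more than two variables, the joint density factorizes into the product of all the marginals.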

Properties of Expectations

  • If \(X\) and \(Y\) are independent: \[ E(XY) = E(X) E(Y) \]
  • Expectation of conditional expectation: \[ E(X) = E(E(X|Y)) \]

Properties of Variance

  • For constants \(a\) and \(b\): \[ \text{Var}(aX + b) = a^2 \text{Var}(X) \]
  • Variance of the sum: \[ \text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \text{Cov}(X, Y) \]
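  • If \(X\) and \(Y\) are independent, \(\text{Cov}(X, Y) = 0\), so the variance of the sum is the sum of the variances. The converse does not hold in general: zero covariance does not imply independence.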
  • Law of total variance: \[ \text{Var}(X) = E(\text{Var}(X|Y)) + \text{Var}(E(X|Y)) \]
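The tower property \(E(X) = E(E(X|Y))\) and the law of total variance can be verified exactly on a small discrete example. This is a sketch under assumptions: the joint pmf `joint` and all helper names are hypothetical, chosen only so the sums are easy to follow.

```python
# Hypothetical joint pmf p(x, y); the probabilities sum to 1.
joint = {
    (0, 0): 0.10, (1, 0): 0.20, (2, 0): 0.10,
    (0, 1): 0.05, (1, 1): 0.15, (2, 1): 0.40,
}

# Marginal pmf of Y.
p_y = {}
for (x, y), p in joint.items():
    p_y[y] = p_y.get(y, 0.0) + p

def cond_mean(y):
    """E(X | Y = y)."""
    return sum(p * x for (x, yy), p in joint.items() if yy == y) / p_y[y]

def cond_var(y):
    """Var(X | Y = y)."""
    m = cond_mean(y)
    return sum(p * (x - m) ** 2 for (x, yy), p in joint.items() if yy == y) / p_y[y]

mean_x = sum(p * x for (x, _), p in joint.items())
var_x = sum(p * (x - mean_x) ** 2 for (x, _), p in joint.items())

tower = sum(p_y[y] * cond_mean(y) for y in p_y)                      # E(E(X|Y))
within = sum(p_y[y] * cond_var(y) for y in p_y)                      # E(Var(X|Y))
between = sum(p_y[y] * (cond_mean(y) - tower) ** 2 for y in p_y)     # Var(E(X|Y))

print(mean_x, tower)
print(var_x, within + between)
```

The two printed pairs agree up to floating-point rounding: the "within" part averages the spread inside each group defined by \(Y\), and the "between" part measures how much the group means themselves vary.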

Discrete Distributions

Bernoulli Distribution

  • Mean: \(p\)
  • Variance: \(p(1-p)\)
  • Domain: \(\{0, 1\}\)

Binomial Distribution

  • Probability mass function: \[ P(X = x) = \binom{n}{x} p^x (1-p)^{n-x} \]
  • Domain: \(x \in \{0, \ldots, n\}\)
  • \(X\) counts the number of successes in \(n\) independent Bernoulli trials with probability \(p\).
  • Mean: \(np\)
  • Variance: \(np(1-p)\)
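As a quick check of the binomial formulas, the sketch below evaluates the probability mass function with `math.comb` and recomputes the mean and variance by direct summation. The values of `n` and `p` are arbitrary illustrative choices; setting `n = 1` gives the Bernoulli case.

```python
from math import comb

def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3  # illustrative parameters
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                            # should be 1
mean = sum(x * q for x, q in enumerate(pmf))                # should be n * p
var = sum((x - mean) ** 2 * q for x, q in enumerate(pmf))   # should be n * p * (1 - p)

print(total, mean, var)
```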

Poisson Distribution

  • Probability mass function: \[ P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!} \]
  • Domain: \(x \in \{0, 1, 2, \ldots\}\)
  • \(X\) counts the number of events occurring in a fixed interval with mean \(\lambda\) and variance \(\lambda\).
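The claim that the Poisson mean and variance both equal \(\lambda\) can be checked by summing the probability mass function. This is a minimal sketch: the value of `lam` is arbitrary, and the infinite sum is truncated at 100 terms, which is an approximation that is harmless for small \(\lambda\).

```python
from math import exp, factorial

def pois_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 4.0  # illustrative rate
support = range(100)  # truncation of {0, 1, 2, ...}; the tail mass is negligible here

total = sum(pois_pmf(x, lam) for x in support)
mean = sum(x * pois_pmf(x, lam) for x in support)
var = sum((x - mean) ** 2 * pois_pmf(x, lam) for x in support)

print(total, mean, var)
```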

Continuous Distributions

Normal Distribution

  • Probability density function: \[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]
  • Domain: \(\mathbb{R}\)
  • Linear transformation of a normal variable is normal.
  • Standard normal variable: \[ Z = \frac{X - \mu}{\sigma} \sim N(0, 1) \]
  • Sum of \(n\) independent \(N(\mu, \sigma^2)\) variables: \[ \sum_{i=1}^n X_i \sim N\left(n\mu, n\sigma^2\right) \]
  • Distribution of the sample mean of \(n\) i.i.d. \(N(\mu, \sigma^2)\) variables: \[ \bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \]
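The standardization and the sum and sample-mean results can be illustrated with `statistics.NormalDist` from the Python standard library, which treats added instances as independent. The parameter values below are arbitrary; this is a sketch, not a derivation.

```python
from statistics import NormalDist

mu, sigma, n = 10.0, 2.0, 4  # illustrative parameters
X = NormalDist(mu, sigma)
Z = NormalDist()  # standard normal N(0, 1)

# Standardization: P(X <= x) equals P(Z <= (x - mu) / sigma).
x = 13.0
p_x = X.cdf(x)
p_z = Z.cdf((x - mu) / sigma)

# Sum of n independent N(mu, sigma^2) variables.
total = NormalDist(mu, sigma)
for _ in range(n - 1):
    total = total + NormalDist(mu, sigma)

# Dividing by the constant n gives the distribution of the sample mean.
sample_mean = total / n

print(p_x, p_z)
print(total.mean, total.variance)              # n * mu and n * sigma^2
print(sample_mean.mean, sample_mean.variance)  # mu and sigma^2 / n
```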

Gamma Distribution

  • Probability density function with shape \(\alpha\) and rate \(\beta\): \[ f(x; \alpha, \beta) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)} \]
  • Probability density function with shape \(\alpha\) and scale \(\theta\): \[ f(x; \alpha, \theta) = \frac{x^{\alpha-1} e^{-x/\theta}}{\theta^\alpha \Gamma(\alpha)} \]
  • Domain: \(x \geq 0\)
  • The exponential distribution is a gamma distribution with shape \(\alpha = 1\).

Exponential Distribution

  • Probability density function with rate \(\lambda\): \[ f(x; \lambda) = \lambda e^{-\lambda x} \]
  • Domain: \(x \geq 0\)

Chi-Squared Distribution

  • A chi-squared distribution with \(k\) degrees of freedom is a gamma distribution with shape \(k/2\) and rate \(1/2\).
  • Sum of independent chi-squared variables is chi-squared with the sum of the degrees of freedom.
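The relationships between the gamma, exponential, and chi-squared distributions can be checked by comparing densities pointwise. In this sketch the function names are hypothetical, and the closed-form chi-squared density in `chi2_pdf` is the standard textbook formula rather than something stated in these notes.

```python
from math import exp, gamma

def gamma_pdf(x, shape, rate):
    """Gamma density in the shape/rate parameterization."""
    return rate**shape * x ** (shape - 1) * exp(-rate * x) / gamma(shape)

def gamma_pdf_scale(x, shape, scale):
    """Gamma density in the shape/scale parameterization (scale = 1 / rate)."""
    return x ** (shape - 1) * exp(-x / scale) / (scale**shape * gamma(shape))

def exp_pdf(x, lam):
    """Exponential density with rate lam."""
    return lam * exp(-lam * x)

def chi2_pdf(x, k):
    """Chi-squared density with k degrees of freedom (standard closed form)."""
    return x ** (k / 2 - 1) * exp(-x / 2) / (2 ** (k / 2) * gamma(k / 2))

for x in (0.5, 1.0, 2.5):
    print(
        x,
        gamma_pdf(x, 1.0, 1.5) - exp_pdf(x, 1.5),           # shape 1 gives the exponential
        gamma_pdf(x, 3 / 2, 0.5) - chi2_pdf(x, 3),          # shape k/2, rate 1/2 gives chi-squared
        gamma_pdf(x, 2.0, 0.5) - gamma_pdf_scale(x, 2.0, 2.0),  # rate and scale forms agree
    )
```

All printed differences should be zero up to floating-point rounding.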

t-Distribution

  • \(t\)-distribution with \(v\) degrees of freedom: \[ T = \frac{Z}{\sqrt{V/v}} \] where \(Z \sim N(0, 1)\) and \(V \sim \chi^2(v)\) are independent.
  • The \(t\)-distribution has heavier tails than the normal distribution for small degrees of freedom and converges to the normal distribution as \(v\) increases.

Log-Normal Distribution

  • If \(\log(X) \sim N(\mu, \sigma^2)\), then \(X\) follows a log-normal distribution.
  • Probability density function: \[ f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} e^{-\frac{(\log x - \mu)^2}{2\sigma^2}} \]

Beta Distribution

  • Probability density function with parameters \(\alpha\) and \(\beta\): \[ f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha, \beta)} \] where \(B(\alpha, \beta)\) is the beta function.
  • The same density in alternative notation, with \(\text{shape1} = \alpha\) and \(\text{shape2} = \beta\): \[ f(x; \text{shape1}, \text{shape2}) = \frac{x^{\text{shape1}-1} (1-x)^{\text{shape2}-1}}{B(\text{shape1}, \text{shape2})} \]
  • Domain: \(0 < x < 1\)
  • For \(\alpha = 1\) and \(\beta = 1\), the beta distribution is the uniform distribution on \([0, 1]\).

Multivariate Normal Distribution

Definition

  • The multivariate normal distribution in 2 dimensions is given by: \[ \mathbf{X} \sim N(\mathbf{\mu}, \mathbf{\Sigma}) \] where \[ \mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mathbf{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \mathbf{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix} \]

Covariance Matrix

  • The covariance matrix \(\mathbf{\Sigma}\) looks like: \[ \mathbf{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix} \]
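    • For a multivariate normal, if the covariance matrix is diagonal (\(\sigma_{12} = 0\)), then the components are independent and the joint density separates into a product of the marginal densities.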

Properties

  • The covariance matrix is always symmetric and positive semi-definite.
  • Linear transformations of a multivariate normal variable are still multivariate normal: \[ \mathbf{Y} = \mathbf{C} \mathbf{X} + \mathbf{b} \sim N(\mathbf{C} \mathbf{\mu} + \mathbf{b}, \mathbf{C} \mathbf{\Sigma} \mathbf{C}^T) \]
  • The dimension can change under linear transformations: if \(\mathbf{C}\) is an \(m \times 2\) matrix, then \(\mathbf{Y}\) is \(m\)-dimensional.
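To make the linear-transformation rule concrete, the sketch below computes \(\mathbf{C}\mathbf{\mu} + \mathbf{b}\) and \(\mathbf{C}\mathbf{\Sigma}\mathbf{C}^T\) with plain Python lists. All numbers and the helpers `matmul` and `transpose` are illustrative assumptions. The chosen \(\mathbf{C}\) is a \(1 \times 2\) matrix, so the 2-dimensional \(\mathbf{X}\) is mapped to the 1-dimensional \(Y = X_1 + X_2 + 3\).

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

# Illustrative parameters of a 2-dimensional normal.
mu = [[1.0], [2.0]]
Sigma = [[2.0, 0.5],
         [0.5, 1.0]]

# Y = C X + b with a 1x2 matrix C, i.e. Y = X1 + X2 + 3.
C = [[1.0, 1.0]]
b = [[3.0]]

mean_Y = [[m + s for m, s in zip(row_m, row_b)]
          for row_m, row_b in zip(matmul(C, mu), b)]
cov_Y = matmul(matmul(C, Sigma), transpose(C))

print(mean_Y)  # [[6.0]]
print(cov_Y)   # [[4.0]]
```

The resulting variance 4.0 matches \(\text{Var}(X_1) + \text{Var}(X_2) + 2\,\text{Cov}(X_1, X_2) = 2 + 1 + 2 \cdot 0.5\), and the shift \(\mathbf{b}\) affects only the mean.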

Limit Theorems

  • Let \(X_1, X_2, \ldots, X_n\) be i.i.d. with finite mean \(\mu\), that is \(E|X_i| < \infty\). Then the sample mean converges to \(\mu\): \[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \to \mu \text{ almost surely (strong law of large numbers)} \]
  • For binary \(X_i\) with \(P(X_i = 1) = p\), \(\bar{X}_n\) reduces to the sample proportion \(\hat{p}\). For large \(n\), by the central limit theorem, approximately: \[ \hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right) \]
  • Standard error of the sample mean: \[ \frac{\sigma}{\sqrt{n}} \]
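A small simulation can illustrate both limit statements for binary data. This is a sketch under assumptions: the seed, the success probability, the sample sizes, and the helper `sample_proportion` are arbitrary choices, and the output of a simulation only approximates the theoretical values.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible

p, n, reps = 0.3, 400, 2000  # illustrative settings

def sample_proportion(n, p):
    """Proportion of successes in n simulated Bernoulli(p) trials."""
    return sum(random.random() < p for _ in range(n)) / n

# Law of large numbers: one large sample gives a proportion close to p.
big_sample = sample_proportion(100_000, p)

# Central limit theorem: the spread of many sample proportions is close to
# the standard error sqrt(p * (1 - p) / n).
p_hats = [sample_proportion(n, p) for _ in range(reps)]
theory_se = math.sqrt(p * (1 - p) / n)
empirical_se = statistics.pstdev(p_hats)

print(big_sample)
print(theory_se, empirical_se)
```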

Statistics

  • Uppercase \(X\) represents a random variable.
  • Lowercase \(x\) represents observed data (a realization of \(X\)).
  • A statistic is a function \(T_n(X_1, X_2, \ldots, X_n)\).
  • The empirical distribution function (ecdf) is: \[ T_n = F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \leq x) \] which estimates \(F(x) = P(X \leq x)\).
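  • Standard error of \(F_n(x)\): \[ \sqrt{\frac{F(x)(1-F(x))}{n}} \text{ with upper bound } \frac{0.5}{\sqrt{n}} \] since \(F(x)(1-F(x)) \leq 1/4\).
  • A quantile is the inverse of the CDF: \[ x_q = F^{-1}(q) \]
    • The inverse need not be unique (for example where \(F\) is flat) and need not exist (where \(F\) jumps), so \(x_q\) is defined as the smallest candidate:
    • \(x_q = \min \{x: F(x) \geq q\}\)
    • \(q \in [0, 1]\) is a probability, while \(x_q\) lies in the domain of \(X\).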
  • Standard error for quantiles: \[ \text{se}(x_q) = \sqrt{\frac{q(1-q)}{n f^2(x_q)}} \] where \(f(x_q)\) is the density at \(x_q\).
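The empirical distribution function and the quantile definition \(x_q = \min \{x: F(x) \geq q\}\) translate directly into code. The sketch below applies that definition to \(F_n\); the data values and the function names `ecdf` and `quantile` are made up for illustration.

```python
import math

def ecdf(sample, x):
    """F_n(x): fraction of observations less than or equal to x."""
    return sum(xi <= x for xi in sample) / len(sample)

def quantile(sample, q):
    """Smallest observed x with F_n(x) >= q, for 0 < q <= 1."""
    for x in sorted(sample):
        if ecdf(sample, x) >= q:
            return x
    raise ValueError("q must be at most 1")

data = [3.1, 1.2, 4.8, 2.5, 3.9, 2.5, 5.0, 1.9]  # hypothetical observations

print(ecdf(data, 2.5))       # 0.5, four of the eight values are <= 2.5
print(quantile(data, 0.25))  # 1.9
print(quantile(data, 0.5))   # 2.5
print(quantile(data, 0.9))   # 5.0

# Upper bound on the standard error of F_n(x), valid for every x.
se_bound = 0.5 / math.sqrt(len(data))
print(se_bound)
```

Because \(F_n\) is a step function, many values of \(q\) map to the same observation, which is why the minimum in the definition is needed.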

Bias

  • A statistic \(T_n\) is an unbiased estimator of \(\theta\) if: \[ E(T_n) = \theta \]
  • It is asymptotically unbiased if: \[ \lim_{n \to \infty} E(T_n) = \theta \]
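  • The sample variance with divisor \(n - 1\) is an unbiased estimator of \(\sigma^2\), but the version with divisor \(n\) is biased (it is only asymptotically unbiased).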
  • Bias: \[ \text{Bias} = E(T_n) - \theta \]
  • Mean Squared Error (MSE): \[ \text{MSE} = E((T_n - \theta)^2) = \text{variance} + \text{bias}^2 \]
  • Standard error of an estimated proportion: \[ \text{se}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]
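The bias of the two variance estimators can be computed exactly by enumerating every possible sample from a tiny distribution. This sketch assumes a hypothetical distribution, uniform on \(\{0, 1, 2\}\), with samples of size 3, and uses `fractions.Fraction` so the expectations are exact rather than rounded.

```python
from fractions import Fraction
from itertools import product

values = [Fraction(0), Fraction(1), Fraction(2)]  # uniform on {0, 1, 2} (illustrative)
n = 3

true_mean = sum(values) / len(values)
true_var = sum((v - true_mean) ** 2 for v in values) / len(values)  # 2/3

def var_divisor_n(sample):
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def var_divisor_n_minus_1(sample):
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / (len(sample) - 1)

# All 27 samples are equally likely, so the expectation is a plain average.
samples = list(product(values, repeat=n))
expected_n = sum(var_divisor_n(s) for s in samples) / len(samples)
expected_n_minus_1 = sum(var_divisor_n_minus_1(s) for s in samples) / len(samples)

print(true_var)                  # 2/3
print(expected_n_minus_1)        # 2/3, unbiased
print(expected_n)                # 4/9, biased
print(expected_n - true_var)     # bias = -2/9
```

The divisor-\(n\) estimator has expectation \(\frac{n-1}{n}\sigma^2\), so its bias shrinks to zero as \(n\) grows, which is what asymptotically unbiased means.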

Maximum Likelihood Estimation (MLE)

  • The likelihood is the probability (or density) of the observations viewed as a function of the parameter \(\theta\):
    • If \(X_1, X_2, \ldots, X_n\) are i.i.d.: \[ L(\theta) = f(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta) \]
  • The MLE is the \(\theta\) that maximizes this.
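  • In practice one maximizes the log-likelihood \(\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i; \theta)\), which has the same maximizer and turns the product into a sum.
  • MLEs are invariant, meaning if \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\).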
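As an illustration of maximizing a log-likelihood and of invariance, the sketch below fits a Bernoulli success probability by a simple grid search. The data, the grid resolution, and the function names are assumptions made for the example; a grid search is used only because it is easy to read, not because it is how an MLE is normally computed.

```python
import math

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical: 7 successes in 10 trials

def log_likelihood(p, xs):
    """Bernoulli log-likelihood: sum of log f(x_i; p)."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in xs)

# Grid search over p in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, data))

# Invariance: reparameterize by the odds g(p) = p / (1 - p) and maximize again.
def log_likelihood_odds(odds, xs):
    return log_likelihood(odds / (1 + odds), xs)

odds_grid = [p / (1 - p) for p in grid]
odds_hat = max(odds_grid, key=lambda o: log_likelihood_odds(o, data))

print(p_hat, sum(data) / len(data))  # both 0.7
print(odds_hat, p_hat / (1 - p_hat))
```

The maximizer of the odds parameterization is the odds of the maximizer, so nothing has to be refitted when the parameter of interest is a function of \(\theta\).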

Bayes’ Theorem

  • Law of total probability:
    • If events \(A_1, A_2, \ldots, A_n\) are disjoint and cover the sample space, then: \[ P(B) = \sum_{i=1}^{n} P(A_i)P(B|A_i) \]
  • Bayes’ theorem: \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
  • For continuous random variables: \[ f(y|x) = \frac{f(x|y)f(y)}{f(x)} = \frac{f(x|y)f(y)}{\int f(x|y)f(y) \, dy} \]

Frequentist vs Bayesian

  • Frequentists think about \(\theta\) as fixed and data as random.
  • Bayesians think about \(\theta\) as random and data as fixed.
  • Frequentists maximize likelihood.
  • Bayesians specify a prior distribution \(f(\theta)\) for the parameter and work with the posterior, for example by maximizing it (the maximum a posteriori, or MAP, estimate): \[ f(\theta|x) = \frac{f(x|\theta)f(\theta)}{f(x)} = \frac{L(\theta)f(\theta)}{\int L(\theta)f(\theta) \, d\theta} \]
  • The denominator does not depend on \(\theta\), so it can be ignored when maximizing.
  • Posterior \(\propto\) Likelihood \(\times\) Prior
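The relation between posterior, likelihood, and prior can be sketched on a grid. The example assumes binomial data (7 successes in 10 trials) and an unnormalized Beta(2, 2) prior; both are illustrative choices, and the integral in the denominator is approximated by a Riemann sum.

```python
step = 0.001
grid = [i * step for i in range(1, 1000)]  # candidate values of theta in (0, 1)

def likelihood(theta, successes=7, n=10):
    """Binomial likelihood up to a constant that does not depend on theta."""
    return theta**successes * (1 - theta) ** (n - successes)

def prior(theta, a=2, b=2):
    """Unnormalized Beta(a, b) prior density."""
    return theta ** (a - 1) * (1 - theta) ** (b - 1)

# Posterior is proportional to likelihood times prior.
unnormalized = [likelihood(t) * prior(t) for t in grid]

# The denominator is one number; dividing by it does not move the maximum.
evidence = sum(unnormalized) * step
posterior = [u / evidence for u in unnormalized]

mle = max(grid, key=likelihood)
map_estimate = grid[max(range(len(grid)), key=unnormalized.__getitem__)]
map_from_normalized = grid[max(range(len(grid)), key=posterior.__getitem__)]

print(mle)           # maximizer of the likelihood alone
print(map_estimate)  # pulled toward the prior's center at 0.5
```

Here the product is proportional to \(\theta^8 (1-\theta)^4\), whose maximum is at \(8/12\), so the MAP estimate sits between the MLE and the prior mode.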

Markov Chains

  • A Markov Chain is a stochastic process \(X_t\) starting at \(X_0\) and making successive transitions.
  • States are \(0, 1, 2, \ldots\). The process is a Markov Chain if: \[ P(X_{t+1} = j | X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j | X_t = i) \] for all states \(i, j\). Transition only depends on the current state.
  • A Markov Chain is irreducible if all states are accessible from all other states.
    • Given a state, there is a non-zero probability to go anywhere in finite time.
  • It is homogeneous if the transition probabilities do not depend on the time \(t\), so a single transition matrix describes every step.
  • A state is recurrent if the chain returns to it with probability 1 and transient otherwise.
    • If the expected time until the chain returns to the state is finite, then the state is called non-null or positive recurrent.
  • A state is aperiodic if returns to it are not restricted to multiples of some fixed number of steps greater than 1.
  • Ergodic if it is aperiodic and positive recurrent (returns in finite expected time).
  • Stationary distribution:
    • For an irreducible, ergodic Markov Chain, the \(n\)-step transition probabilities converge to a stationary distribution \(\pi\) independent of the initial state \(i\): \[ \pi_j = \lim_{n \to \infty} P_{ij}^{(n)} \]
    • \(\pi_j\) is the limiting probability of being in state \(j\).
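The convergence to a stationary distribution can be seen by raising a transition matrix to a high power. The two-state matrix `P` below is a hypothetical example of an irreducible, aperiodic chain, and the helpers `matmul` and `mat_power` are written out by hand to keep the sketch dependency-free.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_power(P, n):
    """n-step transition matrix P^n for n >= 1."""
    result = P
    for _ in range(n - 1):
        result = matmul(result, P)
    return result

# Hypothetical homogeneous two-state chain; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]

P50 = mat_power(P, 50)
pi = P50[0]  # every row of P^n approaches the same vector

print(P50)             # both rows are close to (5/6, 1/6)
print(matmul([pi], P)) # pi P = pi, so pi is stationary
```

Both rows of the 50-step matrix agree, which is the statement that the limit does not depend on the initial state. For this matrix the balance equation \(0.1\,\pi_0 = 0.5\,\pi_1\) gives \(\pi = (5/6, 1/6)\).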