Discrete vs Continuous Variables
- In discrete distributions, the difference between \(\leq\) and \(<\) matters.
- Cumulative Distribution Functions (CDFs) are not defined for categorical random variables with unordered outcomes (like “dog”, “cat”, “none”), since the values cannot be ordered.
Joint and Marginal Distributions
- The expected value of \(g(X)\) is the integral over the domain (a sum for discrete variables): \[
E(g(X)) = \int g(x) p(x) \, dx
\]
- We assume the expected values are finite.
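As a quick numerical check (a minimal Python sketch; the choice \(g(x) = x^2\) and the standard normal density are arbitrary examples, not from the notes):

```python
import numpy as np
from scipy import integrate, stats

# E(g(X)) = integral of g(x) p(x) dx; here g(x) = x**2 and X ~ N(0, 1),
# so the exact value is E(X^2) = Var(X) + mu^2 = 1.
g = lambda x: x**2
p = stats.norm(loc=0, scale=1).pdf

value, _ = integrate.quad(lambda x: g(x) * p(x), -np.inf, np.inf)
print(value)  # approximately 1.0
```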
Variance and Standard Deviation
- Variance: \[
\text{Var}(X) = E((X - \mu)^2) = E(X^2) - \mu^2
\] where \(\mu = E(X)\) is a constant, not part of the integration variable.
- Standard deviation: \[
\sigma = \sqrt{\text{Var}(X)}
\] It has the same units as \(X\).
Covariance and Correlation
- Covariance: \[
\text{Cov}(X, Y) = E((X - \mu_X)(Y - \mu_Y)) = E(XY) - \mu_X \mu_Y
\] where \(\mu_X\) and \(\mu_Y\) are the marginal means.
- If \(X = Y\), then \(\text{Cov}(X, X) = \text{Var}(X)\).
- Correlation: \[
\rho_{X,Y} = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}
\] with range \([-1, 1]\).
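A short simulation checking the covariance identity and the correlation bound (a sketch; the linear relation between x and y is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)     # correlated with x by construction

# Cov(X, Y) = E(XY) - mu_X * mu_Y (sample analogue)
cov_direct = np.mean(x * y) - x.mean() * y.mean()
cov_np = np.cov(x, y, ddof=0)[0, 1]        # same quantity via numpy
corr = np.corrcoef(x, y)[0, 1]             # always in [-1, 1]

print(cov_direct, cov_np, corr)
```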
Bayes’ Rule
- Bayes’ rule follows from the product rule: \[
P(A \cap B) = P(A|B) P(B) = P(B|A) P(A)
\]
- Conditional distribution: \[
f_{X|Y=y}(x) = \frac{f_{X,Y}(x, y)}{f_Y(y)}
\]
Independence
- \(X\) and \(Y\) are independent if and only if for all \(x\) and \(y\), \[
f_{X,Y}(x, y) = f_X(x) f_Y(y)
\]
- This is equivalent to: \[
F_{X,Y}(x, y) = F_X(x) F_Y(y)
\]
- In higher dimensions, independence is defined by the same factorization: the joint density (or CDF) is the product of all the marginals.
Properties of Expectations
- If \(X\) and \(Y\) are independent: \[
E(XY) = E(X) E(Y)
\]
- Law of total expectation (expectation of the conditional expectation): \[
E(X) = E(E(X|Y))
\]
Properties of Variance
- For constants \(a\) and \(b\): \[
\text{Var}(aX + b) = a^2 \text{Var}(X)
\]
- Variance of the sum: \[
\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2 \text{Cov}(X, Y)
\]
- If \(X\) and \(Y\) are independent, then \(\text{Cov}(X, Y) = 0\); the converse does not hold in general.
- Law of total variance: \[
\text{Var}(X) = E(\text{Var}(X|Y)) + \text{Var}(E(X|Y))
\]
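Both decompositions can be checked by simulation on a simple hierarchical model (a sketch; the two-component normal mixture below is chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Hierarchical model: Y ~ Bernoulli(0.3); X | Y=0 ~ N(0, 1), X | Y=1 ~ N(5, 4)
y = rng.binomial(1, 0.3, size=n)
x = np.where(y == 0, rng.normal(0, 1, n), rng.normal(5, 2, n))

# E(X) = E(E(X|Y)) = 0.7*0 + 0.3*5 = 1.5
# Var(X) = E(Var(X|Y)) + Var(E(X|Y)) = (0.7*1 + 0.3*4) + 0.3*0.7*25 = 7.15
print(x.mean(), x.var())   # approximately 1.5 and 7.15
```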
Discrete Distributions
Bernoulli Distribution
- Mean: \(p\)
- Variance: \(p(1-p)\)
- Domain: \(\{0, 1\}\)
Binomial Distribution
- Probability mass function: \[
P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}
\]
- Domain: \(x \in \{0, \ldots, n\}\)
- \(X\) counts the number of successes in \(n\) independent Bernoulli trials with probability \(p\).
- Mean: \(np\)
- Variance: \(np(1-p)\)
Poisson Distribution
- Probability mass function: \[
P(X = x) = \frac{e^{-\lambda} \lambda^x}{x!}
\]
- Domain: \(x \in \{0, 1, 2, \ldots\}\)
- \(X\) counts the number of events occurring in a fixed interval; both the mean and the variance equal \(\lambda\).
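These moment formulas can be verified directly with scipy.stats (a minimal sketch; the parameter values are arbitrary):

```python
from scipy import stats

n, p, lam = 10, 0.3, 4.0

binom = stats.binom(n, p)
pois = stats.poisson(lam)

print(binom.mean(), binom.var())   # 3.0, 2.1   (np and np(1-p))
print(pois.mean(), pois.var())     # 4.0, 4.0   (both equal lambda)
print(binom.pmf(2), pois.pmf(2))   # P(X = 2) under each model
```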
Continuous Distributions
Normal Distribution
- Probability density function: \[
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}
\]
- Domain: \(\mathbb{R}\)
- A linear transformation of a normal variable is again normal: \(aX + b \sim N(a\mu + b, a^2\sigma^2)\).
- Standard normal variable: \[
Z = \frac{X - \mu}{\sigma} \sim N(0, 1)
\]
- Sum of \(n\) i.i.d. \(N(\mu, \sigma^2)\) variables: \[
\sum_{i=1}^n X_i \sim N\left(n\mu, n\sigma^2\right)
\]
- Distribution of the sample mean of \(n\) i.i.d. normal variables: \[
\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)
\]
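A short simulation illustrating standardization and the distribution of the sample mean (a sketch; \(\mu\), \(\sigma\), and \(n\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 10.0, 2.0, 25

# 100000 samples of size n from N(mu, sigma^2)
samples = rng.normal(mu, sigma, size=(100_000, n))

z = (samples - mu) / sigma          # standardized values ~ N(0, 1)
xbar = samples.mean(axis=1)         # sample means ~ N(mu, sigma^2 / n)

print(z.mean(), z.std())            # approximately 0 and 1
print(xbar.mean(), xbar.var())      # approximately 10 and 4/25 = 0.16
```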
Gamma Distribution
- Probability density function with shape \(\alpha\) and rate \(\beta\): \[
f(x; \alpha, \beta) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}
\]
- Probability density function with shape \(\alpha\) and scale \(\theta\): \[
f(x; \alpha, \theta) = \frac{x^{\alpha-1} e^{-x/\theta}}{\theta^\alpha \Gamma(\alpha)}
\]
- Domain: \(x \geq 0\)
- The exponential distribution is a gamma distribution with shape \(\alpha = 1\).
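scipy.stats.gamma uses the shape/scale parameterization, so a rate \(\beta\) has to be converted to the scale \(\theta = 1/\beta\) (a sketch; \(\alpha\) and \(\beta\) are arbitrary values):

```python
from scipy import stats

alpha, beta = 3.0, 2.0          # shape and rate
theta = 1.0 / beta              # corresponding scale

g = stats.gamma(a=alpha, scale=theta)
print(g.mean(), g.var())        # alpha/beta = 1.5 and alpha/beta^2 = 0.75

# The exponential distribution is the alpha = 1 special case.
e = stats.gamma(a=1.0, scale=1.0 / beta)
print(e.pdf(0.5), stats.expon(scale=1.0 / beta).pdf(0.5))  # identical densities
```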
Exponential Distribution
- Probability density function with rate \(\lambda\): \[
f(x; \lambda) = \lambda e^{-\lambda x}
\]
- Domain: \(x \geq 0\)
Chi-Squared Distribution
- A chi-squared distribution with \(k\) degrees of freedom is a gamma distribution with shape \(k/2\) and rate \(1/2\).
- Sum of independent chi-squared variables is chi-squared with the sum of the degrees of freedom.
t-Distribution
- \(t\)-distribution with \(v\) degrees of freedom: \[
T = \frac{Z}{\sqrt{V/v}}
\] where \(Z \sim N(0, 1)\) and \(V \sim \chi^2(v)\) are independent.
- The \(t\)-distribution has heavier tails than the normal distribution for small degrees of freedom and converges to the normal distribution as \(v\) increases.
Log-Normal Distribution
- If \(\log(X) \sim N(\mu, \sigma^2)\), then \(X\) follows a log-normal distribution.
- Probability density function: \[
f(x; \mu, \sigma) = \frac{1}{x \sigma \sqrt{2\pi}} e^{-\frac{(\log x - \mu)^2}{2\sigma^2}}
\]
Beta Distribution
- Probability density function with parameters \(\alpha\) and \(\beta\): \[
f(x; \alpha, \beta) = \frac{x^{\alpha-1} (1-x)^{\beta-1}}{B(\alpha, \beta)}
\] where \(B(\alpha, \beta)\) is the beta function.
- Alternative parameterization using shape parameters: \[
f(x; \text{shape1}, \text{shape2}) = \frac{x^{\text{shape1}-1} (1-x)^{\text{shape2}-1}}{B(\text{shape1}, \text{shape2})}
\]
- Domain: \(0 < x < 1\)
- For \(\alpha = 1\) and \(\beta = 1\), the beta distribution is the uniform distribution on \([0, 1]\).
Multivariate Normal Distribution
Definition
- The multivariate normal distribution in 2 dimensions is given by: \[
\mathbf{X} \sim N(\mathbf{\mu}, \mathbf{\Sigma})
\] where \[
\mathbf{X} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \quad \mathbf{\mu} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \quad \mathbf{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}
\]
Covariance Matrix
- The covariance matrix \(\mathbf{\Sigma}\) looks like: \[
\mathbf{\Sigma} = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}
\]
- If the covariance matrix is diagonal, the components are independent and the joint density separates into a product of univariate normal densities.
Properties
- The covariance matrix is always symmetric and positive semi-definite.
- Linear transformations of a multivariate normal variable are still multivariate normal: \[
\mathbf{Y} = \mathbf{C} \mathbf{X} + \mathbf{b} \sim N(\mathbf{C} \mathbf{\mu} + \mathbf{b}, \mathbf{C} \mathbf{\Sigma} \mathbf{C}^T)
\]
- The dimension can change under linear transformations.
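The transformation rule can be checked numerically (a sketch with an arbitrary 2-dimensional example; \(\mathbf{C}\) maps to one dimension to show that the dimension can change):

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

C = np.array([[1.0, -1.0]])      # 1x2 matrix: maps R^2 -> R^1
b = np.array([3.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ C.T + b                  # Y = C X + b, a single column

print(y.mean(), y.var())                 # sample mean and variance of Y
print(C @ mu + b, C @ Sigma @ C.T)       # theoretical values: 2.0 and 2.0
```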
Limit Theorems
- Let \(X_1, X_2, \ldots, X_n\) be i.i.d. with finite mean \(\mu\) and variance \(\sigma^2\) (in particular \(E|X_i| < \infty\)). Then the sample mean \(\bar{X}_n\) converges to \(\mu\) almost surely (strong law of large numbers).
- For binary \(X_i\), \(\bar{X}_n\) is the sample proportion \(\hat{p}\). By the central limit theorem, for large \(n\), approximately \[
\hat{p} \sim N\left(p, \frac{p(1-p)}{n}\right)
\]
- Standard error of the sample mean: \[
\frac{\sigma}{\sqrt{n}}
\]
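A simulation of the sample proportion illustrates both results (a sketch; \(p\) and \(n\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 0.3, 1_000

# 50000 replications of n Bernoulli(p) trials
x = rng.binomial(1, p, size=(50_000, n))
p_hat = x.mean(axis=1)

print(p_hat.mean())                # close to p (law of large numbers)
print(p_hat.std())                 # close to the standard error (CLT)
print(np.sqrt(p * (1 - p) / n))    # sqrt(p(1-p)/n) ~ 0.0145
```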
Statistics
- Big \(X\) represents a random variable.
- Small \(x\) represents observed data.
- A statistic is a function \(T_n(X_1, X_2, \ldots, X_n)\).
- The empirical distribution function (ecdf) is: \[
T_n = F_n(x) = \frac{1}{n} \sum_{i=1}^{n} I(X_i \leq x)
\] which estimates \(F(x) = P(X \leq x)\).
- Standard error of the ecdf: \[
\sqrt{\frac{F(x)(1-F(x))}{n}} \text{ with upper bound } \frac{0.5}{\sqrt{n}}
\]
- A quantile is the inverse of the CDF: \[
x_q = F^{-1}(q)
\]
- In general \(x_q\) is not unique; the convention \(x_q = \min \{x: F(x) \geq q\}\) makes it well defined.
- \(q \in [0, 1]\), while \(x_q\) lies in the domain of \(X\).
- Standard error for quantiles: \[
\text{se}(x_q) = \sqrt{\frac{q(1-q)}{n f^2(x_q)}}
\] where \(f(x_q)\) is the density at \(x_q\).
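The ecdf and the quantile convention above are easy to compute directly (a minimal sketch using simulated standard normal data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=1_000))
n = len(x)

def ecdf(t):
    """F_n(t) = (1/n) * number of observations <= t."""
    return np.searchsorted(x, t, side="right") / n

def quantile(q):
    """x_q = min{ x : F_n(x) >= q } for q in (0, 1]."""
    return x[int(np.ceil(q * n)) - 1]

print(ecdf(0.0))        # close to F(0) = 0.5 for standard normal data
print(quantile(0.5))    # close to the true median 0
```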
Bias
- A statistic \(T_n\) is an unbiased estimator of \(\theta\) if: \[
E(T_n) = \theta
\]
- It is asymptotically unbiased if: \[
\lim_{n \to \infty} E(T_n) = \theta
\]
- The sample variance with divisor \(n-1\) is an unbiased estimator of \(\sigma^2\); the plug-in estimator with divisor \(n\) (the MLE) is biased.
- Bias: \[
\text{Bias} = E(T_n) - \theta
\]
- Mean Squared Error (MSE): \[
\text{MSE} = E((T_n - \theta)^2) = \text{Var}(T_n) + \text{Bias}^2
\]
- Standard error of an estimated proportion: \[
\text{se}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}
\]
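The bias of the plug-in variance estimator versus the \(n-1\) version, and the MSE decomposition, can be seen by simulation (a sketch; the true \(\sigma^2 = 4\) and \(n = 10\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, n = 4.0, 10

samples = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))

s2_unbiased = samples.var(axis=1, ddof=1)   # divides by n - 1
s2_plugin = samples.var(axis=1, ddof=0)     # divides by n (the MLE)

print(s2_unbiased.mean())   # approximately 4.0 (unbiased)
print(s2_plugin.mean())     # approximately (n-1)/n * 4 = 3.6 (biased downward)

# MSE = variance + bias^2 for the plug-in estimator
bias = s2_plugin.mean() - sigma2
print(s2_plugin.var() + bias**2, ((s2_plugin - sigma2) ** 2).mean())
```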
Maximum Likelihood Estimation (MLE)
- The likelihood is the probability (or density) of the observations, viewed as a function of the parameter:
- If \(X_1, X_2, \ldots, X_n\) are i.i.d.: \[
L(\theta) = f(x_1, x_2, \ldots, x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)
\]
- The MLE is the \(\theta\) that maximizes this.
- Maximization is usually carried out on the log-likelihood \(\log L(\theta)\), which turns the product into a sum.
- MLEs are invariant: if \(\hat{\theta}\) is the MLE of \(\theta\), then \(g(\hat{\theta})\) is the MLE of \(g(\theta)\).
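A small numerical MLE, maximizing the log-likelihood of an exponential sample (a sketch; the exponential model, the true rate 2.5, and the use of a generic optimizer are illustrative choices, and the closed-form answer \(1/\bar{x}\) is printed for comparison):

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0 / 2.5, size=1_000)   # true rate lambda = 2.5

def neg_log_likelihood(lam):
    # log L(lambda) = n*log(lambda) - lambda * sum(x) for i.i.d. Exp(lambda) data
    return -(len(x) * np.log(lam) - lam * x.sum())

res = optimize.minimize_scalar(neg_log_likelihood, bounds=(1e-6, 100.0), method="bounded")
print(res.x, 1.0 / x.mean())   # numerical MLE vs the closed-form MLE 1 / xbar
```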
Bayes’ Theorem
- Law of total probability:
- If events \(A_1, A_2, \ldots, A_n\) are disjoint and cover the sample space, then: \[
P(B) = \sum_{i=1}^{n} P(A_i)P(B|A_i)
\]
- Bayes’ theorem: \[
P(A|B) = \frac{P(B|A)P(A)}{P(B)}
\]
- For continuous random variables: \[
f(y|x) = \frac{f(x|y)f(y)}{f(x)} = \frac{f(x|y)f(y)}{\int f(x|y)f(y) \, dy}
\]
Frequentist vs Bayesian
- Frequentists think about \(\theta\) as fixed and data as random.
- Bayesians think about \(\theta\) as random and data as fixed.
- Frequentists maximize likelihood.
- Bayesians maximize the posterior.
- Bayesians specify a prior distribution \(f(\theta)\) for the parameter and maximize the posterior: \[
f(\theta|x) = \frac{f(x|\theta)f(\theta)}{f(x)} = \frac{L(\theta)f(\theta)}{\int L(\theta)f(\theta) \, d\theta}
\]
- The denominator does not depend on \(\theta\), so it can be ignored when maximizing.
- Posterior \(\propto\) Likelihood \(\times\) Prior
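Because the normalizing constant does not involve \(\theta\), the posterior follows directly from likelihood \(\times\) prior; the conjugate Beta–Binomial model makes this concrete (a sketch; the prior parameters and data below are arbitrary examples):

```python
from scipy import stats

# Prior: theta ~ Beta(a, b); data: k successes in n Bernoulli(theta) trials.
a, b = 2.0, 2.0
n, k = 20, 7

# Posterior is Beta(a + k, b + n - k), since
# posterior ∝ theta^k (1-theta)^(n-k) * theta^(a-1) (1-theta)^(b-1).
posterior = stats.beta(a + k, b + n - k)

print(posterior.mean())                 # posterior mean (a + k) / (a + b + n)
print((k + a - 1) / (n + a + b - 2))    # posterior mode (MAP estimate)
```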
Markov Chains
- A Markov Chain is a stochastic process \(X_t\) starting at \(X_0\) and making successive transitions.
- States are \(0, 1, 2, \ldots\). The process is a Markov Chain if: \[
P(X_{t+1} = j | X_t = i, X_{t-1} = i_{t-1}, \ldots, X_0 = i_0) = P(X_{t+1} = j | X_t = i)
\] for all states \(i, j\). Transition only depends on the current state.
- A Markov Chain is irreducible if all states are accessible from all other states.
- Equivalently, from any state there is a non-zero probability of reaching any other state in a finite number of steps.
- It is homogeneous if the transition matrix is constant.
- A state is recurrent if the chain returns to it with probability 1, and transient otherwise.
- If the expected time until the chain returns to the state is finite, then the state is called non-null or positive recurrent.
- A state is aperiodic if returns to it are not restricted to multiples of a fixed period greater than 1.
- Ergodic if it is aperiodic and positive recurrent (returns in finite expected time).
- Stationary distribution:
- For an irreducible, ergodic Markov Chain, the \(n\)-step transition probabilities converge to a stationary distribution \(\pi\) independent of the initial state: \[
\pi_j = \lim_{n \to \infty} P_{ij}^{(n)}
\]
- \(\pi_j\) is the limiting probability of being in state \(j\).
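The stationary distribution of a small irreducible, aperiodic chain can be found either by powering the transition matrix or by solving \(\pi P = \pi\) (a sketch; the 3-state transition matrix is an arbitrary example):

```python
import numpy as np

# Row-stochastic transition matrix of a 3-state chain
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Powering the matrix: every row converges to the stationary distribution pi
print(np.linalg.matrix_power(P, 50)[0])

# Solving pi P = pi: pi is the left eigenvector of P for eigenvalue 1
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1.0))])
pi = pi / pi.sum()
print(pi)
```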