Probability Theory
Definition
A mathematical framework for quantifying uncertainty. Assigns a number in $[0, 1]$ to events drawn from a sample space, obeying Kolmogorov’s axioms.
Intuition
Probability formalises the intuition that some outcomes are more likely than others. The axioms ensure probabilities are consistent: they don’t go negative, they add up to 1 across all possibilities, and disjoint events combine additively.
Formal Description
Sample space and events
- Sample space $\Omega$: the set of all possible outcomes.
- Event $A \subseteq \Omega$: a subset of outcomes.
- Probability measure $P$ on a $\sigma$-algebra $\mathcal{F}$ of events, satisfying:
  - $P(A) \ge 0$ for all $A \in \mathcal{F}$
  - $P(\Omega) = 1$
  - Countable additivity: $P\bigl(\bigcup_i A_i\bigr) = \sum_i P(A_i)$ for pairwise disjoint $A_i$
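The axioms can be checked concretely on a finite sample space. A minimal sketch in Python, using a fair die as a hypothetical example (the `prob` helper is illustrative, not a library function):

```python
from fractions import Fraction

# Finite sample space: one roll of a fair die (hypothetical example).
omega = {1, 2, 3, 4, 5, 6}
p = {outcome: Fraction(1, 6) for outcome in omega}

def prob(event):
    """P(A) for an event A ⊆ Ω: sum the point masses."""
    return sum(p[o] for o in event)

# The axioms hold: non-negativity, P(Ω) = 1, additivity on disjoint events.
evens, odds = {2, 4, 6}, {1, 3, 5}
assert prob(omega) == 1
assert prob(evens | odds) == prob(evens) + prob(odds)
```

Using exact `Fraction` arithmetic avoids floating-point noise when verifying identities like additivity.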
Derived rules
Conditional probability
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0$$
Chain rule (product rule): $P(A \cap B) = P(A \mid B)\,P(B)$
For $n$ events: $P(A_1 \cap \cdots \cap A_n) = \prod_{i=1}^{n} P(A_i \mid A_1, \ldots, A_{i-1})$
Total probability
Given a partition $\{B_i\}$ of $\Omega$:
$$P(A) = \sum_i P(A \mid B_i)\,P(B_i)$$
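A quick numerical sketch of the law of total probability, with hypothetical urn numbers chosen purely for illustration:

```python
from fractions import Fraction

# Hypothetical setup: urn choice partitions the sample space.
# Urn 1 (picked with prob 1/3) holds 3 red / 1 blue; urn 2 holds 1 red / 3 blue.
p_b = {"urn1": Fraction(1, 3), "urn2": Fraction(2, 3)}
p_red_given_b = {"urn1": Fraction(3, 4), "urn2": Fraction(1, 4)}

# P(red) = Σ_i P(red | B_i) P(B_i)
p_red = sum(p_red_given_b[b] * p_b[b] for b in p_b)
print(p_red)  # 1/3 · 3/4 + 2/3 · 1/4 = 1/4 + 1/6 = 5/12
```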
Bayes’ theorem
$$P(H \mid D) = \frac{P(D \mid H)\,P(H)}{P(D)}$$
This is the engine of Bayesian inference: $P(H \mid D)$ is the posterior, $P(H)$ is the prior, and $P(D \mid H)$ is the likelihood.
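A sketch of Bayes’ theorem in action. The diagnostic-test numbers (base rate, sensitivity, false-positive rate) are made up for illustration:

```python
# Hypothetical diagnostic-test numbers, chosen only for illustration.
p_h = 0.01              # prior P(H): base rate of the condition
p_d_given_h = 0.99      # likelihood P(D | H): test sensitivity
p_d_given_not_h = 0.05  # false-positive rate P(D | ¬H)

# Evidence P(D) via total probability, then the posterior via Bayes' theorem.
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)
posterior = p_d_given_h * p_h / p_d
print(round(posterior, 4))  # ≈ 0.1667: even a positive test leaves H unlikely
```

The low base rate dominates: despite a 99%-sensitive test, the posterior is only about 1/6, a standard illustration of why priors matter.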
Independence
$A$ and $B$ are independent iff $P(A \cap B) = P(A)\,P(B)$, equivalently $P(A \mid B) = P(A)$ when $P(B) > 0$.
$A$ and $B$ are conditionally independent given $C$ iff $P(A \cap B \mid C) = P(A \mid C)\,P(B \mid C)$.
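The product definition gives a direct computational test for independence. A sketch with two fair dice; the two events are chosen for illustration:

```python
from itertools import product
from fractions import Fraction

# Two fair dice: 36 equally likely outcomes.
outcomes = list(product(range(1, 7), repeat=2))

def prob(pred):
    """P of the event {o : pred(o)} under the uniform measure."""
    return Fraction(sum(1 for o in outcomes if pred(o)), len(outcomes))

def A(o):  # first die is even
    return o[0] % 2 == 0

def B(o):  # the sum is 7
    return o[0] + o[1] == 7

# P(A ∩ B) = P(A) P(B): these two events turn out to be independent.
assert prob(lambda o: A(o) and B(o)) == prob(A) * prob(B)
```

Here $P(A) = 1/2$, $P(B) = 1/6$, and $P(A \cap B) = 3/36 = 1/12$, so the product identity holds exactly.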
Random variables
A random variable $X: \Omega \to \mathbb{R}$ maps outcomes to real numbers.
- CDF: $F_X(x) = P(X \le x)$
- PMF (discrete): $p_X(x) = P(X = x)$
- PDF (continuous): $f_X$ such that $P(a \le X \le b) = \int_a^b f_X(x)\,dx$
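A minimal sketch of a PMF and the CDF built from it, for the discrete case (a fair die again, as a hypothetical example):

```python
from fractions import Fraction

# X = value of one fair die roll: its PMF as a dict.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def cdf(x):
    """F_X(x) = P(X <= x): accumulate the point masses up to x."""
    return sum(p for v, p in pmf.items() if v <= x)

assert cdf(3) == Fraction(1, 2)
assert cdf(6) == 1
```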
Expected value and variance
$$E[X] = \sum_x x\,p_X(x) \ \text{(discrete)}, \qquad E[X] = \int x\,f_X(x)\,dx \ \text{(continuous)}$$
$$\mathrm{Var}(X) = E\bigl[(X - E[X])^2\bigr] = E[X^2] - E[X]^2$$
Linearity of expectation: $E[aX + bY] = a\,E[X] + b\,E[Y]$ (no independence required).
Covariance: $\mathrm{Cov}(X, Y) = E\bigl[(X - E[X])(Y - E[Y])\bigr] = E[XY] - E[X]\,E[Y]$.
Correlation: $\rho_{X,Y} = \mathrm{Cov}(X, Y) / (\sigma_X \sigma_Y) \in [-1, 1]$.
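These definitions can be computed exactly for a discrete distribution. A sketch for a fair die, where $E[X] = 7/2$ and $\mathrm{Var}(X) = 35/12$ (the `E` helper is illustrative):

```python
from fractions import Fraction

# X = one fair die roll: compute E[X] and Var(X) straight from the PMF.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def E(f):
    """Expectation of f(X): sum f(x) weighted by the PMF."""
    return sum(f(x) * p for x, p in pmf.items())

mean = E(lambda x: x)               # 7/2
var = E(lambda x: (x - mean) ** 2)  # E[(X - E[X])^2] = 35/12

# The shortcut form agrees: Var(X) = E[X^2] - E[X]^2.
assert var == E(lambda x: x * x) - mean ** 2
```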
Applications
- Bayesian inference: updating beliefs given new data
- Machine learning: every probabilistic model (naive Bayes, GMMs, neural nets with cross-entropy) is built on these foundations
- Statistical hypothesis testing (p-values rely on conditional probability)
- Markov chains, hidden Markov models, graphical models
Trade-offs
- The frequentist interpretation ($P$ = long-run frequency) and the Bayesian interpretation ($P$ = degree of belief) are philosophically distinct but mathematically identical at the level of Kolmogorov’s axioms.
- Conditional probability is undefined when $P(B) = 0$; care is needed at measure-zero events.