Statistical Inference and Hypothesis Testing

Definition

Statistical inference draws conclusions about population parameters from sample data. Hypothesis testing is a formal decision procedure that evaluates whether observed data is consistent with a null hypothesis $H_0$.

Intuition

Hypothesis testing asks: “if the null hypothesis were true, how surprising is the data I observed?” The p-value answers this: a small p-value means the data would be very rare under $H_0$, providing evidence against it. A confidence interval gives a plausible range of values for the parameter, consistent with the data at a given confidence level.

Formal Description

Estimators and Properties

An estimator $\hat{\theta}$ of a parameter $\theta$ is a function of the sample.

  • Bias: $\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$
  • Variance: $\operatorname{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$
  • MSE: $\operatorname{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \operatorname{Bias}^2(\hat{\theta}) + \operatorname{Var}(\hat{\theta})$

The sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is an unbiased estimator of $\mu$ with variance $\sigma^2/n$.
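
These properties can be checked by simulation. The sketch below (parameter values are illustrative) draws many samples, computes the sample mean of each, and compares the empirical bias, variance, and MSE to the theory above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 50, 100_000

# Draw `reps` samples of size n and compute each sample's mean.
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

bias = means.mean() - mu          # ~0: the sample mean is unbiased
variance = means.var()            # ~ sigma**2 / n = 0.08
mse = ((means - mu) ** 2).mean()  # = bias**2 + variance

print(f"bias ~ {bias:.4f}, var ~ {variance:.4f} "
      f"(theory {sigma**2 / n:.4f}), MSE ~ {mse:.4f}")
```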

Hypothesis Testing Framework

  1. State $H_0$ (null) and $H_1$ (alternative).
  2. Choose a test statistic $T$ whose distribution under $H_0$ is known.
  3. Compute the observed test statistic $t_{\text{obs}}$.
  4. Compute the p-value: $p = P(|T| \ge |t_{\text{obs}}| \mid H_0)$ (two-sided).
  5. Reject $H_0$ if $p \le \alpha$ (significance level, typically 0.05).
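
The five steps can be sketched end-to-end with a one-sample $t$-test on synthetic data (the sample, null mean, and seed below are assumed for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=40)  # sample; true mean 0.3 by construction

mu0, alpha = 0.0, 0.05             # step 1: H0: mu = mu0 vs H1: mu != mu0

# Steps 2-3: t statistic, distributed t_{n-1} under H0.
n = len(x)
t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

# Step 4: two-sided p-value from the t distribution's survival function.
p = 2 * stats.t.sf(abs(t_obs), df=n - 1)

# Step 5: decision at significance level alpha.
reject = p <= alpha
print(f"t = {t_obs:.3f}, p = {p:.4f}, reject H0: {reject}")
```

The manual computation matches `scipy.stats.ttest_1samp`, which wraps the same steps.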

Error types:

|                      | $H_0$ true                        | $H_1$ true                        |
|----------------------|-----------------------------------|-----------------------------------|
| Reject $H_0$         | Type I error (FP), prob. $\alpha$ | Correct (power $1 - \beta$)       |
| Fail to reject $H_0$ | Correct                           | Type II error (FN), prob. $\beta$ |

Power $= 1 - \beta$: the probability of correctly rejecting a false $H_0$.
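
Power can be estimated by simulating data under a specific alternative and counting how often $H_0$ is rejected. A sketch, with an assumed effect size and sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, effect, alpha, reps = 30, 0.5, 0.05, 2000

# Simulate under H1 (true mean = effect) and count rejections of H0: mu = 0.
rejections = 0
for _ in range(reps):
    x = rng.normal(effect, 1.0, size=n)
    if stats.ttest_1samp(x, 0.0).pvalue <= alpha:
        rejections += 1

power = rejections / reps
print(f"estimated power ~ {power:.2f}")
```

Increasing $n$ or the effect size raises the rejection rate; rerunning with `effect = 0.0` instead estimates the Type I error rate, which should sit near $\alpha$.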

Common Tests

One-sample $t$-test — test whether $\mu = \mu_0$ when $\sigma$ is unknown:

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0$$

Two-sample $t$-test (Welch) — compare means of two independent groups without assuming equal variances:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

with degrees of freedom from the Welch–Satterthwaite approximation.

Paired $t$-test — for paired observations, test the differences as a one-sample $t$-test.
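
All three $t$-test variants map directly onto `scipy.stats` calls; a sketch on synthetic data (group sizes and distributions assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, size=35)  # group A
b = rng.normal(11.0, 3.0, size=40)  # group B, different variance and size

# One-sample: H0: mean of a equals 10.
t1, p1 = stats.ttest_1samp(a, popmean=10.0)

# Welch two-sample: equal_var=False drops the equal-variance assumption.
t2, p2 = stats.ttest_ind(a, b, equal_var=False)

# Paired: pre/post measurements on the same 35 units.
pre = a
post = a + rng.normal(0.5, 1.0, size=35)
t3, p3 = stats.ttest_rel(pre, post)

print(f"one-sample p={p1:.3f}, Welch p={p2:.3f}, paired p={p3:.3f}")
```

As the text notes, the paired test is equivalent to a one-sample test on the differences `pre - post`.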

Chi-squared test of independence — test whether two categorical variables are independent:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $E_{ij} = \dfrac{(\text{row } i \text{ total})\,(\text{column } j \text{ total})}{n}$.
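
A sketch using `scipy.stats.chi2_contingency` on a hypothetical 2×3 contingency table (counts are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: group; columns: response category.
observed = np.array([[30, 45, 25],
                     [20, 30, 50]])

chi2, p, dof, expected = chi2_contingency(observed)
# expected[i, j] = (row i total) * (column j total) / grand total
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

For an $r \times c$ table the degrees of freedom are $(r-1)(c-1)$, here $(2-1)(3-1) = 2$.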

Z-test for proportions — compare an observed proportion $\hat{p}$ to $p_0$ with large $n$:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$
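
A direct translation of the formula, with assumed counts (270 successes in 500 trials against $p_0 = 0.5$):

```python
import numpy as np
from scipy import stats

x, n, p0 = 270, 500, 0.5
p_hat = x / n

se = np.sqrt(p0 * (1 - p0) / n)        # standard error under H0
z = (p_hat - p0) / se
p_value = 2 * stats.norm.sf(abs(z))    # two-sided p-value from N(0, 1)

print(f"p_hat = {p_hat}, z = {z:.3f}, p = {p_value:.4f}")
```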

Confidence Intervals

A $100(1-\alpha)\%$ CI for $\mu$ with unknown $\sigma$:

$$\bar{X} \pm t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}$$

Interpretation: if the procedure were repeated many times, 95% of such intervals would contain the true $\mu$. It does not mean there is a 95% probability that $\mu$ lies in any one specific interval (a common misconception).
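
The interval formula is a few lines of code; a sketch on synthetic data (sample parameters assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, size=25)

n = len(x)
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)        # estimated standard error
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value

ci = (mean - t_crit * se, mean + t_crit * se)
print(f"95% CI for mu: ({ci[0]:.2f}, {ci[1]:.2f})")
```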

Multiple Testing

Running $m$ tests at $\alpha = 0.05$ yields, in expectation, $0.05\,m$ false positives when all nulls are true. Corrections:

  • Bonferroni: use $\alpha/m$ per test; controls the family-wise error rate (FWER), very conservative.
  • Benjamini-Hochberg: controls the false discovery rate (FDR) at level $q$; more powerful than Bonferroni when $m$ is large.
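
Both corrections are easy to apply by hand; a sketch with a hypothetical set of p-values (the Benjamini-Hochberg step-up rule rejects the largest $k$ with $p_{(k)} \le kq/m$):

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest index with p_(k) <= k * q / m."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
bonf = np.asarray(pvals) <= 0.05 / len(pvals)  # Bonferroni: alpha/m per test
bh = bh_reject(pvals, q=0.05)
print(f"Bonferroni rejects {bonf.sum()}, BH rejects {bh.sum()}")
# -> Bonferroni rejects 1, BH rejects 2
```

On this set BH rejects more hypotheses than Bonferroni at the same nominal level, illustrating the power difference noted above.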

Applications

  • A/B testing in product and ML model comparisons
  • Feature selection (test each feature’s association with target)
  • Clinical trials, regulatory submissions
  • Checking model residuals for normality or heteroscedasticity

Trade-offs

  • Statistical vs practical significance: a large enough sample makes tiny effects significant; always report effect sizes alongside p-values.
  • p-values do not give the probability that $H_0$ is true — only the probability of data at least as extreme as that observed, given $H_0$.
  • For online experimentation at scale, consider sequential testing (e.g., mSPRT) to avoid inflated error rates from early stopping.