Statistical Inference and Hypothesis Testing

Definition

Statistical inference draws conclusions about population parameters from sample data. Hypothesis testing is a formal decision procedure that evaluates whether observed data is consistent with a null hypothesis $H_0$.

Intuition

Hypothesis testing asks: “if the null hypothesis were true, how surprising is the data I observed?” The p-value answers this: a small p-value means the data would be very rare under $H_0$, providing evidence against it. A confidence interval gives a plausible range of values for the parameter, consistent with the data at a given confidence level.

Formal Description

Estimators and Properties

An estimator $\hat{\theta}$ of a parameter $\theta$ is a function of the sample.

  • Bias: $\operatorname{Bias}(\hat{\theta}) = \mathbb{E}[\hat{\theta}] - \theta$
  • Variance: $\operatorname{Var}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2]$
  • MSE: $\operatorname{MSE}(\hat{\theta}) = \mathbb{E}[(\hat{\theta} - \theta)^2] = \operatorname{Bias}^2(\hat{\theta}) + \operatorname{Var}(\hat{\theta})$

The sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ is an unbiased estimator of $\mu$ with variance $\sigma^2/n$.
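
These properties can be checked by simulation. The sketch below (parameter values are illustrative) draws many samples, computes the sample mean of each, and compares the empirical bias, variance, and MSE to the theory above:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 50, 100_000

# Draw `reps` samples of size n and compute each sample's mean.
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

bias = means.mean() - mu          # ~0: the sample mean is unbiased
variance = means.var()            # ~ sigma**2 / n = 0.08
mse = ((means - mu) ** 2).mean()  # = bias**2 + variance

print(f"bias ~ {bias:.4f}, var ~ {variance:.4f} "
      f"(theory {sigma**2 / n:.4f}), MSE ~ {mse:.4f}")
```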

Hypothesis Testing Framework

  1. State $H_0$ (null) and $H_1$ (alternative).
  2. Choose a test statistic $T$ whose distribution under $H_0$ is known.
  3. Compute the observed test statistic $t_{\text{obs}}$.
  4. Compute the p-value: $p = P(|T| \ge |t_{\text{obs}}| \mid H_0)$ (two-sided).
  5. Reject $H_0$ if $p \le \alpha$ (significance level, typically 0.05).
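
The five steps can be sketched end-to-end with a one-sample $t$-test on synthetic data (the sample, null mean, and seed below are assumed for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=40)  # sample; true mean 0.3 by construction

mu0, alpha = 0.0, 0.05             # step 1: H0: mu = mu0 vs H1: mu != mu0

# Steps 2-3: t statistic, distributed t_{n-1} under H0.
n = len(x)
t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

# Step 4: two-sided p-value from the t distribution's survival function.
p = 2 * stats.t.sf(abs(t_obs), df=n - 1)

# Step 5: decision at significance level alpha.
reject = p <= alpha
print(f"t = {t_obs:.3f}, p = {p:.4f}, reject H0: {reject}")
```

The manual computation matches `scipy.stats.ttest_1samp`, which wraps the same steps.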

Error types:

|                      | $H_0$ true                        | $H_1$ true                        |
|----------------------|-----------------------------------|-----------------------------------|
| Reject $H_0$         | Type I error (FP), prob. $\alpha$ | Correct (power $1 - \beta$)       |
| Fail to reject $H_0$ | Correct                           | Type II error (FN), prob. $\beta$ |

Power $= 1 - \beta$: the probability of correctly rejecting a false $H_0$.
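
Power can be estimated by simulating data under a specific alternative and counting how often $H_0$ is rejected. A sketch, with an assumed effect size and sample size:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, effect, alpha, reps = 30, 0.5, 0.05, 2000

# Simulate under H1 (true mean = effect) and count rejections of H0: mu = 0.
rejections = 0
for _ in range(reps):
    x = rng.normal(effect, 1.0, size=n)
    if stats.ttest_1samp(x, 0.0).pvalue <= alpha:
        rejections += 1

power = rejections / reps
print(f"estimated power ~ {power:.2f}")
```

Increasing $n$ or the effect size raises the rejection rate; rerunning with `effect = 0.0` instead estimates the Type I error rate, which should sit near $\alpha$.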

Common Tests

One-sample $t$-test — test whether $\mu = \mu_0$ when $\sigma$ is unknown:

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0$$

Two-sample $t$-test (Welch) — compare means of two independent groups without assuming equal variances:

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$

with degrees of freedom from the Welch–Satterthwaite approximation.

Paired $t$-test — for paired observations, test the differences as a one-sample $t$-test.
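
All three $t$-test variants map directly onto `scipy.stats` calls; a sketch on synthetic data (group sizes and distributions assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, size=35)  # group A
b = rng.normal(11.0, 3.0, size=40)  # group B, different variance and size

# One-sample: H0: mean of a equals 10.
t1, p1 = stats.ttest_1samp(a, popmean=10.0)

# Welch two-sample: equal_var=False drops the equal-variance assumption.
t2, p2 = stats.ttest_ind(a, b, equal_var=False)

# Paired: pre/post measurements on the same 35 units.
pre = a
post = a + rng.normal(0.5, 1.0, size=35)
t3, p3 = stats.ttest_rel(pre, post)

print(f"one-sample p={p1:.3f}, Welch p={p2:.3f}, paired p={p3:.3f}")
```

As the text notes, the paired test is equivalent to a one-sample test on the differences `pre - post`.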

Chi-squared test of independence — test whether two categorical variables are independent:

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $E_{ij} = \dfrac{(\text{row } i \text{ total})\,(\text{column } j \text{ total})}{n}$.
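
A sketch using `scipy.stats.chi2_contingency` on a hypothetical 2×3 contingency table (counts are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: group; columns: response category.
observed = np.array([[30, 45, 25],
                     [20, 30, 50]])

chi2, p, dof, expected = chi2_contingency(observed)
# expected[i, j] = (row i total) * (column j total) / grand total
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
```

For an $r \times c$ table the degrees of freedom are $(r-1)(c-1)$, here $(2-1)(3-1) = 2$.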

Z-test for proportions — compare an observed proportion $\hat{p}$ to $p_0$ with large $n$:

$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$
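
A direct translation of the formula, with assumed counts (270 successes in 500 trials against $p_0 = 0.5$):

```python
import numpy as np
from scipy import stats

x, n, p0 = 270, 500, 0.5
p_hat = x / n

se = np.sqrt(p0 * (1 - p0) / n)        # standard error under H0
z = (p_hat - p0) / se
p_value = 2 * stats.norm.sf(abs(z))    # two-sided p-value from N(0, 1)

print(f"p_hat = {p_hat}, z = {z:.3f}, p = {p_value:.4f}")
```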

Confidence Intervals

A $100(1-\alpha)\%$ CI for $\mu$ with unknown $\sigma$:

$$\bar{X} \pm t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}$$

Interpretation: if the procedure were repeated many times, 95% of such intervals would contain the true $\mu$. It does not mean there is a 95% probability that $\mu$ lies in any one specific interval (a common misconception).
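
The interval formula is a few lines of code; a sketch on synthetic data (sample parameters assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(5.0, 2.0, size=25)

n = len(x)
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)        # estimated standard error
t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value

ci = (mean - t_crit * se, mean + t_crit * se)
print(f"95% CI for mu: ({ci[0]:.2f}, {ci[1]:.2f})")
```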

Multiple Testing

Running $m$ tests at $\alpha = 0.05$ yields, in expectation, $0.05\,m$ false positives when all nulls are true. Corrections:

  • Bonferroni: use $\alpha/m$ per test; controls the family-wise error rate (FWER), very conservative.
  • Benjamini-Hochberg: controls the false discovery rate (FDR) at level $q$; more powerful than Bonferroni when $m$ is large.
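
Both corrections are easy to apply by hand; a sketch with a hypothetical set of p-values (the Benjamini-Hochberg step-up rule rejects the largest $k$ with $p_{(k)} \le kq/m$):

```python
import numpy as np

def bh_reject(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the k smallest p-values,
    where k is the largest index with p_(k) <= k * q / m."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    k = below.nonzero()[0].max() + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.5, 0.9]
bonf = np.asarray(pvals) <= 0.05 / len(pvals)  # Bonferroni: alpha/m per test
bh = bh_reject(pvals, q=0.05)
print(f"Bonferroni rejects {bonf.sum()}, BH rejects {bh.sum()}")
# -> Bonferroni rejects 1, BH rejects 2
```

On this set BH rejects more hypotheses than Bonferroni at the same nominal level, illustrating the power difference noted above.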

Applications

  • A/B testing in product and ML model comparisons
  • Feature selection (test each feature’s association with target)
  • Clinical trials, regulatory submissions
  • Checking model residuals for normality or heteroscedasticity

Trade-offs

  • Statistical vs practical significance: a large enough sample makes tiny effects significant; always report effect sizes alongside p-values.
  • p-values do not give the probability that $H_0$ is true — only the probability of data at least as extreme as that observed, given $H_0$.
  • For online experimentation at scale, consider sequential testing (e.g., mSPRT) to avoid inflated error rates from early stopping.