Statistical Inference and Hypothesis Testing
Definition
Statistical inference draws conclusions about population parameters from sample data. Hypothesis testing is a formal decision procedure that evaluates whether observed data are consistent with a null hypothesis $H_0$.
Intuition
Hypothesis testing asks: "if the null hypothesis were true, how surprising is the data I observed?" The p-value answers this: a small p-value means the data would be very rare under $H_0$, providing evidence against it. A confidence interval gives a plausible range of values for the parameter, consistent with the data at a given confidence level.
Formal Description
Estimators and Properties
An estimator $\hat{\theta} = \hat{\theta}(X_1, \dots, X_n)$ is a function of the sample.
- Bias: $\operatorname{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$
- Variance: $\operatorname{Var}(\hat{\theta}) = E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]$
- MSE: $\operatorname{MSE}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big] = \operatorname{Bias}(\hat{\theta})^2 + \operatorname{Var}(\hat{\theta})$
The sample mean $\bar{X}$ is an unbiased estimator of $\mu$ with variance $\sigma^2 / n$.
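These properties can be checked by simulation. The sketch below (illustrative values: a normal population with $\mu = 5$, $\sigma = 2$, samples of size 50) estimates the bias, variance, and MSE of the sample mean over many repetitions; the variance should come out close to $\sigma^2/n = 0.08$.

```python
import numpy as np

# Simulation sketch: bias, variance, and MSE of the sample mean.
# Population and sample sizes below are illustrative choices.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 5.0, 2.0, 50, 20_000

# Each row is one simulated sample; take its mean.
means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

bias = means.mean() - mu           # approx 0: the sample mean is unbiased
var = means.var()                  # approx sigma**2 / n = 0.08
mse = np.mean((means - mu) ** 2)   # decomposes as bias**2 + variance
```

Note that `mse` equals `bias**2 + var` exactly (up to floating-point error), matching the bias-variance decomposition above.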
Hypothesis Testing Framework
- State $H_0$ (null) and $H_1$ (alternative).
- Choose a test statistic $T$ whose distribution under $H_0$ is known.
- Compute the observed test statistic $t_{\text{obs}}$.
- Compute the p-value: $p = P(|T| \ge |t_{\text{obs}}| \mid H_0)$ (two-sided).
- Reject $H_0$ if $p < \alpha$ (significance level, typically 0.05).
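The five steps can be carried out by hand for a one-sample two-sided $t$-test. The sketch below uses a simulated sample and a hypothetical null value $\mu_0 = 100$; it should agree with SciPy's built-in test.

```python
import numpy as np
from scipy import stats

# Hypothetical data: do these measurements have mean mu0 = 100?
rng = np.random.default_rng(1)
x = rng.normal(103, 10, size=40)   # simulated sample for illustration
mu0, alpha = 100.0, 0.05

# Steps 1-3: test statistic T = (xbar - mu0) / (s / sqrt(n)),
# which follows a t distribution with n-1 df under H0.
n = len(x)
t_obs = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))

# Step 4: two-sided p-value P(|T| >= |t_obs|) under H0.
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 1)

# Step 5: decision at level alpha.
reject = p_value < alpha
```

The manual computation matches `scipy.stats.ttest_1samp(x, mu0)`.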
Error types:
| | $H_0$ true | $H_1$ true |
|---|---|---|
| Reject $H_0$ | Type I error (FP), prob. $\alpha$ | Correct (power $1 - \beta$) |
| Fail to reject $H_0$ | Correct | Type II error (FN), prob. $\beta$ |
Power $1 - \beta$: the probability of correctly rejecting a false $H_0$.
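Power is easy to estimate by Monte Carlo: simulate many experiments under a specific alternative and count how often the test rejects. The sketch below (illustrative settings: true mean 0.5, $\sigma = 1$, $n = 30$, testing $H_0{:}\ \mu = 0$ at $\alpha = 0.05$) should give power around 0.75.

```python
import numpy as np
from scipy import stats

# Monte Carlo power estimate for a one-sample t-test.
# All numeric settings here are illustrative assumptions.
rng = np.random.default_rng(2)
true_mu, n, alpha, reps = 0.5, 30, 0.05, 5_000

# Simulate reps experiments under the alternative, test each one.
samples = rng.normal(true_mu, 1.0, size=(reps, n))
pvals = stats.ttest_1samp(samples, popmean=0.0, axis=1).pvalue

# Power = fraction of experiments in which H0 is (correctly) rejected.
power = np.mean(pvals < alpha)
```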
Common Tests
One-sample $t$-test — test whether $\mu = \mu_0$ when $\sigma$ is unknown:
$$t = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} \sim t_{n-1} \text{ under } H_0$$
Two-sample $t$-test (Welch) — compare means of two independent groups:
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$$
Paired $t$-test — for paired observations $(X_i, Y_i)$, test the differences $D_i = X_i - Y_i$ as a one-sample $t$-test of $\mu_D = 0$.
Chi-squared test of independence — test whether two categorical variables are independent:
$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $E_{ij} = \dfrac{(\text{row } i \text{ total})(\text{col } j \text{ total})}{n}$.
Z-test for proportions — compare an observed proportion $\hat{p}$ to $p_0$ with large $n$:
$$z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$$
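SciPy provides each of these tests directly. The sketch below runs the Welch, paired, and chi-squared tests on simulated data (group sizes, means, and the contingency table are all illustrative), and relies on the fact that a paired $t$-test is equivalent to a one-sample $t$-test on the differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, size=50)   # hypothetical group A
b = rng.normal(11.0, 3.0, size=60)   # hypothetical group B

# Welch two-sample t-test: equal_var=False uses the unequal-variance formula.
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)

# Paired t-test on before/after measurements of the same units.
before = rng.normal(5.0, 1.0, size=30)
after = before + rng.normal(0.3, 0.5, size=30)
t_pair, p_pair = stats.ttest_rel(before, after)

# Chi-squared test of independence on a 2x2 contingency table.
table = np.array([[30, 10], [20, 25]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
```

`ttest_rel(before, after)` gives the same two-sided p-value as `ttest_1samp(after - before, 0.0)`, illustrating the reduction to a one-sample test.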
Confidence Intervals
A $100(1-\alpha)\%$ CI for $\mu$ with unknown $\sigma$:
$$\bar{X} \pm t_{n-1,\,1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
Interpretation: if the procedure were repeated many times, 95% of such intervals would contain the true $\mu$. It does not mean there is a 95% probability that $\mu$ lies in any one specific interval (a common misconception).
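Computing the $t$-interval from the formula is a few lines. The sketch below uses a hypothetical sample of size 25 and a 95% confidence level; the result should match SciPy's `t.interval` helper.

```python
import numpy as np
from scipy import stats

# Sketch: 95% t-interval for mu with unknown sigma,
# xbar +/- t_crit * s / sqrt(n). The sample is simulated for illustration.
rng = np.random.default_rng(4)
x = rng.normal(20.0, 5.0, size=25)

n, conf = len(x), 0.95
xbar, s = x.mean(), x.std(ddof=1)
t_crit = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)   # t_{n-1, 1-alpha/2}
half_width = t_crit * s / np.sqrt(n)
ci = (xbar - half_width, xbar + half_width)
```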
Multiple Testing
Running $m$ tests at $\alpha = 0.05$ yields about $0.05m$ false positives in expectation when all nulls are true. Corrections:
- Bonferroni: use $\alpha/m$ per test; controls the family-wise error rate (FWER), very conservative.
- Benjamini-Hochberg: controls the false discovery rate (FDR) at level $q$; more powerful than Bonferroni when $m$ is large.
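Both corrections are short to implement. A minimal sketch (function names are my own; `statsmodels.stats.multitest.multipletests` offers production versions of both):

```python
import numpy as np

def bonferroni(pvals, alpha=0.05):
    """Reject p_i if p_i < alpha / m (controls FWER)."""
    p = np.asarray(pvals)
    return p < alpha / len(p)

def benjamini_hochberg(pvals, q=0.05):
    """BH step-up: sort p-values, find the largest k with
    p_(k) <= k * q / m, and reject the k smallest (controls FDR)."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    thresholds = (np.arange(1, m + 1) / m) * q
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])   # largest passing rank
        reject[order[: k + 1]] = True
    return reject
```

On illustrative p-values `[0.001, 0.012, 0.03, 0.04, 0.6]` with $m = 5$, Bonferroni (threshold $0.05/5 = 0.01$) rejects only the first, while BH rejects the first four, showing its greater power.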
Applications
- A/B testing in product and ML model comparisons
- Feature selection (test each feature’s association with target)
- Clinical trials, regulatory submissions
- Checking model residuals for normality or heteroscedasticity
Trade-offs
- Statistical vs practical significance: a large enough sample makes tiny effects significant; always report effect sizes alongside p-values.
- p-values do not give the probability that $H_0$ is true — only the probability of the data given $H_0$.
- For online experimentation at scale, consider sequential testing (e.g., mSPRT) to avoid inflated error rates from early stopping.