A/B Testing and Experiment Design
Problem Context
Online controlled experiments (A/B tests) are the gold standard for measuring the causal effect of a product or model change. The goal is to isolate a single intervention and measure its effect on a metric of business interest, with enough statistical confidence to make a decision. Errors in design (under-powered tests, mis-specified metrics) or analysis (peeking, multiple comparisons) silently corrupt that causal inference.
Typical contexts: ranking model changes, pricing algorithms, UI treatments, recommendation system variants, risk model thresholds.
Core Concepts
Null and Alternative Hypotheses
- H₀ (null): the treatment has no effect; observed differences are due to chance.
- H₁ (alternative): the treatment has a nonzero (or directional) effect.
The test never “accepts H₀” — it either rejects it (p < α) or fails to reject it.
Type I and Type II Errors
| | H₀ true | H₀ false |
|---|---|---|
| Reject H₀ | Type I error (false positive, rate = α) | Correct (power = 1 − β) |
| Fail to reject H₀ | Correct | Type II error (false negative, rate = β) |
Standard settings: α = 0.05, β = 0.20 (power = 80 %). Reducing both requires a larger sample.
Statistical Significance vs Practical Significance
A result can be statistically significant (p < 0.05) but practically meaningless: a 0.001 % lift in conversion is detectable with millions of users but is rarely worth the engineering effort. Always report an effect size (Cohen’s d, absolute or relative lift) with its confidence interval alongside the p-value, and compare it against the MDE.
Minimum Detectable Effect (MDE): the smallest true effect for which the test has the pre-specified power. Set the MDE based on the business threshold for action, not on what happens to be detectable with available traffic.
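To make the distinction concrete, the sketch below reports the absolute difference, the relative lift, and a Wald confidence interval for a two-arm conversion test, then compares the interval with a business threshold. The function name `lift_summary`, the counts, and the 0.5 pp threshold are illustrative assumptions, not values from a real experiment.

```python
from math import sqrt
from statistics import NormalDist


def lift_summary(x_c, n_c, x_t, n_t, alpha=0.05):
    """Absolute difference, relative lift, and Wald CI for p_t - p_c."""
    p_c, p_t = x_c / n_c, x_t / n_t
    diff = p_t - p_c
    # Unpooled standard error of the difference in proportions.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "abs_diff": diff,
        "rel_lift": diff / p_c,
        "ci": (diff - z * se, diff + z * se),
    }


# Hypothetical counts: 2,000,000 users per arm, 10.00 % vs 10.10 % conversion.
summary = lift_summary(200_000, 2_000_000, 202_000, 2_000_000)
practical_threshold = 0.005  # assumed business threshold: 0.5 pp absolute lift

statistically_significant = summary["ci"][0] > 0
practically_significant = summary["ci"][0] >= practical_threshold
print(summary, statistically_significant, practically_significant)
```

With these made-up counts the interval excludes zero but sits well below the assumed threshold, which is the "significant but not worth shipping" case described above.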
p-values
The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming H₀ is true. It is not the probability that H₀ is true, and it is not the probability that the result is a false positive. Misreading p-values is a primary source of incorrect decisions.
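One way to internalise this definition is to simulate many A/A comparisons, where H₀ is true by construction, and check that roughly a fraction α of them produce p < α. The sketch below uses only the standard library; the baseline rate, sample sizes, and seed are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(x_c, n_c, x_t, n_t):
    """Two-sided p-value of the pooled two-proportion z-test."""
    pooled = (x_c + x_t) / (n_c + n_t)
    se = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:
        return 1.0  # no variation observed, so no evidence against H0
    z = (x_t / n_t - x_c / n_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa(n_sims=1000, n_per_arm=500, rate=0.10, alpha=0.05, seed=0):
    """Share of A/A tests (no true effect) that reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        x_c = sum(rng.random() < rate for _ in range(n_per_arm))
        x_t = sum(rng.random() < rate for _ in range(n_per_arm))
        hits += two_sided_p(x_c, n_per_arm, x_t, n_per_arm) < alpha
    return hits / n_sims


false_positive_rate = simulate_aa()
print(f"Share of A/A tests with p < 0.05: {false_positive_rate:.3f}")
```

The share should land near 0.05. That is a statement about how often the procedure errs when H₀ holds, not about the probability that H₀ is true for any particular experiment.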
A/B vs Multivariate Testing
| Method | When to use | Trade-off |
|---|---|---|
| A/B (2 arms) | Single change, clear hypothesis | Simple; loses signal about interactions |
| A/B/n (multi-arm) | Several variants of the same change | Requires corrections for multiple comparisons |
| Multivariate (MVT) | Testing combinations of features simultaneously | Requires factorial traffic splits; hard to interpret interactions |
| Bandit (MAB) | When exploration-exploitation matters in production | Adapts allocation but invalidates classical inference |
Experiment Design Checklist
- Define the primary metric before running. A single north-star metric with a pre-specified MDE and direction (one-tailed vs two-tailed).
- Define guardrail metrics (e.g., latency, error rate, revenue) that must not regress; failure on a guardrail is a veto.
- Randomization unit: user-level (long-term effects), session-level (short-term, risk of cross-contamination), request-level (high variance).
- Power calculation: determine required sample size before starting (see §Power Analysis).
- Pre-experiment AA test: run an A/A test with the same randomization logic. A statistically significant difference in an A/A test signals instrumentation or assignment bugs.
- Holdout period and pre-period: measure baseline metrics for the same period in the prior cycle to detect instrumentation drift.
- Novelty effect window: new UI treatments often show a short-lived engagement spike. Plan a run duration that extends beyond it (typically ≥ 2 weeks for consumer products).
- Unit of diversion matches unit of analysis: if users are the unit of diversion, the analysis must aggregate to users, not to events (variance inflation if mis-matched).
- Ship criteria written before launch: what p-value threshold and absolute lift are required to ship? Decide in advance.
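The "unit of diversion matches unit of analysis" item in the checklist above can be sketched as follows: when users are randomized, events are collapsed to one row per user before any test is run. The event tuples and helper names here are hypothetical, and "converted at least once" is just one possible user-level definition of the metric.

```python
from collections import defaultdict

# Hypothetical event log: (user_id, arm, converted) with several events per user.
events = [
    ("u1", "control", 0), ("u1", "control", 1), ("u1", "control", 0),
    ("u2", "control", 0),
    ("u3", "treatment", 1), ("u3", "treatment", 1),
    ("u4", "treatment", 0), ("u4", "treatment", 0),
]


def aggregate_to_users(events):
    """Collapse events to one row per user: converted at least once or not."""
    per_user = {}
    for user_id, arm, converted in events:
        prev_arm, prev_conv = per_user.get(user_id, (arm, 0))
        if prev_arm != arm:
            raise ValueError(f"user {user_id} appears in more than one arm")
        per_user[user_id] = (arm, max(prev_conv, converted))
    return per_user


def arm_counts(per_user):
    """Return {arm: (converted_users, total_users)} for the user-level test."""
    counts = defaultdict(lambda: [0, 0])
    for arm, converted in per_user.values():
        counts[arm][0] += converted
        counts[arm][1] += 1
    return {arm: tuple(c) for arm, c in counts.items()}


user_level = arm_counts(aggregate_to_users(events))
print(user_level)
```

Feeding the eight raw events into a test would treat correlated events from the same user as independent observations and understate the variance; the user-level counts avoid that.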
Statistical Framework
Two-sample z-test for proportions (conversion metrics)
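Let $\hat{p}_c$ and $\hat{p}_t$ be the control and treatment conversion rates and $N$ the total sample size (equal split assumed, so each arm has $N/2$ units).
$$z = \frac{\hat{p}_t - \hat{p}_c}{\sqrt{\hat{p}\,(1 - \hat{p}) \cdot \frac{4}{N}}}$$
where $\hat{p} = (\hat{p}_c + \hat{p}_t)/2$ is the pooled estimate.
Reject H₀ when $|z| > z_{\alpha/2}$ (two-tailed) or $z > z_{\alpha}$ (one-tailed).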
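As a cross-check on the library call, the pooled z statistic can also be computed by hand. The sketch below is written for general arm sizes `n_c` and `n_t` (it reduces to the equal-split case when they match) and uses only the standard library; the function name and the counts are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def pooled_z_test(x_c, n_c, x_t, n_t):
    """Pooled two-proportion z statistic and two-sided p-value."""
    p_c, p_t = x_c / n_c, x_t / n_t
    p_pool = (x_c + x_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: 10,000 users per arm, 1,000 vs 1,100 conversions.
z, p_value = pooled_z_test(1_000, 10_000, 1_100, 10_000)
z_crit = NormalDist().inv_cdf(0.975)  # two-tailed critical value at alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.4f}, reject H0: {abs(z) > z_crit}")
```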
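```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest  # lives in statsmodels, not scipy

# Illustrative counts; replace with the observed values from each arm.
control_conversions, n_control = 1_000, 10_000
treatment_conversions, n_treatment = 1_100, 10_000

conversions = np.array([control_conversions, treatment_conversions])
nobs = np.array([n_control, n_treatment])
# Pooled z-test; the statistic is first minus second (control minus treatment here).
stat, p_value = proportions_ztest(conversions, nobs, alternative='two-sided')
print(f"z = {stat:.3f}, p = {p_value:.4f}")
```
Two-sample t-test for continuous metrics (revenue, engagement time)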
```python
from scipy.stats import ttest_ind
# Welch's t-test (equal_var=False); inputs are per-user metric arrays, one value per unit.
stat, p_value = ttest_ind(control_values, treatment_values, equal_var=False)
```
Multiple comparisons correction
Testing $m$ hypotheses simultaneously inflates the family-wise error rate (FWER). Use:
- Bonferroni: test each hypothesis at $\alpha / m$. Conservative; controls FWER.
- Benjamini-Hochberg: controls the False Discovery Rate (FDR) at level $q$; less conservative, preferred when many metrics are tested.
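To show what the two corrections listed above actually do, here is a from-scratch sketch of both procedures. The eight p-values are made up for illustration, and the function names are not from any library.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls FWER)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Step-up BH procedure: reject the k smallest p-values, where k is the
    largest rank with p_(k) <= (k / m) * q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values for eight metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

On this input Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, which illustrates why BH is described as less conservative. The FDR guarantee of BH holds for independent or positively dependent tests.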
```python
from statsmodels.stats.multitest import multipletests
# p_values: sequence of raw p-values, one per tested metric.
reject, pvals_corrected, _, _ = multipletests(p_values, alpha=0.05, method='fdr_bh')
```
Power Analysis
Sample size for a two-proportion test
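$$n = \frac{2\,(z_{\alpha/2} + z_{\beta})^2 \; p\,(1 - p)}{\delta^2}$$
where $n$ is the required sample size per group and:
- $p$ = baseline conversion rate
- $\delta$ = MDE (absolute difference, e.g., 0.01 for a 1 pp lift)
- $z_{\alpha/2}$ = 1.96 (α = 0.05, two-tailed)
- $z_{\beta}$ = 0.84 (power = 80 %) or 1.28 (power = 90 %)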
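A worked version of this calculation, assuming the common normal-approximation formula n = 2 (z_{α/2} + z_β)² p(1 − p) / δ² per arm, is sketched below with the standard library only. The function name and the 10 % baseline with a 1 pp MDE are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_baseline, mde_abs, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided two-proportion test,
    using the baseline variance p(1 - p) for both arms."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for power = 0.80
    variance = p_baseline * (1 - p_baseline)
    return ceil(2 * (z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


# Hypothetical inputs: 10 % baseline conversion, 1 pp absolute MDE.
n_80 = n_per_arm(0.10, 0.01, power=0.80)
n_90 = n_per_arm(0.10, 0.01, power=0.90)
print(n_80, n_90)
```

For these inputs the requirement is roughly 14,100 users per arm at 80 % power and roughly 18,900 at 90 % power. Halving the MDE would quadruple both figures, because δ enters the formula squared.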
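```python
import numpy as np
from statsmodels.stats.power import zt_ind_solve_power

# Illustrative inputs; replace with your baseline rate and absolute MDE.
p_baseline, mde = 0.10, 0.01

# For proportions: standardize the absolute MDE by the baseline standard deviation.
n_per_group = zt_ind_solve_power(
    effect_size=mde / np.sqrt(p_baseline * (1 - p_baseline)),
    alpha=0.05,
    power=0.80,
    alternative='two-sided',
)
print(f"Required per group: {n_per_group:.0f}")
```
For continuous metrics, use Cohen’s d: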
```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solves for nobs1, the sample size of the first group (equal group sizes by default).
n = analysis.solve_power(
    effect_size=cohens_d,  # (mu_t - mu_c) / pooled_std
    alpha=0.05,
    power=0.80,
    alternative='two-sided',
)
```
Effect size calibration
Set the MDE from historical data and business thresholds:
- Compute the metric’s standard deviation from the past 30 days of control traffic.
- Set $\delta$ to the smallest lift that would change the product decision (not the smallest detectable lift).
- Higher baseline variance → larger required n → longer runtime or reduced power.
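The calibration steps above can be sketched as follows: estimate the metric's standard deviation from historical control data, fix the smallest lift that matters, and convert both into a per-arm sample size. The revenue values, the lift of 1.0, and the function name are hypothetical; a real calculation would use the full 30 days of data.

```python
from math import ceil
from statistics import NormalDist, stdev


def n_per_arm_continuous(sigma, delta, alpha=0.05, power=0.80):
    """Per-arm sample size for detecting an absolute lift `delta` in a metric
    with standard deviation `sigma` (normal approximation)."""
    cohens_d = delta / sigma
    z_sum = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * z_sum ** 2 / cohens_d ** 2)


# Hypothetical per-user revenue from control traffic (stand-in for 30 days of data).
history = [0.0, 0.0, 12.5, 0.0, 30.0, 0.0, 8.0, 0.0, 0.0, 45.0, 0.0, 19.5]
sigma = stdev(history)
delta = 1.0  # assumed smallest lift that would change the product decision

print(f"sigma = {sigma:.2f}, n per arm = {n_per_arm_continuous(sigma, delta)}")
print(f"with doubled sigma: {n_per_arm_continuous(2 * sigma, delta)}")
```

Doubling the standard deviation roughly quadruples the required sample, which is the variance-to-runtime relationship in the last bullet.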
Runtime estimation
$$\text{runtime (days)} = \frac{2n}{\text{daily eligible traffic}}$$
The factor of 2 accounts for both arms. Traffic allocation other than 50/50 increases the required total n.
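A rough sketch of the runtime estimate, including the penalty for an unequal split, is shown below. It assumes a constant daily inflow of eligible users and the normal-approximation sample size for proportions; the function names, the 5,000 users per day, and the 10 % treatment share are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist


def total_sample_size(p_baseline, mde_abs, treat_share=0.5, alpha=0.05, power=0.80):
    """Total users across both arms when a fraction `treat_share` goes to treatment."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline)
    # Var(diff) = variance * (1/n_t + 1/n_c) with n_t = share * N, n_c = (1 - share) * N.
    allocation_factor = 1 / treat_share + 1 / (1 - treat_share)
    return ceil(z ** 2 * variance * allocation_factor / mde_abs ** 2)


def runtime_days(total_n, daily_eligible_users):
    """Days needed to enrol `total_n` users at a constant daily inflow."""
    return ceil(total_n / daily_eligible_users)


# Hypothetical inputs: 10 % baseline, 1 pp MDE, 5,000 eligible users per day.
balanced = total_sample_size(0.10, 0.01, treat_share=0.5)
skewed = total_sample_size(0.10, 0.01, treat_share=0.1)
print(runtime_days(balanced, 5_000), runtime_days(skewed, 5_000))
```

With these inputs a 50/50 split needs about 6 days of traffic and a 90/10 split about 16. The statistical minimum is only a lower bound: the weekly-cycle and novelty-window guidance elsewhere in this note still applies.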
Sequential Testing
Classical tests assume a fixed sample size chosen before data collection begins. Repeatedly peeking at the results and stopping as soon as significance is reached inflates the actual Type I error rate far above α.
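The size of this inflation is easy to see in simulation. The sketch below runs A/A experiments on standard-normal data (so H₀ is true), checks a z-test after every batch, and stops at the first nominally significant look. The number of looks, batch size, and seed are arbitrary assumptions chosen to keep the run short.

```python
import random
from math import sqrt
from statistics import NormalDist


def peeking_false_positive_rate(n_sims=1000, looks=10, batch=50, alpha=0.05, seed=1):
    """Share of A/A experiments declared significant at ANY of `looks` interim checks."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        sum_c = sum_t = 0.0
        n = 0
        for _ in range(looks):
            # Both arms draw from the same N(0, 1) distribution, so H0 is true.
            sum_c += sum(rng.gauss(0, 1) for _ in range(batch))
            sum_t += sum(rng.gauss(0, 1) for _ in range(batch))
            n += batch
            z = (sum_t / n - sum_c / n) / sqrt(2 / n)  # known unit variance
            if abs(z) > z_crit:
                false_positives += 1
                break  # stop at the first "significant" peek
    return false_positives / n_sims


rate = peeking_false_positive_rate()
print(f"False positive rate with 10 peeks: {rate:.3f}")
```

With ten looks the simulated false positive rate should come out several times larger than the nominal 5 %, while the same function with `looks=1` stays close to it.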
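Alpha spending (Lan-DeMets): pre-allocate the Type I error budget across pre-planned looks. At look $k$ out of $K$, the test may spend
$$\alpha_k = \alpha^*(t_k) - \alpha^*(t_{k-1})$$
where $t_k$ is the information fraction (the share of the planned maximum sample observed by look $k$) and $\alpha^*(t)$ is a non-decreasing spending function with $\alpha^*(0) = 0$ and $\alpha^*(1) = \alpha$.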
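As a minimal sketch of how the budget is split, the code below evaluates one common choice of spending function, the O'Brien-Fleming-type function α*(t) = 2(1 − Φ(z_{α/2}/√t)), at equally spaced looks and reports the increment available at each one. The choice of spending function, the five looks, and the function names are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def obf_spending(t, alpha=0.05):
    """O'Brien-Fleming-type spending function: cumulative alpha spent at fraction t."""
    if not 0 < t <= 1:
        raise ValueError("information fraction must be in (0, 1]")
    z = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _STD_NORMAL.cdf(z / sqrt(t)))


def alpha_increments(n_looks, alpha=0.05):
    """Alpha made available at each of `n_looks` equally spaced looks."""
    spent_before = 0.0
    increments = []
    for k in range(1, n_looks + 1):
        spent = obf_spending(k / n_looks, alpha)
        increments.append(spent - spent_before)
        spent_before = spent
    return increments


for k, a_k in enumerate(alpha_increments(5), start=1):
    print(f"look {k}: alpha increment = {a_k:.5f}")
```

The increments sum to α and are heavily back-loaded, so early looks can only stop the test on overwhelming evidence. Turning the increments into exact critical values requires the joint distribution of the interim statistics, which dedicated group-sequential software handles numerically.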
Sequential Probability Ratio Test (SPRT): continuously computes a likelihood ratio; stops when it crosses an upper bound (reject H₀) or lower bound (accept H₀). Fully sequential, no pre-planned maximum n, but requires specifying H₁ explicitly.
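The mechanics can be sketched with a one-sample simplification: Wald's SPRT on a stream of 0/1 outcomes, testing a conversion rate p0 against a specific alternative p1. A two-arm A/B version needs a likelihood ratio defined on the difference between arms, which this sketch does not attempt; the function name and the rates are illustrative assumptions.

```python
from math import log


def sprt_bernoulli(observations, p0, p1, alpha=0.05, beta=0.20):
    """Wald's SPRT for H0: p = p0 versus H1: p = p1 on a stream of 0/1 outcomes.

    Returns (decision, n_used); decision is "reject H0", "accept H0",
    or "continue" if the stream ends before a boundary is crossed.
    """
    upper = log((1 - beta) / alpha)   # cross above: reject H0
    lower = log(beta / (1 - alpha))   # cross below: accept H0
    llr = 0.0                         # cumulative log-likelihood ratio
    n = 0
    for n, x in enumerate(observations, start=1):
        llr += log(p1 / p0) if x else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "reject H0", n
        if llr <= lower:
            return "accept H0", n
    return "continue", n


print(sprt_bernoulli([1] * 50, p0=0.10, p1=0.15))
print(sprt_bernoulli([0] * 50, p0=0.10, p1=0.15))
```

The thresholds are Wald's approximations, so the realised error rates are close to, but not exactly, the nominal α and β.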
Always-Valid Inference (mSPRT): confidence sequences that are valid at any stopping time. Used in platforms like Statsig and Optimizely Stats Engine; allows continuous monitoring without inflation.
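```python
# Simple alpha-spending illustration (approximate O'Brien-Fleming-shaped boundary)
import numpy as np
from scipy.stats import norm

def obrien_fleming_boundary(t, alpha=0.05):
    """Approximate critical value at information fraction t ∈ (0, 1].

    Exact O'Brien-Fleming boundaries calibrate the constant to the number of looks.
    """
    return norm.ppf(1 - alpha / 2) / np.sqrt(t)
```
Common Pitfalls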
P-hacking / data dredging
Stopping when p < 0.05 is reached, testing multiple metrics without correction, or running multiple segments and reporting only the significant one — all inflate the Type I error rate dramatically. Enforce pre-registration: metric, MDE, α, and runtime must be documented before the experiment starts.
Novelty and primacy effects
Users behave differently when they encounter a change for the first time (novelty) or resist changing habitual behaviour (primacy). Run experiments long enough to observe steady-state behaviour — typically ≥ 2 full weekly cycles for consumer products.
SUTVA violations (Stable Unit Treatment Value Assumption)
SUTVA assumes that the treatment assigned to one unit does not affect outcomes for other units. It is violated by:
- Network effects: social platforms where treated users communicate with control users.
- Marketplace interference: pricing experiments that shift demand and affect control-arm prices.
- Shared resources: two experiment arms competing for the same cache or recommendation pool.
Mitigations: cluster randomization (randomize at the group/market level), holdout clusters, ego-network disjoint splits.
Survivorship and selection bias
Analysing only users who completed a funnel step post-randomization (e.g., only users who made a purchase) excludes users who were deterred by the treatment — the excluded population is non-random across arms.
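A small worked example, with entirely hypothetical numbers, shows how conditioning on purchasers can flip the conclusion. Here the treatment deters some buyers, so the remaining purchasers spend more on average even though revenue per randomized user falls.

```python
# Hypothetical outcome of a pricing test with 1,000 randomized users per arm.
arms = {
    "control":   {"randomized": 1000, "purchasers": 100, "revenue": 5000.0},
    "treatment": {"randomized": 1000, "purchasers": 80,  "revenue": 4400.0},
}

for name, a in arms.items():
    # Conditioning on a post-randomization outcome (purchase) breaks comparability.
    a["revenue_per_purchaser"] = a["revenue"] / a["purchasers"]
    # Intention-to-treat: every randomized user stays in the denominator.
    a["revenue_per_randomized_user"] = a["revenue"] / a["randomized"]
    print(name, a["revenue_per_purchaser"], a["revenue_per_randomized_user"])
```

Per purchaser the treatment looks better (55.0 vs 50.0), while per randomized user it is worse (4.4 vs 5.0). Only the second comparison preserves the randomization.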
Cookie churn / re-randomization
If the randomization ID (cookie) is volatile, users may switch arms mid-experiment, contaminating both arms. Use stable identifiers (user ID) wherever possible.
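One common way to keep assignment stable is to derive the arm deterministically from a hash of the user ID and the experiment name, so no per-device state is needed. The sketch below is an assumption about how such a scheme might look, not a description of any particular platform; the key format, the use of SHA-256, and the function name are illustrative.

```python
import hashlib


def assign_arm(user_id, experiment, arms=("control", "treatment")):
    """Deterministically map a stable user ID to an arm for one experiment."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big")  # 64-bit integer from the hash
    return arms[bucket % len(arms)]


# The same user always lands in the same arm, regardless of device or session.
first = assign_arm("user-42", "checkout-redesign")
again = assign_arm("user-42", "checkout-redesign")
print(first, again)
```

Including the experiment name in the hashed key keeps assignments independent across experiments, so a user in treatment for one test is not systematically in treatment for the next.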
Metric mis-specification
Using a proxy metric that is easy to move but does not causally predict long-term business value. Validate proxy metrics against long-term outcomes periodically.
Underpowered tests
Declaring a result “not significant” after an under-powered test does not mean the treatment has no effect. Always report the MDE and achieved power alongside a null result.
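To report power alongside a null result, the power of the test at the pre-specified effect of interest can be computed from the sample actually collected. The sketch below uses the normal approximation for a two-proportion test; the function name and the inputs are illustrative assumptions. It evaluates power at the MDE fixed in advance, not at the observed effect.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(p_baseline, effect_abs, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test for a given true effect."""
    se = sqrt(2 * p_baseline * (1 - p_baseline) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_abs / se
    # Probability that |Z| exceeds z_crit when Z ~ N(shift, 1).
    return (1 - NormalDist().cdf(z_crit - shift)) + NormalDist().cdf(-z_crit - shift)


# Hypothetical null result: 10 % baseline, 2,000 users per arm, 1 pp effect of interest.
power = achieved_power(0.10, 0.01, 2_000)
print(f"Power to detect a 1 pp lift: {power:.0%}")
```

With 2,000 users per arm the test would detect a true 1 pp lift less than one time in five, so a non-significant outcome says very little about whether the lift exists.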
References
- Kohavi, R., Tang, D., & Xu, Y. (2020). Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing. Cambridge University Press.
- Kohavi, R. et al. (2009). Controlled experiments on the web: Survey and practical guide. Data Mining and Knowledge Discovery, 18, 140–181.
- Johari, R. et al. (2015). Always valid inference: Bringing sequential analysis to A/B testing. arXiv:1512.04922
- Deng, A. et al. (2013). Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. WSDM 2013.
- statsmodels power analysis docs
Links
- 05 — Experimentation and Validation
- Probability and Statistics
- Evaluation and Model Selection
- Data Validation — complementary data quality gates
- Hypothesis Testing — formal test theory underlying A/B tests