Data Validation
Definition
The systematic verification that data meets expected structural and statistical properties before it enters a model pipeline, preventing silent failures from corrupt, shifted, or malformed inputs.
Intuition
Training a model on bad data creates a model that is confidently wrong. Data validation adds a quality gate: the pipeline asserts that inputs conform to a schema, that distributions match expectations, and that no forbidden values or unexpected patterns are present. Catching problems early is far cheaper than debugging a misbehaving model in production.
Formal Description
Schema Validation
Assert column types, presence, and allowed values:
```python
import pandera as pa

schema = pa.DataFrameSchema({
    'age': pa.Column(float, pa.Check.between(18, 120)),
    'income': pa.Column(float, pa.Check.greater_than_or_equal_to(0)),
    'gender': pa.Column(str, pa.Check.isin(['M', 'F', 'Unknown'])),
    'claim_flag': pa.Column(int, pa.Check.isin([0, 1])),
})

validated_df = schema.validate(df)  # raises SchemaError on failure
```
Statistical Distribution Checks
Verify that feature distributions match training baseline.
KS test — two-sample test for distribution shift:
```python
from scipy.stats import ks_2samp

stat, p_value = ks_2samp(train['age'], new_data['age'])
# p < 0.05: statistically significant shift
```
Population Stability Index (PSI) — widely used in banking/insurance:
PSI = Σ_i (A_i − E_i) · ln(A_i / E_i)

where A_i and E_i are the actual and expected proportions in bin i.
| PSI value | Interpretation |
|---|---|
| < 0.1 | No significant shift |
| 0.1 – 0.25 | Moderate shift, investigate |
| > 0.25 | Major shift, do not use model |
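The PSI formula above can be sketched directly in NumPy. This is a minimal illustrative implementation (the function name `psi` and the choice of decile bins derived from the baseline are assumptions, not from the source); a small epsilon guards against log(0) in empty bins:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index: sum of (A_i - E_i) * ln(A_i / E_i) over bins.

    Bin edges come from quantiles of the expected (baseline) sample,
    extended to cover the full real line.
    """
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside baseline range
    e_prop = np.histogram(expected, bins=edges)[0] / len(expected)
    a_prop = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # avoid division by zero / log(0) for empty bins
    e = np.clip(e_prop, eps, None)
    a = np.clip(a_prop, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))
```

Comparing a sample against itself yields a PSI of 0; a one-standard-deviation mean shift on a normal feature lands well above the 0.25 "major shift" threshold.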
Numeric range checks:
```python
for col, (min_val, max_val) in expected_ranges.items():
    assert df[col].between(min_val, max_val).all(), f"{col} out of range"
```
Missing Value Checks
```python
for col, max_missing_rate in missing_thresholds.items():
    actual = df[col].isnull().mean()
    assert actual <= max_missing_rate, f"{col}: {actual:.1%} missing > threshold {max_missing_rate:.1%}"
```
Referential and Business Rule Checks
```python
# Date ordering
assert (df['end_date'] > df['start_date']).all()

# Non-negative values
assert (df['premium'] >= 0).all()

# No duplicate primary keys
assert df['policy_id'].is_unique
```
Great Expectations (Framework)
Great Expectations is a declarative data validation framework:
```python
import great_expectations as gx

# Note: the Great Expectations API has changed significantly across versions;
# this sketch follows the fluent API (~0.16/0.17). Consult the docs for your version.
context = gx.get_context()
validator = context.sources.pandas_default.read_dataframe(df)

# Add expectations
validator.expect_column_values_to_not_be_null("age")
validator.expect_column_values_to_be_between("age", min_value=18, max_value=120)
validator.expect_column_values_to_be_in_set("gender", ["M", "F", "Unknown"])

# Validate
results = validator.validate()
```
Expectations can be auto-generated by profiling training data, then serve as contracts for new batches.
Applications
- Gate new batch data before retraining: reject batches with PSI > 0.25 on key features.
- Validate model inputs at serving time: return a fallback prediction or error when inputs are out-of-distribution.
- CI/CD data tests: run validation suite as part of the ML pipeline as a quality gate step.
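The serving-time application above can be sketched as a guard around prediction. Everything here is a hypothetical illustration: `predict_with_guard`, `EXPECTED_RANGES`, the fallback value, and the `model` interface are assumed names, not from any library:

```python
FALLBACK_PREDICTION = 0.0  # hypothetical conservative default

# Assumed per-feature valid ranges, e.g. derived from training data
EXPECTED_RANGES = {"age": (18, 120), "income": (0, 1e7)}

def predict_with_guard(model, features: dict) -> float:
    """Score only when all inputs pass range checks; otherwise fall back."""
    for col, (lo, hi) in EXPECTED_RANGES.items():
        value = features.get(col)
        if value is None or not (lo <= value <= hi):
            return FALLBACK_PREDICTION  # out-of-distribution or missing input
    return model.predict(features)
```

Returning a fallback keeps the service available while flagging the rejected request for monitoring; raising an error instead is the stricter alternative.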
Trade-offs
- Strict validation can cause unnecessary failures when legitimate distributional shifts occur (e.g., seasonal patterns); tune thresholds carefully.
- Validation is not a substitute for monitoring in production — it checks inputs, not model outputs.