Data Validation

Definition

The systematic verification that data meets expected structural and statistical properties before it enters a model pipeline, preventing silent failures from corrupt, shifted, or malformed inputs.

Intuition

Training a model on bad data creates a model that is confidently wrong. Data validation adds a quality gate: the pipeline asserts that inputs conform to a schema, that distributions match expectations, and that no forbidden values or unexpected patterns are present. Catching problems early is far cheaper than debugging a misbehaving model in production.

Formal Description

Schema Validation

Assert column types, presence, and allowed values:

import pandera as pa
 
schema = pa.DataFrameSchema({
    'age':        pa.Column(float, pa.Check.between(18, 120)),
    'income':     pa.Column(float, pa.Check.greater_than_or_equal_to(0)),
    'gender':     pa.Column(str,   pa.Check.isin(['M', 'F', 'Unknown'])),
    'claim_flag': pa.Column(int,   pa.Check.isin([0, 1])),
})
 
validated_df = schema.validate(df)  # raises SchemaError on failure

Statistical Distribution Checks

Verify that feature distributions match the training-time baseline.

Kolmogorov–Smirnov (KS) test — two-sample test for distribution shift:

from scipy.stats import ks_2samp
 
stat, p_value = ks_2samp(train['age'], new_data['age'])
# p < 0.05: statistically significant shift
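In practice this check runs over many features at once, so the per-feature p-values need a multiple-testing correction. A minimal sketch (the `drifted_features` helper and the dict-of-arrays layout are illustrative, not from any library):

```python
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(train, new, alpha=0.05):
    """Return the features whose two-sample KS test rejects
    'same distribution', Bonferroni-correcting alpha for the
    number of features tested.

    train/new: dicts mapping feature name -> 1-D numpy array.
    """
    threshold = alpha / len(train)
    return [name for name, ref in train.items()
            if ks_2samp(ref, new[name]).pvalue < threshold]
```

A batch that flags any feature can then be routed to manual review rather than straight into retraining.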

Population Stability Index (PSI) — widely used in banking/insurance:

PSI = Σ_i (A_i − E_i) · ln(A_i / E_i)

where A_i and E_i are the actual and expected proportions of observations in bin i.

PSI value     Interpretation
< 0.1         No significant shift
0.1 – 0.25    Moderate shift, investigate
> 0.25        Major shift, do not use model
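The PSI formula can be computed directly from two samples; a minimal NumPy sketch, where the `psi` helper, the quantile binning, and the `eps` floor are all implementation choices rather than a standard API:

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-4):
    """Population Stability Index of `actual` against `expected`.

    Bin edges are quantiles of the expected (training) sample, so each
    expected bin holds roughly 1/bins of the data; `eps` floors the
    proportions to avoid log(0) on empty bins. Assumes continuous features.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip both samples into the training range so every value lands in a bin
    e_counts = np.histogram(np.clip(expected, edges[0], edges[-1]), bins=edges)[0]
    a_counts = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)[0]
    e_prop = np.clip(e_counts / len(expected), eps, None)
    a_prop = np.clip(a_counts / len(actual), eps, None)
    return float(np.sum((a_prop - e_prop) * np.log(a_prop / e_prop)))
```

With this, the table's thresholds become direct assertions, e.g. reject a batch when `psi(train_col, new_col) > 0.25`.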

Numeric range checks:

for col, (min_val, max_val) in expected_ranges.items():
    assert df[col].between(min_val, max_val).all(), f"{col} out of range"

Missing Value Checks

for col, max_missing_rate in missing_thresholds.items():
    actual = df[col].isnull().mean()
    assert actual <= max_missing_rate, f"{col}: {actual:.1%} missing > threshold {max_missing_rate:.1%}"

Referential and Business Rule Checks

# Date ordering
assert (df['end_date'] > df['start_date']).all()
 
# Non-negative values
assert (df['premium'] >= 0).all()
 
# No duplicate primary keys
assert df['policy_id'].is_unique
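Bare asserts like these stop at the first failure; for batch validation it is usually more useful to collect every violation and report them together. A minimal sketch (the `run_checks` helper and the check names are illustrative):

```python
import pandas as pd

def run_checks(df, checks):
    """Evaluate every named boolean check and return the names of the
    ones that failed, so one bad batch reports all its problems at once."""
    return [name for name, check in checks.items() if not check(df)]

checks = {
    "end_date after start_date": lambda d: (d["end_date"] > d["start_date"]).all(),
    "premium non-negative":      lambda d: (d["premium"] >= 0).all(),
    "policy_id unique":          lambda d: d["policy_id"].is_unique,
}
```

The returned list of failed check names can go straight into a pipeline log or alert.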

Great Expectations (Framework)

Great Expectations is a declarative data validation framework:

import great_expectations as ge

# Legacy "pandas dataset" API (GX < 0.18); newer releases build a
# validator through a data context instead, so check your installed version.
ge_df = ge.from_pandas(df)

# Add expectations (each call evaluates immediately and records the expectation)
ge_df.expect_column_values_to_not_be_null("age")
ge_df.expect_column_values_to_be_between("age", min_value=18, max_value=120)
ge_df.expect_column_values_to_be_in_set("gender", ["M", "F", "Unknown"])

# Validate the whole recorded suite against the data
results = ge_df.validate()  # results.success is the overall verdict

Expectations can be auto-generated by profiling training data, then serve as contracts for new batches.

Applications

  • Gate new batch data before retraining: reject batches with PSI > 0.25 on key features.
  • Validate model inputs at serving time: return a fallback prediction or error when inputs are out-of-distribution.
  • CI/CD data tests: run validation suite as part of the ML pipeline as a quality gate step.
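The serving-time fallback in the second bullet can be as simple as a range guard in front of the model. A sketch in which `model_predict`, `expected_ranges`, and the fallback value are all hypothetical names:

```python
def predict_with_guard(model_predict, features, expected_ranges, fallback=None):
    """Return the model's prediction only when every feature lies inside
    its training range; otherwise return the fallback (e.g. a base rate).

    expected_ranges: dict mapping feature name -> (min, max) from training.
    """
    for name, (lo, hi) in expected_ranges.items():
        value = features.get(name)
        if value is None or not (lo <= value <= hi):
            return fallback  # out-of-distribution input: don't trust the model
    return model_predict(features)
```

Logging which feature triggered the fallback turns the same guard into a serving-time drift monitor.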

Trade-offs

  • Strict validation can cause unnecessary failures when legitimate distributional shifts occur (e.g., seasonal patterns); tune thresholds carefully.
  • Validation is not a substitute for monitoring in production — it checks inputs, not model outputs.