Model Risk Considerations
Definition
Model risk is the potential for adverse consequences arising from decisions based on incorrect or misused models. Model risk management (MRM) encompasses policies, processes, and controls that govern model development, validation, deployment, and monitoring throughout a model’s lifecycle.
Intuition
Every deployed model is an approximation of reality built on historical data. The world can drift, the training data can be unrepresentative, and the model can be used outside its intended scope. Without governance structures, model failures may go undetected until they cause financial loss, regulatory action, reputational damage, or harm to individuals.
Formal Description
Model Risk Sources
| Source | Description |
|---|---|
| Conceptual error | Wrong modelling approach for the problem (e.g., linear model for highly non-linear phenomena) |
| Data quality | Biased, incomplete, stale, or incorrectly processed training data |
| Implementation error | Bugs in feature pipelines, scoring code, or deployment infrastructure |
| Overfitting / underfitting | Poor generalization from training to production population |
| Concept drift | Statistical distribution of inputs or $P(Y |
| Scope creep | Model used for purposes beyond its validated intended use |
Regulatory Frameworks
SR 11-7 (Federal Reserve / OCC, 2011): the foundational US regulatory guidance on model risk management for banks. Defines a model as a quantitative method with inputs, processing, and outputs. Requires:
- Model development with sound methodology, conceptual review, and documentation
- Independent model validation — separate team reviews conceptual soundness, ongoing monitoring, and outcomes analysis
- Effective challenge: validators must have sufficient authority, incentive, and expertise to critically assess models
- Governance: inventory, tiering by risk, periodic revalidation schedule
EU AI Act (2024): risk-based regulation. High-risk AI systems (credit scoring, employment, critical infrastructure, law enforcement) require conformity assessment, technical documentation, human oversight provisions, transparency to affected persons, and registration in an EU database. Prohibited AI: social scoring by public authorities, real-time biometric identification in public spaces (with exceptions).
Validation Framework
Conceptual soundness review: Does the theoretical basis support the use case? Are assumptions reasonable and documented?
Statistical backtesting: evaluate model performance on a holdout period or an out-of-time sample. Metrics depend on use case:
- Classification: AUROC, Gini coefficient, Kolmogorov-Smirnov statistic, PSI (Population Stability Index)
- Regression: RMSE, MAE, bias
Population Stability Index (PSI): measures distributional shift in input features between development and current population. . Thresholds: PSI < 0.1 (stable), 0.1–0.2 (monitor), > 0.2 (investigate/retrain).
Outcome analysis: compare actual outcomes vs. predicted; track calibration, discrimination, and performance over time segments.
Concept Drift
Covariate shift: changes but stays the same. Solution: reweigh samples or retrain. Label drift: changes (the relationship between features and outcome shifts). Requires model retraining. Harder to detect early because true labels often arrive with delay. Concept drift detection methods: Page-Hinkley test, ADWIN (Adaptive Windowing), CUSUM, drift detectors in River/scikit-multiflow.
Model Governance Lifecycle
- Development: documented methodology, data lineage, assumptions, limitations
- Independent validation: conceptual review, benchmarking, sensitivity analysis, backtesting
- Approval and tiering: risk-based classification (tier 1–3 by materiality)
- Deployment: champion/challenger setup; shadow mode evaluation before full rollout
- Ongoing monitoring: scheduled performance reports, drift alerts, exception escalation
- Periodic revalidation: triggered by material changes (new data, new use case, significant drift)
- Retirement: decommission process; retain documentation per record-keeping requirements
Applications
- Credit risk models (PD/LGD/EAD, IFRS 9 ECL models) — SR 11-7 primary domain
- Stress testing models (CCAR/DFAST)
- Any ML model used in hiring, lending, benefits eligibility, or medical decisions — EU AI Act high-risk category
- Fraud detection, AML transaction monitoring
Trade-offs
- Heavier governance slows innovation; lighter governance increases risk — calibrate by tier/materiality
- Independent validation adds cost and time but is essential for detecting implementation errors and conceptual gaps
- Automated drift monitoring reduces manual review burden but can produce false alarms requiring triage