Churn Prediction

Problem

Identify subscription customers who are at elevated risk of cancelling within a defined forward window (typically 30, 60, or 90 days) so that the business can intervene with retention offers, proactive customer success contact, or product improvements. Churn has high economic impact: commonly cited estimates put the cost of acquiring a new customer at 5–25× the cost of retaining an existing one, and customer lifetime value (CLV) models depend heavily on retention-rate assumptions.

The problem has two distinct variants:

  1. Voluntary churn: Customer actively cancels or fails to renew — addressable through intervention
  2. Involuntary churn: Payment failure (credit card expiry, insufficient funds) — addressable through payment recovery workflows

Most churn models target voluntary churn at the individual customer level.

Users / Stakeholders

| Role | Decision driven by churn score |
| --- | --- |
| Customer success manager | Prioritise outreach queue; decide which at-risk accounts to call |
| Marketing manager | Design and trigger retention email/SMS campaigns |
| Product manager | Identify product friction signals correlated with churn |
| Finance | Revenue forecasting; cohort-based churn rate projections |
| Executive | Board-level NRR (Net Revenue Retention) reporting |

The primary consumer is an operational CRM system (Salesforce, HubSpot) that triggers automated workflows or populates a human review queue. Scores need to be refreshed weekly or daily depending on intervention latency.

Domain Context

  • SaaS/subscription specifics: Churn is only observable at the renewal date or at an explicit cancel event. Right-censoring: an active customer whose renewal window has not yet closed has an unknown outcome, not a non-churn outcome. Label them as non-churn only once the window closes.
  • Class imbalance: Monthly churn rates are 2–8% in B2C and 0.5–3% in B2B SaaS. Imbalance this severe calls for class weighting or resampling during training, followed by calibration of the resulting scores. Precision-recall metrics (AUC-PR, precision@k) are more informative than AUC-ROC as the primary metric.
  • Intervention validity window: Contacting a customer who is 5 minutes from churning is too late. The model must predict far enough in advance for intervention to be effective (typically 30–90 days).
  • GDPR / data minimisation: Behavioural data (session logs, feature usage) may require explicit consent or legitimate interest basis. Some markets restrict automated individual-level retention scoring without human oversight.
  • B2B vs B2C: B2B churn is account-level, influenced by champion departure (key user leaves), contract expansion/contraction, organisational changes. Multiple users per account — aggregate signals needed.
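
The right-censoring rule above can be made concrete with a small labelling helper. This is a minimal sketch, not a definitive implementation: the function name `churn_label` and its parameters (`as_of`, `horizon_days`, `churn_date`, `data_end`) are hypothetical, and it assumes one snapshot date per customer row.

```python
from datetime import date, timedelta
from typing import Optional


def churn_label(as_of: date, horizon_days: int,
                churn_date: Optional[date], data_end: date) -> Optional[int]:
    """Return 1 or 0 for a customer snapshot, or None if no label can be assigned.

    as_of        snapshot date at which features are computed
    horizon_days forward window length (e.g. 30, 60, or 90)
    churn_date   observed churn date, or None if the customer is still active
    data_end     last date for which outcomes have been observed
    """
    window_end = as_of + timedelta(days=horizon_days)

    # Customers who had already churned at the snapshot are out of scope.
    if churn_date is not None and churn_date <= as_of:
        return None

    # The forward window has not closed yet: the outcome is censored,
    # so the snapshot is excluded rather than labelled non-churn.
    if window_end > data_end:
        return None

    # Window fully observed: positive only if churn fell inside it.
    if churn_date is not None and churn_date <= window_end:
        return 1
    return 0
```

A snapshot taken on 15 June with a 30-day window and outcomes observed only to 30 June returns `None`, so it is left out of training instead of being counted as a retained customer.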

Inputs and Outputs

Feature categories:

Recency: days_since_last_login, days_since_last_purchase, days_to_renewal
Frequency: logins_last_30d, sessions_last_90d, feature_use_count
Monetary: mrr, total_spend_12m, discount_depth, payment_method_type
Engagement: nps_score, support_tickets_open, tickets_last_30d, email_open_rate
Product usage: feature_A_activations, api_calls, reports_generated, integrations_active
Account health: seats_utilised / seats_purchased, admin_last_login, sso_enabled
Lifecycle: months_since_signup, plan_type, last_plan_change_direction, contract_end_date
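
As an illustration of how the recency and frequency features might be derived, the sketch below computes two of them from a list of login dates, using only events on or before a snapshot date. The function name `recency_frequency` and the input shape (a plain list of dates) are assumptions for this example; a production pipeline would more likely run the same logic in SQL or a dataframe library.

```python
from datetime import date, timedelta


def recency_frequency(login_dates, as_of, window_days=30):
    """Compute recency and frequency features as of a snapshot date.

    Events after `as_of` are ignored, so the features never see
    information from the prediction window.
    """
    visible = [d for d in login_dates if d <= as_of]
    window_start = as_of - timedelta(days=window_days)
    return {
        # None when the customer has never logged in before the snapshot.
        "days_since_last_login": (as_of - max(visible)).days if visible else None,
        # Count logins in the half-open interval (window_start, as_of].
        f"logins_last_{window_days}d": sum(1 for d in visible if d > window_start),
    }


logins = [date(2024, 1, 2), date(2024, 1, 25), date(2024, 2, 10)]
features = recency_frequency(logins, as_of=date(2024, 1, 31))
```

Here the login on 10 February is dropped because it falls after the snapshot date.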

Output:

churn_probability:   P(churn within 30/60/90 days) ∈ [0, 1]
churn_risk_tier:     HIGH / MEDIUM / LOW  (operational segmentation)
top_churn_reasons:   SHAP top-3 features driving prediction (for CSM context)
recommended_action:  RETENTION_CALL / EMAIL_CAMPAIGN / NO_ACTION

Decision or Workflow Role

Weekly batch score refresh (or daily for high-value accounts)
  ↓
Scores pushed to CRM (Salesforce / HubSpot) via API
  ↓
HIGH risk (P > 0.4) → CS team action queue + manager alert
MEDIUM risk → automated email sequence triggered
LOW risk → no action
  ↓
Intervention outcome logged → feedback into training data
  ↓
Holdout control group → measure lift of intervention

Churn prediction is an input to a tiered intervention strategy. The economic value is measured as revenue saved by successful retention, minus intervention cost.
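
The tiering step and the value calculation can be sketched as follows. The 0.4 cut-off for HIGH comes from the workflow above; the MEDIUM cut-off of 0.15, the tier-to-action mapping, and all function names are illustrative assumptions, since the source does not specify them.

```python
HIGH_THRESHOLD = 0.4     # from the workflow: HIGH risk is P > 0.4
MEDIUM_THRESHOLD = 0.15  # hypothetical value; not given in the source

# Assumed mapping from risk tier to recommended_action.
ACTION_BY_TIER = {
    "HIGH": "RETENTION_CALL",
    "MEDIUM": "EMAIL_CAMPAIGN",
    "LOW": "NO_ACTION",
}


def risk_tier(churn_probability: float) -> str:
    """Map a calibrated churn probability to an operational tier."""
    if churn_probability > HIGH_THRESHOLD:
        return "HIGH"
    if churn_probability > MEDIUM_THRESHOLD:
        return "MEDIUM"
    return "LOW"


def recommended_action(churn_probability: float) -> str:
    """Look up the action for the tier the probability falls into."""
    return ACTION_BY_TIER[risk_tier(churn_probability)]


def net_retention_value(revenue_saved: float, intervention_cost: float) -> float:
    """Revenue saved by successful retention, minus intervention cost."""
    return revenue_saved - intervention_cost
```

In practice the thresholds would be tuned against intervention capacity and measured lift rather than fixed in code.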

Modeling / System Options

| Approach | Strength | Weakness | When to use |
| --- | --- | --- | --- |
| Logistic regression + feature engineering | Highly interpretable, fast, calibrated by default | Misses interaction effects | Early stage, limited data, high interpretability requirement |
| LightGBM / XGBoost | Handles interactions, missing values, fast at scale | Requires calibration; less interpretable without SHAP | Standard production choice; tabular data |
| Survival analysis (Cox PH) | Models time-to-event properly; handles right-censoring | Proportional hazards assumption; harder to operationalise | When timing of churn matters as much as whether it happens |
| Neural embedding model | Captures sequence of events; cold-start via embeddings | Data-hungry; complex deployment | High data volume; when event sequence is the key signal |
| Rules-based + ML hybrid | Instant wins from obvious patterns; explainable | Fragile; rule proliferation | Regulated environments where black boxes not permitted |

Recommended: LightGBM with SHAP explanations for score reasoning. Survival analysis (Kaplan-Meier + Cox PH) for cohort-level churn rate forecasting in finance reporting. Calibrate with Platt scaling.
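
For the cohort-level survival view, the Kaplan-Meier estimate is simple enough to sketch directly. This is a from-scratch illustration assuming tenure measured in whole periods (for example months) and a boolean churn flag per customer. The function name `kaplan_meier` is hypothetical, and a maintained survival-analysis library would normally be used instead.

```python
from collections import Counter


def kaplan_meier(durations, churned):
    """Kaplan-Meier survival curve.

    durations  observed tenure per customer (time of churn or of censoring)
    churned    True if the customer churned at that time, False if censored
    Returns a list of (t, S(t)) pairs at each time with at least one churn.
    """
    events = Counter(t for t, c in zip(durations, churned) if c)
    exits = Counter(durations)   # churned or censored: both leave the risk set
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = events.get(t, 0)
        if d:
            # Customers censored at t still count as at risk at t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Censored customers contribute to the risk set up to their last observed period and then drop out, which is how the estimator avoids treating them as retained forever.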

See Tree Ensembles Implementation for code.

Deployment Constraints

  • Latency: Batch weekly/daily — not real-time. Inference time not a constraint.
  • Interpretability: CSMs need to understand why a customer is flagged. SHAP top-3 features per customer are mandatory. “Model says so” is not actionable.
  • Calibration: Scores are shown to business users as probabilities. A poorly calibrated model that outputs 0.8 for 40% of customers, when only a few percent actually churn, destroys trust. Evaluate and enforce calibration via reliability diagrams.
  • Fairness: Scores must not systematically disadvantage customers based on demographic proxies (geography, language, company size) in ways that would constitute discriminatory service.
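  • Volume: Typically 10K–1M customer records. Batch inference at this volume typically completes in seconds to minutes on a single machine. CRM integration via REST API or Salesforce batch upload.
  • Model update cadence: Retrain monthly on new labels. Deploy the challenger only when its AUC-PR on the same time-based validation set is at least that of the current champion model.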
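
The calibration check in the constraints above can be sketched as a binned reliability table plus expected calibration error (ECE). This is a minimal illustration with equal-width bins; the function names and the choice of 10 bins are assumptions, not requirements from the source.

```python
def reliability_bins(probs, labels, n_bins=10):
    """Group predictions into equal-width probability bins.

    Returns one (mean_predicted, observed_churn_rate, count) row per
    non-empty bin: the data behind a reliability diagram.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            rate = sum(y for _, y in members) / len(members)
            rows.append((mean_p, rate, len(members)))
    return rows


def expected_calibration_error(probs, labels, n_bins=10):
    """Count-weighted mean gap between predicted and observed rates."""
    n = len(probs)
    return sum(count / n * abs(mean_p - rate)
               for mean_p, rate, count in reliability_bins(probs, labels, n_bins))


# Five customers all scored 0.8, of whom two churned: a gap of 0.4.
ece = expected_calibration_error([0.8] * 5, [1, 1, 0, 0, 0])
```

An ECE this large would fail the `< 0.05` target by a wide margin, so the scores would need recalibration before being shown as probabilities.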

Risks and Failure Modes

| Risk | Description | Mitigation |
| --- | --- | --- |
| Label leakage | Using features computed after churn event | Strict point-in-time cutoff; validate with time-based split |
| Selection bias in interventions | Past interventions change who would have churned, which biases labels | Track who received interventions; use counterfactual evaluation |
| Right-censoring | Active customers who will churn in future labeled as non-churn | Survival analysis; time-windowed binary labels only |
| Survivorship bias | High-engagement customers are over-represented in old cohorts | Cohort-stratified training data |
| Intervention fatigue | Over-contacting medium-risk customers reduces campaign effectiveness | Score-gated throttle in CRM; holdout experiment design |
| Class imbalance | Low churn rate → model optimises for majority class | Use AUC-PR not AUC-ROC; class weights; calibration post-training |
| Proxy discrimination | Small company size or geography as churn proxy → unfair service | Fairness audit; remove sensitive proxies; equalised odds check |
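
The holdout-based mitigation for intervention bias amounts to comparing churn rates between treated customers and a randomly withheld control group. The sketch below assumes the split was randomised within the same risk tier; the function name `intervention_lift` and its arguments are hypothetical.

```python
def intervention_lift(treated_churned, treated_total,
                      control_churned, control_total):
    """Compare churn in the treated group against a randomised holdout.

    Returns both churn rates plus the absolute and relative reduction
    attributable to the intervention.
    """
    treated_rate = treated_churned / treated_total
    control_rate = control_churned / control_total
    absolute = control_rate - treated_rate
    # Relative reduction is undefined if nobody in the holdout churned.
    relative = absolute / control_rate if control_rate > 0 else None
    return {
        "treated_churn_rate": treated_rate,
        "control_churn_rate": control_rate,
        "absolute_reduction": absolute,
        "relative_reduction": relative,
    }


# 30 of 1,000 treated customers churned versus 10 of 200 in the holdout.
result = intervention_lift(30, 1000, 10, 200)
```

A point estimate like this says nothing about statistical significance; with small holdout groups a confidence interval or significance test would be needed before acting on it.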

Success Metrics

| Metric | Target | Notes |
| --- | --- | --- |
| AUC-PR | > 0.45 (vs base rate ~0.05) | Better than AUC-ROC for imbalanced problems |
| Precision@top 10% | > 2× lift over random | Are the highest-risk customers actually churning? |
| Calibration (ECE) | < 0.05 | Scores must be usable as actual probabilities |
| Revenue retained | > 3× intervention cost | True business ROI; requires experiment with holdout |
| CSM adoption rate | > 70% act on flagged accounts | Model utility metric; low adoption means scores aren’t trusted |
| Churn rate reduction | 5–20% relative reduction | Ultimate KPI; may take 6–12 months to measure reliably |
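
The two ranking metrics in the table can be computed without any library, as sketched below. Average precision is used here as the usual step-wise summary of the precision-recall curve, and ties in scores are not treated specially. The function names are assumptions for illustration.

```python
def precision_at_top(probs, labels, fraction=0.10):
    """Share of actual churners among the top `fraction` of scores."""
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(y for _, y in ranked[:k]) / k


def lift_at_top(probs, labels, fraction=0.10):
    """Precision in the top slice divided by the overall churn rate."""
    base_rate = sum(labels) / len(labels)
    return precision_at_top(probs, labels, fraction) / base_rate


def average_precision(probs, labels):
    """Mean of precision values taken at the rank of each true churner."""
    ranked = sorted(zip(probs, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
churned = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
top20_precision = precision_at_top(scores, churned, fraction=0.2)
top20_lift = lift_at_top(scores, churned, fraction=0.2)
ap = average_precision(scores, churned)
```

With a base rate of 0.2 in this toy data, a top-20% precision of 0.5 corresponds to a lift of 2.5 over random selection.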

