Partial Dependence Plots, ICE, and ALE

Definition

Methods for visualizing the marginal effect of one or two features on a model’s predicted outcome: averaged over the data (PDP), shown per instance (ICE), or estimated via local contrasts to avoid extrapolation artifacts (ALE).

Intuition

A partial dependence plot asks: “If I vary only feature x_S, how does the prediction change on average?” It marginalizes over the distribution of all other features. ICE shows the same relationship for each individual training point, revealing heterogeneity that PDP hides by averaging. ALE avoids the extrapolation problem of PDP by conditioning on small intervals rather than marginalizing globally.

Formal Description

Partial Dependence Plot (PDP)

For feature(s) x_S and model f:

    f_S(x_S) = E_{x_C}[ f(x_S, x_C) ] ≈ (1/n) Σ_{i=1}^{n} f(x_S, x_C^(i))

where x_C are the complement features. Implementation: choose a grid of x_S values; for each grid value, substitute it for x_S in all n training instances, compute predictions, and average. Two-way PDPs show interaction effects between two features.

Limitations:

  • Assumes feature independence; when features are correlated, marginalizing over x_C creates unrealistic input combinations (e.g., forcing height=200cm with weight=40kg simultaneously)
  • Shows the average effect; heterogeneous subgroup effects are hidden
  • Computationally O(n · g) model evaluations, where g is the grid size
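The grid-substitution procedure above can be sketched in a few lines of NumPy (a minimal illustration; `partial_dependence` and `ToyModel` are hypothetical names for this sketch, not a library API):

```python
import numpy as np

def partial_dependence(model, X, feature, grid):
    """Manual PDP: for each grid value, overwrite the feature column
    for every row, predict, and average the predictions."""
    pdp = []
    for v in grid:
        X_mod = X.copy()
        X_mod[:, feature] = v          # force the feature to the grid value
        pdp.append(model.predict(X_mod).mean())
    return np.array(pdp)

class ToyModel:
    """Contrived additive model: prediction = 2*x0 + x1."""
    def predict(self, X):
        return 2 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
grid = np.linspace(-2, 2, 5)
pd_vals = partial_dependence(ToyModel(), X, feature=0, grid=grid)
# For this additive model, the PDP of x0 is the line 2*x0 + mean(x1).
```

The loop makes the O(n · g) cost visible: every grid value triggers a full pass of n predictions.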

Individual Conditional Expectation (ICE)

Plot one line per training instance: how does the prediction for instance i change as x_S varies, with x_C^(i) held fixed? The PDP is the mean of all ICE curves.

Centered ICE (c-ICE): subtract each curve’s value at a reference point x* to highlight interaction effects: f_cICE^(i)(x_S) = f(x_S, x_C^(i)) − f(x*, x_C^(i)).

ICE reveals when the PDP average masks heterogeneous effects (some instances have positive, others negative marginal effects).
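A minimal sketch of ICE (with optional centering) for a NumPy feature matrix; `ice_curves` and `InteractionModel` are hypothetical names, and the model is contrived so that averaging hides the per-instance slopes:

```python
import numpy as np

def ice_curves(model, X, feature, grid, center=False):
    """One curve per instance: vary `feature` over `grid`, hold the
    other features at their observed values. If center=True, subtract
    each curve's value at the first grid point (c-ICE)."""
    curves = np.empty((X.shape[0], len(grid)))
    for j, v in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = v
        curves[:, j] = model.predict(X_mod)
    if center:
        curves -= curves[:, [0]]   # anchor every curve at zero
    return curves

class InteractionModel:
    """Contrived interaction: prediction = x0 * sign(x1), so roughly half
    the instances have slope +1 in x0 and half have slope -1."""
    def predict(self, X):
        return X[:, 0] * np.sign(X[:, 1])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.linspace(-2, 2, 9)
curves = ice_curves(InteractionModel(), X, feature=0, grid=grid)
# The PDP (mean of the ICE curves) is nearly flat, yet every individual
# curve has slope +1 or -1 -- exactly the heterogeneity PDP hides.
```

Plotting `curves.T` against `grid` (one line per instance) makes the two opposing subgroups visible at a glance.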


Accumulated Local Effects (ALE)

ALE avoids the unrealistic-combinations problem by estimating feature effects from local differences within narrow, data-conditional intervals:

    f_ALE(x_S) = Σ_{k=1}^{k(x_S)} (1/|N(k)|) Σ_{i ∈ N(k)} [ f(z_k, x_C^(i)) − f(z_{k−1}, x_C^(i)) ] − c

where z_{k−1} and z_k are the interval boundaries and N(k) is the set of training instances whose x_S falls in interval k. ALE accumulates these local differences (hence the name), and the constant c centers the result to have zero mean.

ALE advantages over PDP:

  • Unbiased with correlated features
  • Faster to compute (no full grid substitution)
  • Interpretable as the local effect of changing a feature within its observed range
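The interval-based estimator can be sketched as follows. This is a simplified first-order ALE with quantile bins; `ale_first_order` is a hypothetical name, and the final centering uses an unweighted mean over bin edges, a simplification of the count-weighted centering used by full implementations:

```python
import numpy as np

def ale_first_order(model, X, feature, n_bins=10):
    """Simplified first-order ALE: split the feature's observed range into
    quantile bins, difference predictions across each instance's own bin
    edges, average within bins, accumulate, then center."""
    x = X[:, feature]
    edges = np.unique(np.quantile(x, np.linspace(0, 1, n_bins + 1)))
    # assign each instance to the bin its observed value falls in
    k = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(edges) - 2)
    local_effects = np.zeros(len(edges) - 1)
    for b in range(len(edges) - 1):
        idx = np.where(k == b)[0]
        if idx.size == 0:
            continue
        X_lo, X_hi = X[idx].copy(), X[idx].copy()
        X_lo[:, feature] = edges[b]       # move only to the instance's own
        X_hi[:, feature] = edges[b + 1]   # bin edges: small, realistic shifts
        local_effects[b] = (model.predict(X_hi) - model.predict(X_lo)).mean()
    ale = np.concatenate([[0.0], np.cumsum(local_effects)])  # accumulate
    return edges, ale - ale.mean()        # center to zero mean (simplified)

class ToyModel:
    """Contrived linear model: prediction = 3*x0 + x1."""
    def predict(self, X):
        return 3 * X[:, 0] + X[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
edges, ale = ale_first_order(ToyModel(), X, feature=0)
# For a linear model, the ALE of x0 is a straight line with slope 3.
```

Because each instance is only shifted to the edges of the bin it already occupies, no far-from-data combinations are ever evaluated, and each bin needs just two prediction passes over its members.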

Applications

  • Understanding non-linear feature effects in tree models and neural networks
  • Regulatory explanations (“how does income affect predicted probability of default across its range?”)
  • Model debugging (detecting unexpected non-monotonicities or discontinuities)
  • Two-way PDPs for identifying feature interactions

Trade-offs

Method | Handles correlation | Shows heterogeneity | Speed  | Extrapolation risk
-------|---------------------|---------------------|--------|-------------------
PDP    | No                  | No (averaged)       | Medium | Yes
ICE    | No                  | Yes                 | Medium | Yes
ALE    | Yes                 | No                  | Fast   | No
  • PDPs are widely understood and easy to explain; ALE is preferred when features are correlated
  • ICE plots can be cluttered for large datasets; subsample for visualization
  • All methods show marginal effects, not causal effects; correlation among features further complicates interpretation