Tree Ensembles
Definition
Tree ensembles combine many decision trees to produce predictions that are more accurate and robust than any individual tree. The two dominant approaches are bagging (random forests) and boosting (gradient boosting machines).
Intuition
A single decision tree is unstable: small data changes lead to very different trees. Ensembling averages out this variance. Bagging does this in parallel by training many trees on bootstrap samples; boosting does it sequentially by having each tree correct the errors of its predecessors.
Formal Description
Decision Tree
Greedily partitions feature space by choosing splits that maximise the decrease in an impurity criterion:
Gini impurity (for classification): $G = 1 - \sum_k p_k^2$, where $p_k$ is the proportion of class $k$ in the node.
Entropy: $H = -\sum_k p_k \log_2 p_k$.
MSE (for regression): choose the split minimising the weighted average of child variances.
The algorithm recurses until a stopping criterion (max depth, min samples per leaf, min impurity decrease).
Key hyperparameters: max_depth, min_samples_leaf, max_features.
Inductive bias: axis-aligned splits, so an oblique linear decision boundary in general position must be approximated by a staircase of many splits.
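The two classification criteria can be sketched in a few lines of Python (gini and entropy here are illustrative helpers written for this note, not any library's API):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum_k p_k * log2(p_k) over class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure leaf has zero impurity; a 50/50 binary node is maximally impure.
print(gini([1, 1, 1, 1]))     # 0.0
print(gini([0, 0, 1, 1]))     # 0.5
print(entropy([0, 0, 1, 1]))  # 1.0
```

A split is scored by the parent impurity minus the sample-weighted mean of the two child impurities.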
Random Forest (Bagging)
For $b = 1, \dots, B$ trees:
- Sample $n$ rows with replacement (bootstrap sample).
- At each split, consider only $m$ random features (typically $m \approx \sqrt{p}$ for classification, $p/3$ for regression).
- Grow tree to maximum depth (no pruning).
- Aggregate: majority vote (classification) or mean (regression).
Variance reduction: averaging $B$ i.i.d. trees each with variance $\sigma^2$ gives variance $\sigma^2 / B$ if the trees are uncorrelated; with pairwise correlation $\rho$ the ensemble variance is $\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$, so decorrelating trees matters as much as adding more of them. Feature subsampling decorrelates trees.
Out-of-bag (OOB) error: each sample is out-of-bag for ~37% of trees, since the probability of being excluded from a bootstrap sample of size $n$ is $(1 - 1/n)^n \approx e^{-1} \approx 0.37$; this provides free internal validation.
Feature importance: mean decrease in impurity (MDI) or mean decrease in accuracy via permutation. MDI is fast but biased toward high-cardinality features; permutation importance is more reliable.
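The ~37% OOB figure is easy to verify empirically; the sketch below (arbitrary choices of n and B) just counts how often each row is left out of a bootstrap sample:

```python
import random

random.seed(0)
n, B = 1000, 500          # rows and number of bootstrap samples (trees)
oob_counts = [0] * n      # times each row is out-of-bag

for _ in range(B):
    in_bag = set(random.randrange(n) for _ in range(n))  # bootstrap sample
    for i in range(n):
        if i not in in_bag:
            oob_counts[i] += 1

# Expect roughly (1 - 1/n)^n ≈ e^-1 ≈ 0.37 of trees to miss each row.
frac = sum(oob_counts) / (n * B)
print(round(frac, 3))
```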
Gradient Boosting
Sequentially builds trees, where tree $m$ fits the pseudo-residuals of the current ensemble:
$F_m(x) = F_{m-1}(x) + \eta \, h_m(x),$
where $h_m$ is fit to $r_{im} = -\left.\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right|_{F = F_{m-1}}$ (the negative gradient of the loss).
For MSE loss: pseudo-residuals $r_{im} = y_i - F_{m-1}(x_i)$ (the true residuals). For log-loss: pseudo-residuals $r_{im} = y_i - p_i$ with $p_i = \sigma(F_{m-1}(x_i))$ (the error in probability space).
Shrinkage (learning rate) $\eta$: a smaller $\eta$ requires more trees but generalises better.
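The whole loop can be sketched from scratch for squared loss, using regression stumps as the weak learner (fit_stump and the toy 1-D data are illustrative, not any library's API):

```python
# Toy 1-D regression problem: y = x^2, noise-free.
X = [i / 10 for i in range(-20, 21)]
y = [x * x for x in X]

def fit_stump(X, y):
    """Best single-split regression stump minimising squared error."""
    best = None
    for t in sorted(set(X)):
        left = [yi for xi, yi in zip(X, y) if xi <= t]
        right = [yi for xi, yi in zip(X, y) if xi > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((yi - ml) ** 2 for yi in left)
               + sum((yi - mr) ** 2 for yi in right))
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    _, t, ml, mr = best
    return lambda x: ml if x <= t else mr

eta, M = 0.3, 200               # shrinkage and number of trees
F = [sum(y) / len(y)] * len(X)  # initial prediction: the mean
ensemble = []
for _ in range(M):
    residuals = [yi - fi for yi, fi in zip(y, F)]   # MSE pseudo-residuals
    h = fit_stump(X, residuals)                     # fit negative gradient
    F = [fi + eta * h(xi) for fi, xi in zip(F, X)]  # shrunken update
    ensemble.append(h)

mse = sum((yi - fi) ** 2 for yi, fi in zip(y, F)) / len(X)
print(mse)  # small: the additive ensemble has fit the curve
```

Swapping the residual computation for $y_i - p_i$ (and adding a sigmoid) gives the log-loss version.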
XGBoost
XGBoost adds a second-order (Newton) approximation and explicit regularisation:
$\mathcal{L}^{(m)} \approx \sum_i \left[ g_i f_m(x_i) + \tfrac{1}{2} h_i f_m(x_i)^2 \right] + \Omega(f_m),$
where $g_i = \partial_{\hat{y}_i} L(y_i, \hat{y}_i)$, $h_i = \partial_{\hat{y}_i}^2 L(y_i, \hat{y}_i)$ (both evaluated at the previous prediction $\hat{y}_i = F_{m-1}(x_i)$), and $\Omega(f) = \gamma T + \tfrac{\lambda}{2} \sum_{j=1}^{T} w_j^2$ penalises tree complexity ($T$ leaves with weights $w_j$).
Optimal leaf weight: $w_j^* = -\dfrac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$; optimal split gain:
$\text{Gain} = \tfrac{1}{2} \left[ \dfrac{G_L^2}{H_L + \lambda} + \dfrac{G_R^2}{H_R + \lambda} - \dfrac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma,$
with $G = \sum_{i \in I} g_i$ and $H = \sum_{i \in I} h_i$ over each candidate child.
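These two formulas are simple enough to evaluate directly; the helpers below (hypothetical names, written for this note) compute them from per-sample gradient and Hessian lists:

```python
def leaf_weight(g, h, lam=1.0):
    """Optimal leaf weight w* = -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain = 1/2 [GL^2/(HL+λ) + GR^2/(HR+λ) - (GL+GR)^2/(HL+HR+λ)] - γ."""
    def score(g, h):
        return sum(g) ** 2 / (sum(h) + lam)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma

# Children pulling in opposite directions yield a large positive gain:
gL, hL = [-1.0, -0.8], [1.0, 1.0]
gR, hR = [0.9, 1.1], [1.0, 1.0]
print(leaf_weight(gL, hL))          # -(-1.8) / (2 + 1) = 0.6
print(split_gain(gL, hL, gR, hR))
```

Note how $\lambda$ shrinks leaf weights toward zero and $\gamma$ sets a minimum gain below which a split is not worth making.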
LightGBM
LightGBM uses GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) for speed, plus leaf-wise (best-first) tree growth instead of level-wise:
- Leaf-wise growth finds deeper, more complex splits that can overfit on small datasets; mitigate with min_child_samples and num_leaves.
- LightGBM is typically 10–20× faster than vanilla GBDT for large datasets.
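As a starting point, a plausible (untuned, illustrative) parameter set for the scikit-learn-style LGBMClassifier/LGBMRegressor interface might look like:

```python
# Illustrative LightGBM parameters; the names match LightGBM's
# scikit-learn-style API, but the values are starting points, not tuned.
lgbm_params = {
    "n_estimators": 1000,
    "learning_rate": 0.05,
    "num_leaves": 63,          # leaf-wise growth: caps model complexity
    "min_child_samples": 50,   # guards against overfitting small leaves
    "subsample": 0.8,
    "colsample_bytree": 0.8,
}
print(sorted(lgbm_params))
```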
Hyperparameter Tuning Guidance
| Hyperparameter | Effect | Range |
|---|---|---|
| n_estimators | More trees = better, up to diminishing returns | 100–5000 |
| max_depth (XGB/RF) | Depth of individual trees | 3–8 |
| num_leaves (LGB) | Max leaves (leaf-wise) | 31–255 |
| learning_rate (GBM) | Shrinkage; lower = better generalisation | 0.01–0.3 |
| subsample | Row subsampling per tree (stochastic GBM) | 0.5–1.0 |
| colsample_bytree | Column subsampling per tree | 0.5–1.0 |
| min_child_weight | Min sum of Hessians in a leaf (regularisation) | 1–300 |
| reg_alpha / reg_lambda | L1/L2 regularisation on leaf weights | 0–10 |
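One practical way to use these ranges is random search; the search space below is a hypothetical sketch (log-uniform for the learning rate, since its effect is multiplicative; uniform elsewhere):

```python
import math
import random

random.seed(0)

def sample_config():
    """Draw one random configuration from the ranges in the table above."""
    return {
        "n_estimators": random.randint(100, 5000),
        "max_depth": random.randint(3, 8),
        "learning_rate": 10 ** random.uniform(-2, math.log10(0.3)),
        "subsample": random.uniform(0.5, 1.0),
        "colsample_bytree": random.uniform(0.5, 1.0),
        "min_child_weight": random.randint(1, 300),
        "reg_lambda": random.uniform(0, 10),
    }

cfg = sample_config()
print(cfg)
```

Each sampled configuration would then be scored by cross-validation and the best retained.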
Applications
- Tabular ML competitions: GBMs dominate (XGBoost, LightGBM, CatBoost)
- Insurance: claim frequency/severity prediction; risk segmentation
- Fraud detection: high recall at low FPR
- Feature importance for regulatory model documentation
Trade-offs
| Model | Pros | Cons |
|---|---|---|
| Decision tree | Interpretable, fast | High variance, low accuracy |
| Random forest | Low variance, robust, OOB estimate | Slower inference, less accurate than GBM |
| GBM (XGB/LGB) | State-of-the-art tabular accuracy | Slow training, many hyperparameters, overfits on small datasets |