Tree Ensembles

Definition

Tree ensembles combine many decision trees to produce predictions that are more accurate and robust than any individual tree. The two dominant approaches are bagging (random forests) and boosting (gradient boosting machines).

Intuition

A single decision tree is unstable: small changes in the data lead to very different trees. Bagging reduces this variance by training many trees in parallel on bootstrap samples and averaging them; boosting works sequentially, with each tree correcting the errors of its predecessors, which also reduces bias.

Formal Description

Decision Tree

Greedily partitions feature space by choosing splits that maximise a purity criterion:

Gini impurity (for classification): $G = \sum_k p_k (1 - p_k) = 1 - \sum_k p_k^2$, where $p_k$ is the proportion of class $k$ in the node.

Entropy: $H = -\sum_k p_k \log_2 p_k$.

MSE (for regression): choose the split minimising the weighted average of child variances.

The algorithm recurses until a stopping criterion (max depth, min samples per leaf, min impurity decrease).
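The greedy split search can be sketched in a few lines of pure Python (a minimal one-feature illustration; `gini` and `best_split` are names introduced here, not library functions):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2 over the class proportions p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Scan thresholds on a single feature; return the split that
    minimises the weighted child impurity (i.e. maximises purity gain)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best = (float("inf"), None)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold can separate identical feature values
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if weighted < best[0]:
            best = (weighted, (xs[i - 1] + xs[i]) / 2)
    return best  # (weighted child impurity, threshold)
```

Here `best_split([1, 2, 3, 4], [0, 0, 1, 1])` finds the threshold 2.5 separating the classes perfectly (weighted impurity 0); a full tree builder applies this search to every feature and recurses on the two children.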

Key hyperparameters: max_depth, min_samples_leaf, max_features.

Inductive bias: axis-aligned splits; a single tree needs many stair-step splits to approximate a linear decision boundary that is not parallel to a feature axis (linear boundaries in general position).

Random Forest (Bagging)

For each of $B$ trees:

  1. Sample rows with replacement (bootstrap sample).
  2. At each split, consider only a random subset of $m \le p$ features (commonly $m = \sqrt{p}$ for classification, $m = p/3$ for regression).
  3. Grow tree to maximum depth (no pruning).
  4. Aggregate: majority vote (classification) or mean (regression).

Variance reduction: averaging $B$ i.i.d. trees, each with variance $\sigma^2$, gives variance $\sigma^2 / B$ if the trees are uncorrelated; with pairwise correlation $\rho$, the variance of the average is $\rho \sigma^2 + \frac{1 - \rho}{B} \sigma^2$. Feature subsampling decorrelates the trees, driving $\rho$ down.
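The variance expression $\rho \sigma^2 + \frac{1-\rho}{B}\sigma^2$ is easy to sanity-check numerically (a small helper written for this note, not a library function):

```python
def ensemble_variance(sigma2, rho, B):
    """Variance of the mean of B identically distributed predictors,
    each with variance sigma2 and pairwise correlation rho:
    rho*sigma2 + (1 - rho)/B * sigma2."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2

# Uncorrelated trees: variance shrinks as 1/B.
print(ensemble_variance(1.0, 0.0, 100))     # 0.01
# Correlated trees: variance is floored at rho*sigma2, however many trees.
print(ensemble_variance(1.0, 0.3, 10_000))  # ~0.30007
```

The second call shows why decorrelation matters: with $\rho = 0.3$, the ensemble variance never drops below $0.3\sigma^2$ no matter how many trees are added.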

Out-of-bag (OOB) error: each sample is out-of-bag for $(1 - 1/n)^n \approx e^{-1} \approx 37\%$ of trees; predicting each sample using only those trees provides free internal validation.
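The ~37% figure follows from $(1 - 1/n)^n \to e^{-1}$ and can be reproduced by simulation (illustrative code, not part of any library):

```python
import random

def oob_fraction(n, n_bootstraps=2000, seed=0):
    """Average fraction of n samples that a bootstrap sample
    (n draws with replacement) never picks up."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_bootstraps):
        drawn = {rng.randrange(n) for _ in range(n)}
        total += (n - len(drawn)) / n
    return total / n_bootstraps

print(oob_fraction(200))  # close to 1/e ~ 0.368
```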

Feature importance: mean decrease in impurity (MDI) or mean decrease in accuracy via permutation. MDI is fast but biased toward high-cardinality features; permutation importance is more reliable.
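Permutation importance is simple enough to sketch model-agnostically; `model_predict` below stands for any fitted prediction function (the names and signature are illustrative, not scikit-learn's API):

```python
import numpy as np

def permutation_importance(model_predict, X, y, n_repeats=5, seed=0):
    """Drop in accuracy after shuffling each column; a larger drop
    means the model relied more on that feature."""
    rng = np.random.default_rng(seed)
    base = np.mean(model_predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            rng.shuffle(Xp[:, j])  # break the column's link to the target
            drops.append(base - np.mean(model_predict(Xp) == y))
        importances[j] = np.mean(drops)
    return importances
```

Shuffling a column destroys its relationship with the target while leaving its marginal distribution intact, so the accuracy drop isolates that feature's contribution.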

Gradient Boosting

Sequentially builds trees, where tree $m$ fits the pseudo-residuals of the current ensemble:

$F_m(x) = F_{m-1}(x) + \eta \, h_m(x)$,

where $h_m$ is fit to $r_{im} = -\left[ \partial L(y_i, F(x_i)) / \partial F(x_i) \right]_{F = F_{m-1}}$ (the negative gradient of the loss).

For MSE loss: pseudo-residuals $= y_i - F_{m-1}(x_i)$ (true residuals). For log-loss: pseudo-residuals $= y_i - p_i$ with $p_i = \sigma(F_{m-1}(x_i))$ (error in probability space).

Shrinkage (learning rate) $\eta \in (0, 1]$: smaller $\eta$ requires more trees but generalises better.
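A minimal MSE gradient boosting loop with regression stumps shows both the pseudo-residual fit and shrinkage (a toy NumPy sketch; `fit_stump` and `gbm_fit` are names invented for this example):

```python
import numpy as np

def fit_stump(x, r):
    """Best single-split regression stump on 1-D feature x, fit to residuals r."""
    best = (np.inf, None, 0.0, 0.0)
    for t in np.unique(x)[:-1]:  # candidate thresholds between sorted values
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda x_new: np.where(x_new <= t, lv, rv)

def gbm_fit(x, y, n_trees=50, lr=0.1):
    """MSE gradient boosting: each stump fits the current residuals
    y - F(x) and is added to the ensemble with shrinkage lr."""
    F = np.full_like(y, y.mean(), dtype=float)
    stumps = []
    for _ in range(n_trees):
        r = y - F                # pseudo-residuals for MSE loss
        h = fit_stump(x, r)
        F = F + lr * h(x)
        stumps.append(h)
    base = y.mean()
    return lambda x_new: base + lr * sum(h(x_new) for h in stumps)
```

Each round fits a stump to the current residuals $y - F(x)$ and adds it scaled by `lr`; on a clean step-function target, the residuals shrink by a factor of $1 - \eta$ per round.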

XGBoost

XGBoost adds a second-order (Newton) approximation of the loss and explicit regularisation. The objective for tree $t$ is

$\mathcal{L}^{(t)} \approx \sum_i \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t)$,

where $g_i = \partial_{\hat{y}^{(t-1)}} \ell(y_i, \hat{y}^{(t-1)})$, $h_i = \partial^2_{\hat{y}^{(t-1)}} \ell(y_i, \hat{y}^{(t-1)})$, and $\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_j w_j^2$ penalises tree complexity ($T$ leaves with weights $w_j$).

Optimal leaf weight: $w_j^* = -G_j / (H_j + \lambda)$, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$; optimal split gain:

$\text{Gain} = \tfrac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$
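The leaf-weight formula $w^* = -G/(H + \lambda)$ and the corresponding split gain reduce to a few lines of arithmetic on the accumulated gradient and Hessian sums (a numeric sketch, not XGBoost's actual implementation):

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda) for gradient sum G,
    Hessian sum H, and L2 penalty lambda."""
    return -G / (H + lam)

def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain = 1/2 [GL^2/(HL+lam) + GR^2/(HR+lam)
                   - (GL+GR)^2/(HL+HR+lam)] - gamma."""
    score = lambda G, H: G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

print(leaf_weight(-2.0, 4.0, 1.0))                # 0.4
print(split_gain(-2.0, 2.0, 2.0, 2.0, 1.0, 0.0))  # ~1.333: opposite-sign gradients => good split
```

A split only happens when the gain exceeds zero, so $\gamma$ acts as a minimum-gain threshold (pre-pruning).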

LightGBM

LightGBM uses GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling) for speed, plus leaf-wise (best-first) tree growth instead of level-wise:

  • Leaf-wise growth finds deeper, more complex splits that can overfit on small datasets; mitigate with min_child_samples and num_leaves.
  • LightGBM is typically 10–20× faster than vanilla GBDT for large datasets.

Hyperparameter Tuning Guidance

| Hyperparameter | Effect | Range |
| --- | --- | --- |
| n_estimators | More trees = better, up to diminishing returns | 100–5000 |
| max_depth (XGB/RF) | Depth of individual trees | 3–8 |
| num_leaves (LGB) | Max leaves (leaf-wise) | 31–255 |
| learning_rate (GBM) | Shrinkage; lower = better generalisation | 0.01–0.3 |
| subsample | Row subsampling per tree (stochastic GBM) | 0.5–1.0 |
| colsample_bytree | Column subsampling per tree | 0.5–1.0 |
| min_child_weight | Min sum of Hessians in a leaf (regularisation) | 1–300 |
| reg_alpha / reg_lambda | L1/L2 regularisation on leaf weights | 0–10 |
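As an illustration only, a starting parameter dictionary consistent with the ranges above might look like this (untuned assumed values, not recommendations for any specific dataset; the key names follow LightGBM's scikit-learn-style API):

```python
# Illustrative starting values within the ranges in the table above;
# always tune against a validation set.
params = {
    "n_estimators": 1000,
    "num_leaves": 63,          # leaf-wise growth: keep well below 2**max_depth
    "learning_rate": 0.05,     # low shrinkage, offset by more trees
    "subsample": 0.8,          # row subsampling per tree
    "colsample_bytree": 0.8,   # column subsampling per tree
    "min_child_weight": 10,    # min sum of Hessians in a leaf
    "reg_alpha": 0.0,          # L1 on leaf weights
    "reg_lambda": 1.0,         # L2 on leaf weights
}
```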

Applications

  • Tabular ML competitions: GBMs dominate (XGBoost, LightGBM, CatBoost)
  • Insurance: claim frequency/severity prediction; risk segmentation
  • Fraud detection: high recall at low FPR
  • Feature importance for regulatory model documentation

Trade-offs

| Model | Pros | Cons |
| --- | --- | --- |
| Decision tree | Interpretable, fast | High variance, low accuracy |
| Random forest | Low variance, robust, OOB estimate | Slower inference, less accurate than GBM |
| GBM (XGB/LGB) | State-of-the-art tabular accuracy | Slow training, many hyperparameters, overfits on small datasets |